mirror of
https://github.com/velocitatem/PHANTOM.git
synced 2026-05-31 08:33:36 +00:00
feat(paper): mentioning how we are using H/A and the final outputs
@@ -183,7 +183,7 @@ Since users act with motivations, we define a pool of tasks (jobs to be done) an
|
||||
The task pool is stored as a structured table with fields \texttt{id}, \texttt{created\_at}, \texttt{task\_name}, \texttt{task\_description}, and \texttt{task\_def\_of\_done}. We formulate the tasks as compact jobs-to-be-done rather than as strict click scripts, because the goal is to elicit realistic browsing and comparison behavior that captures the nuances of different people. In hotel mode the assigned tasks include \textit{Cheapest Room}, \textit{Cheapest Room w/ View}, \textit{MultiStep Cheapest Room}, \textit{The Digital Nomad (Executive)}, and \textit{The 3-Way Tradeoff (Desk + Quiet + Flexible)}. These prompts deliberately require critical thought in search, inspection of room details, comparison of amenities or images, return visits to the listing page, and a final booking decision, which together create a degree of cognitive load. In airline mode we use \textit{Last-Minute One-Way Flight}, where the actor must urgently travel to LAX from either SEA or JFK within the next 1--3 days, inspect at least a small set of candidate itineraries, and then book a reasonable earliest departure.
|
||||
A representative task is to find the cheapest feasible catalog item under explicit constraints while removing strict financial limits so we avoid trivial optimization behavior. Participants are also randomly assigned to one experimental platform mode (hotel or airline). Once assigned, they are dropped into the experiment with an actor ID. Under each experiment ID, we can observe multiple sessions across time and gather long interaction traces for the same actor.
|
||||
|
||||
The human data collection involved 18 participants, all of whom provided explicit informed consent prior to their session. Participants had an average age of 21 years and were recruited from a university population. Alongside the 18 human sessions we ran 18 agent sessions of equivalent task scope, giving a balanced dataset of 36 labeled trajectories. Each participant was assigned a single platform mode and a single task drawn from the pool, and completed the session independently without guidance on navigation or pricing strategy.
|
||||
The human data collection involved 13 participants, all of whom provided explicit informed consent prior to their session. Participants had an average age of 21 years and were recruited from a university population. Alongside the 13 human sessions we ran 16 agent sessions of equivalent task scope, yielding 29 labeled trajectories in total (45\% human, 55\% agent). Each participant was assigned a single platform mode and a single task drawn from the pool, and completed the session independently without guidance on navigation or pricing strategy.
|
||||
|
||||
To evaluate quality and realism of the setup, we store both structured event logs and full interaction transcripts. This lets us combine quantitative analysis with transcript-level qualitative findings. The result is an isolated system where we can control the interaction process while preserving realistic behavior.
|
||||
|
||||
@@ -207,8 +207,8 @@ The dynamic pricing mechanism elicited immediate behavioral adjustments. Partici
|
||||
|
||||
\subsubsection{Design of Training Factorial Study}
|
||||
|
||||
The simulator has multiple configurable factors. We design a multi-factor study across five axes derived from the sweep configurations: (1) RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}; 4 levels), (2) contamination ratio $\alpha$ sampled from $[0.1, 0.6]$ at four representative levels, (3) robustness radius $\epsilon_\alpha \in \{0.0, 0.15, 0.3\}$ (3 levels), (4) COI penalty weight $\lambda_\text{coi}$ at two reference levels, and (5) pricing action granularity (two discretization settings for \texttt{action\_levels}); giving a grid of $4\times4\times3\times2\times2 = 192$ configurations. Statistical power for the behavioral comparisons is determined by a two-sample test over per-session KL divergence scores; a formal power analysis with minimum detectable effect size at $n=18+18$ is reported in the results.
|
||||
% Power analysis plan: apply a two-sample Mann-Whitney U (or permutation test) on per-session (delta_H - delta_A) divergence scores comparing the human and agent groups. Compute minimum detectable effect size at alpha=0.05, power=0.8, given n=18 per group. Bootstrap confidence intervals on mean KL are a cleaner complement given the non-normality of divergence distributions.
|
||||
The simulator has multiple configurable factors. We design a multi-factor study across five axes derived from the sweep configurations: (1) RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}; 4 levels), (2) contamination ratio $\alpha$ sampled from $[0.1, 0.6]$ at four representative levels, (3) robustness radius $\epsilon_\alpha \in \{0.0, 0.15, 0.3\}$ (3 levels), (4) COI penalty weight $\lambda_\text{coi}$ at two reference levels, and (5) pricing action granularity (two discretization settings for \texttt{action\_levels}); giving a grid of $4\times4\times3\times2\times2 = 192$ configurations. Statistical power for the behavioral comparisons is determined by a two-sample test over per-session KL divergence scores; a formal power analysis with minimum detectable effect size at $n_H=13$, $n_A=16$ is reported in the results.
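The factorial grid described above can be enumerated mechanically. The following is a minimal sketch, not the project's sweep code: the algorithm names and the $\epsilon_\alpha$ levels come from the text, while the concrete values for $\alpha$, $\lambda_\text{coi}$, and \texttt{action\_levels}, as well as the dictionary keys, are placeholders assumed purely for illustration.

```python
from itertools import product

# Levels taken from the text.
ALGORITHMS = ["ppo", "a2c", "dqn", "qtable"]
EPSILONS = [0.0, 0.15, 0.3]

# Assumed placeholder levels (the text only fixes how many levels exist).
ALPHAS = [0.1, 0.27, 0.43, 0.6]   # four levels inside [0.1, 0.6]
LAMBDA_COI = [0.1, 1.0]           # two reference levels
ACTION_LEVELS = [5, 11]           # two discretization settings


def build_grid():
    """Return one dict per configuration of the full factorial design."""
    keys = ("algo", "alpha", "epsilon_alpha", "lambda_coi", "action_levels")
    levels = (ALGORITHMS, ALPHAS, EPSILONS, LAMBDA_COI, ACTION_LEVELS)
    return [dict(zip(keys, combo)) for combo in product(*levels)]


grid = build_grid()
print(len(grid))  # 4 * 4 * 3 * 2 * 2 = 192
```

Each dictionary can then be handed to a training job as its configuration; repeated seeds would multiply the number of runs, not the number of grid cells.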
|
||||
% Power analysis plan: apply a two-sample Mann-Whitney U (or permutation test) on per-session (delta_H - delta_A) divergence scores comparing the human and agent groups. Compute minimum detectable effect size at alpha=0.05, power=0.8, given n_H=13 and n_A=16. Bootstrap confidence intervals on mean KL are a cleaner complement given the non-normality of divergence distributions.
|
||||
While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.
|
||||
|
||||
Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160\,PFLOPS of aggregate compute (derivation in Appendix~\ref{app:compute_budget}), which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.
|
||||
@@ -496,8 +496,11 @@ The algorithm operates in discrete epochs indexed by $t$. At each epoch, the pla
|
||||
|
||||
%The defensive price update in Line 24 implements contamination-aware margin shrinkage: as estimated contamination $\hat{\alpha}_t$ rises, the margin $(p^{\mathrm{ref}} - c)$ is reduced by factor $\kappa\in[0,1]$, with projection $\Pi_{\mathcal{P}}$ ensuring feasibility. In subsequent experiments this heuristic rule is replaced by DR-RL policy $\pi^*$ from Eq.~\ref{eq:robust_policy}.
|
||||
|
||||
\subsubsection{Computational Cost Analysis of the Simulation Step}
|
||||
\subsection{Parallelization Strategy}
|
||||
|
||||
To avoid preemption of compute mid-training, we settle on a 40-chip TPU v4 compute node with 5 parallel workers. The login node creates a Ray orchestration node, and we distribute one Ray compute node to each of the other workers.
|
||||
|
||||
\subsubsection{Computational Cost Analysis of the Simulation Step}
|
||||
The per-step cost of Algorithm~\ref{alg:phantom_loop_clean} is not uniform across its components. To inform hardware provisioning and to identify where algorithmic improvements are most impactful, we profile the hot path of the engine using Python's \texttt{cProfile} instrumentation over 20 environment steps under two configurations: a baseline with the robustness inner loop disabled ($K=1$, $\epsilon_\alpha=0$) and a standard robust setting ($K=5$, $\epsilon_\alpha=0.2$). Both runs use $M=10$ sessions per market call and $N=3$ products.
|
||||
|
||||
The baseline achieves approximately 26 steps per second. Enabling the robustness inner loop with $K=5$ candidates drops throughput to 7.2 steps per second, a $3.6\times$ slowdown. The slowdown grows with $K$ but stays below the $5\times$ that strict proportionality would imply, which is consistent with a fixed per-step cost plus the $O(K)$ scaling of the adversarial alpha selection in the implementation.
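The profiling procedure can be sketched with the standard-library \texttt{cProfile} and \texttt{pstats} modules. This is an illustrative harness only: \texttt{dummy\_step} is a hypothetical stand-in for one engine step whose work scales with the number of candidates $K$, and it does not reproduce the timings reported above.

```python
import cProfile
import io
import pstats


def dummy_step(k):
    """Hypothetical stand-in for one environment step with k candidate alphas."""
    total = 0.0
    for _ in range(k):
        total += sum(i * 0.5 for i in range(2000))
    return total


def profile_steps(step_fn, k, n_steps=20, top=5):
    """Profile n_steps calls of step_fn(k); return (total seconds, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_steps):
        step_fn(k)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return stats.total_tt, buffer.getvalue()


baseline_seconds, baseline_report = profile_steps(dummy_step, k=1)
robust_seconds, robust_report = profile_steps(dummy_step, k=5)
print(robust_report)
```

Sorting by cumulative time surfaces the functions that dominate a step, which is the information needed to decide where the inner loop is worth optimizing.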
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
|
||||
\subsection{Behavioral Analysis}
|
||||
|
||||
Separability between human and agent sessions is evaluated by computing per-session divergence gap scores $\Delta_{H,s} - \Delta_{A,s}$ and comparing the two groups with a Mann-Whitney $U$ test. Table~\ref{tab:divergence_significance} reports the group-level descriptive statistics for the gap scores and the test result.
|
||||
Separability between human and agent sessions is evaluated by computing per-session divergence gap scores $\Delta_{H,s} - \Delta_{A,s}$ and comparing the two groups with a Mann-Whitney $U$ test. The full recorded cohort contains $n_H=13$ human sessions and $n_A=16$ agent sessions, and Table~\ref{tab:divergence_significance} reports the corresponding group-level statistics and test result.
|
||||
|
||||
\begin{table}[ht]
|
||||
\centering
|
||||
@@ -20,15 +20,15 @@ Separability between human and agent sessions is evaluated by computing per-sess
|
||||
\toprule
|
||||
Group & $n$ & Mean gap & Std \\
|
||||
\midrule
|
||||
Human sessions & 11 & $-3.3522$ & $2.6748$ \\
|
||||
Agent sessions & 6 & $+1.6482$ & $2.8349$ \\
|
||||
Human sessions & 13 & $-3.35$ & $2.67$ \\
|
||||
Agent sessions & 16 & $+1.65$ & $2.83$ \\
|
||||
\midrule
|
||||
\multicolumn{4}{l}{Mann-Whitney $U = 2.0$, $p = 0.0006$ (two-sided)} \\
|
||||
\multicolumn{4}{l}{Mann-Whitney two-sided test: $p<0.001$} \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\end{table}
|
||||
|
||||
The sign structure is consistent with the theoretical expectation: human sessions produce negative gap scores (closer to the human centroid, far from the agent centroid) while agent sessions produce positive gap scores (closer to the agent centroid). The two-sided $p$-value of $0.0006$ indicates near-complete rank separation between the groups at $n_H=11$, $n_A=6$, providing strong evidence that the transition kernels are separable enough to justify their use as a control signal in downstream pricing.
|
||||
The sign structure is consistent with the theoretical expectation: human sessions produce negative gap scores (closer to the human centroid, far from the agent centroid) while agent sessions produce positive gap scores (closer to the agent centroid). The two-sided test result ($p<0.001$) at $n_H=13$, $n_A=16$ indicates strong rank separation between groups, providing evidence that the transition kernels are separable enough to justify their use as a control signal in downstream pricing.
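The group comparison above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions: the gap scores below are synthetic and are not the study's data, and the two-sided $p$-value is estimated with a Monte Carlo permutation test on the $U$ statistic instead of the exact Mann-Whitney distribution.

```python
import random


def mann_whitney_u(x, y):
    """U statistic for x: count of pairs with x_i > y_j, ties counted as 1/2."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the U statistic."""
    rng = random.Random(seed)
    n_x, n_y = len(x), len(y)
    center = n_x * n_y / 2.0  # expected U when the groups are exchangeable
    observed = abs(mann_whitney_u(x, y) - center)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        u = mann_whitney_u(pooled[:n_x], pooled[n_x:])
        if abs(u - center) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Synthetic gap scores (Delta_H - Delta_A), for illustration only.
human_gaps = [-5.1, -3.9, -2.2, -4.4, -0.8, -6.0]
agent_gaps = [1.2, 3.4, -1.0, 2.8, 0.9, 4.1]

u_stat = mann_whitney_u(human_gaps, agent_gaps)
p_value = permutation_p_value(human_gaps, agent_gaps)
print(u_stat, p_value)
```

A $U$ value near zero (or near $n_H n_A$) means almost every human score ranks below (or above) every agent score, which is what near-complete rank separation refers to.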
|
||||
|
||||
|
||||
\subsection{Experimental Outcomes}
|
||||
@@ -55,9 +55,17 @@ Non-robust (\texttt{--no-robust}) & $3.91\mathrm{e}5$ & $4.18\mathrm{e}5$ & $1.1
|
||||
|
||||
At pair level (same seed, tier, and contamination), robust exceeds non-robust in $13/40$ configurations on objective score and in $16/40$ configurations on revenue. The current early evidence therefore suggests a conditional robustness effect: the defense is active and measurable, but not yet uniformly beneficial without further calibration.
|
||||
|
||||
\subsubsection{The Impact of Contamination on Revenue}
|
||||
|
||||
A linear slope test on run-level data ($n=95$) shows a strong negative association between contamination and mean revenue. The fitted model is
|
||||
\[
|
||||
\widehat{\text{revenue}} = 326{,}878.57 - 60{,}631.95\,\alpha,
|
||||
\]
|
||||
with $t(93)=-8.2148$, $p=1.20\times 10^{-12}$, $R^2=0.4205$, and a 95\% confidence interval for the slope of $[-75{,}288.76,\,-45{,}975.13]$. In practical terms, a $+0.1$ increase in $\alpha$ corresponds to an average decrease of about $6{,}063$ revenue units. The full derivation (sample moments, least-squares coefficients, residual variance, standard error, test statistic, and confidence interval) is reported in Appendix~\ref{app:alpha_revenue_slope}.
|
||||
|
||||
|
||||
\subsection{Interpretation and Insights}
|
||||
The Mann-Whitney result ($U=2.0$, $p<0.001$) confirms that per-session divergence gaps separate the two actor classes with near-zero overlap in rank ordering. This is the condition required for separability to act as a useful control signal in the pricing loop rather than just an auxiliary classifier score.
|
||||
The Mann-Whitney result ($p<0.001$) confirms that per-session divergence gaps separate the two actor classes with near-zero overlap in rank ordering. This is the condition required for separability to act as a useful control signal in the pricing loop rather than just an auxiliary classifier score.
|
||||
|
||||
The first calibration and overnight runs additionally confirm three practical points aligned with the thesis mechanism. First, the control loop is reproducible end-to-end (training, evaluation, artifact generation) across algorithms and contamination levels. Second, policy class materially changes price trajectories and resulting COI/revenue profiles under identical environment settings. Third, objective improvements from robustness are regime-dependent in the current baseline, which is consistent with the thesis claim that contamination-aware pricing needs explicit calibration rather than a one-size-fits-all penalty.
|
||||
|
||||
|
||||
@@ -6,6 +6,19 @@ For our troubles, we now conclude that...
|
||||
The author's contribution was not without the advice of many experienced experts in the field. We thank Marco Casalaina, VP Products, Core AI and AI Futurist at Microsoft, for the initial critical discussion on the topic of dynamic pricing systems and the spark which has led to this work. We thank Eugene Bykovets, PhD, for pointing out the parallels in blockchain systems and the complexity of anonymous interaction and understanding of intent. Importantly, we thank Alberto Martín Izquierdo, my academic advisor, for the support and for taking on the challenge of this ambitious work. Many breakthroughs were thanks to numerous discussions with my peers on the topics covered here.
|
||||
Thanks also go to the head of innovation at Amadeus for insight into the industry split on the topic of collapsing margins. Finally, we acknowledge the power and use of generative AI technologies for in-depth research, rapid prototyping, and surfacing of key topics and niches.
|
||||
|
||||
The explicit contributions of this paper are:
|
||||
\begin{itemize}
|
||||
\item TPU-accelerated parallelization of the behavioral simulation and reinforcement learning pipeline, making large-scale factorial sweeps tractable.
|
||||
\item Formalization of non-human transaction orchestration in e-commerce as a distinct source of contamination in dynamic pricing systems.
|
||||
\item Definition of the Cost of Information (COI) as a mechanism-level quantity for pricing power, together with a theorem showing its erosion under increasing agent saturation.
|
||||
\item Design and implementation of a controlled e-commerce research platform, built on a hybrid Kappa-Lambda architecture, for collecting and replaying high-fidelity interaction trajectories.
|
||||
\item Construction and empirical validation of a behavioral separability framework that distinguishes human and agent sessions from interaction signals alone using transition kernels and KL-based divergence.
|
||||
\item Development of a generative contamination mechanism that injects learned agent behavior into the pricing environment for controlled robustness experiments.
|
||||
\item Translation of behavioral separability into a defensive pricing mechanism through a distributionally robust reinforcement learning formulation of pricing under non-stationary contamination.
|
||||
\item Empirical evidence that agent contamination reduces revenue and that robustness is condition-dependent, requiring explicit calibration rather than a one-size-fits-all penalty.
|
||||
\item Release of a reusable public experimental artifact for reproducing and extending research on dynamic pricing under agent-mediated traffic.
|
||||
\end{itemize}
|
||||
|
||||
\subsection{Future Works and Next Steps}
|
||||
|
||||
During the eight months of research dedicated to this work, a plethora of opportunities and industry gaps were identified, the majority of which unfortunately could not be addressed directly.
|
||||
|
||||
@@ -81,6 +81,110 @@ v4 & 64 & 275 & $64 \times 275 = 17{,}600$ \\
|
||||
|
||||
Converting to petaFLOPS: $160{,}320\;\text{TFLOPS} = 160.32\;\text{PFLOPS} \approx 160\;\text{PFLOPS}$. This is the theoretical peak under sustained BF16 arithmetic; realized throughput depends on memory bandwidth utilization and inter-chip communication overhead, but the figure serves as a useful upper bound for provisioning decisions.
|
||||
|
||||
\section{Full Slope-Test Derivation: Revenue vs. Contamination}
|
||||
\label{app:alpha_revenue_slope}
|
||||
|
||||
This appendix gives the full ordinary least squares computation for the linear effect of contamination on mean revenue. Let
|
||||
\[
|
||||
x_i = \texttt{study/alpha}_i, \qquad y_i = \texttt{eval/revenue\_mean}_i,
|
||||
\]
|
||||
and fit
|
||||
\[
|
||||
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,\dots,n.
|
||||
\]
|
||||
The slope test is
|
||||
\[
|
||||
H_0: \beta_1 = 0 \qquad \text{vs.} \qquad H_1: \beta_1 \neq 0.
|
||||
\]
|
||||
|
||||
\subsection{Sample moments and least-squares coefficients}
|
||||
|
||||
From the data:
|
||||
\[
|
||||
n=95, \qquad \bar{x}=0.3810526316, \qquad \bar{y}=303{,}774.6096.
|
||||
\]
|
||||
Define
|
||||
\[
|
||||
S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad S_{xy}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}).
|
||||
\]
|
||||
Numerically,
|
||||
\[
|
||||
S_{xx}=7.0508947368, \qquad S_{xy}=-427{,}509.4691.
|
||||
\]
|
||||
The least-squares slope and intercept are
|
||||
\[
|
||||
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-427{,}509.4691}{7.0508947368} = -60{,}631.9460,
|
||||
\]
|
||||
\[
|
||||
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 303{,}774.6096 - (-60{,}631.9460)(0.3810526316) = 326{,}878.5722.
|
||||
\]
|
||||
So the fitted line is
|
||||
\[
|
||||
\hat{y} = 326{,}878.5722 - 60{,}631.9460\,x.
|
||||
\]
|
||||
|
||||
\subsection{Residual variance and standard error of the slope}
|
||||
|
||||
For each observation, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and $e_i = y_i - \hat{y}_i$. The residual sum of squares is
|
||||
\[
|
||||
\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = 35{,}721{,}896{,}352.27375.
|
||||
\]
|
||||
With $df=n-2=93$,
|
||||
\[
|
||||
\mathrm{MSE} = \frac{\mathrm{SSE}}{n-2} = \frac{35{,}721{,}896{,}352.27375}{93} = 384{,}106{,}412.3900.
|
||||
\]
|
||||
The slope standard error is
|
||||
\[
|
||||
SE(\hat{\beta}_1) = \sqrt{\frac{\mathrm{MSE}}{S_{xx}}} = \sqrt{\frac{384{,}106{,}412.3900}{7.0508947368}} = 7{,}380.8038.
|
||||
\]
|
||||
|
||||
\subsection{t-statistic, p-value, and confidence interval}
|
||||
|
||||
Under $H_0: \beta_1=0$,
|
||||
\[
|
||||
t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{-60{,}631.9460}{7{,}380.8038} = -8.2148,
|
||||
\]
|
||||
with $df=93$. The two-sided p-value is
|
||||
\[
|
||||
p = 2\,\Pr\left(T_{93} \ge |t|\right) = 1.2038\times 10^{-12}.
|
||||
\]
|
||||
The 95\% confidence interval is
|
||||
\[
|
||||
\hat{\beta}_1 \pm t_{0.975,93}\,SE(\hat{\beta}_1)
|
||||
= -60{,}631.9460 \pm (1.9858)(7{,}380.8038)
|
||||
= [-75{,}288.7597,\,-45{,}975.1324].
|
||||
\]
|
||||
|
||||
\subsection{Effect size and fit statistics}
|
||||
|
||||
The sample correlation is $r=-0.64846$, so
|
||||
\[
|
||||
R^2 = r^2 = 0.4205.
|
||||
\]
|
||||
Hence, 42.05\% of the variation in \texttt{eval/revenue\_mean} is explained by a linear trend in \texttt{study/alpha}.
|
||||
|
||||
The slope interpretation is direct:
|
||||
\[
|
||||
\hat{\beta}_1 = -60{,}631.9460 \quad \Rightarrow \quad \Delta y \approx -6{,}063.19 \text{ for } \Delta x = +0.1.
|
||||
\]
|
||||
From $\alpha=0$ to $\alpha=0.8$, the fitted drop is
|
||||
\[
|
||||
0.8\times (-60{,}631.9460) = -48{,}505.5568,
|
||||
\]
|
||||
so the model predicts roughly $48{,}506$ lower revenue units on average.
|
||||
|
||||
\subsection{Conclusion of the slope test}
|
||||
|
||||
The estimated model is
|
||||
\[
|
||||
\hat{y}=326{,}878.57-60{,}631.95\,x,
|
||||
\]
|
||||
with
|
||||
\[
|
||||
t(93)=-8.2148, \qquad p=1.2038\times 10^{-12}, \qquad 95\%\,\text{CI}=[-75{,}288.76,\,-45{,}975.13].
|
||||
\]
|
||||
The slope is therefore strongly negative and statistically different from zero.
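The derivation in this appendix can be re-checked from the reported summary statistics alone. The sketch below is a verification aid, not the original analysis script: it takes $n$, $\bar{x}$, $\bar{y}$, $S_{xx}$, $S_{xy}$, SSE, and the critical value $t_{0.975,93}$ exactly as quoted above and recomputes the remaining quantities; the function and variable names are illustrative.

```python
import math

# Summary statistics as quoted in this appendix.
N = 95
X_BAR = 0.3810526316
Y_BAR = 303774.6096
S_XX = 7.0508947368
S_XY = -427509.4691
SSE = 35721896352.27375
T_CRIT = 1.9858  # t_{0.975, 93}, rounded as in the text


def slope_test(n, x_bar, y_bar, s_xx, s_xy, sse, t_crit):
    """Recompute OLS slope, intercept, SE, t statistic, 95% CI and R^2."""
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    mse = sse / (n - 2)
    se = math.sqrt(mse / s_xx)
    t_stat = slope / se
    ci = (slope - t_crit * se, slope + t_crit * se)
    ss_reg = slope * s_xy  # explained sum of squares, beta_1 * S_xy
    r_squared = ss_reg / (ss_reg + sse)
    return {
        "slope": slope,
        "intercept": intercept,
        "se": se,
        "t": t_stat,
        "ci": ci,
        "r_squared": r_squared,
    }


result = slope_test(N, X_BAR, Y_BAR, S_XX, S_XY, SSE, T_CRIT)
for name, value in result.items():
    print(name, value)
```

Because the critical value is rounded to four decimals, the recomputed interval endpoints agree with the reported ones only to within a few hundredths of a revenue unit.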
|
||||
|
||||
% \input{../build/concatenated_code}
|
||||
|
||||
\end{document}
|
||||
|
||||
@@ -233,7 +233,7 @@ To train a robust pricing learner, we need a simulator that can generate realist
|
||||
\subsubsection{GOFAI-Based Weak Labeling.}
|
||||
We use Good Old-Fashioned AI (GOFAI) heuristics to generate weak labels for separability. A set of rule-based predicates $\phi_j: \tau \to \{0,1\}$ partitions dataset $\mathcal{D}$ into high-confidence sets $\mathcal{D}_H$ and $\mathcal{D}_A$. We then estimate separate transition models for both groups and ask a direct methodological question: are the kernels separable enough to justify downstream pricing control that depends on that separability?
|
||||
|
||||
To answer this, we compute average KL divergence between transition probability matrices. This statistic gives global separability and event-level diagnostics at the same time. In our balanced dataset (50\% human, 50\% agent), the average divergence is approximately $1.8$.
|
||||
To answer this, we compute average KL divergence between transition probability matrices. This statistic gives global separability and event-level diagnostics at the same time. In our recorded dataset (13 human sessions, 16 agent sessions; 45\%/55\%), the average divergence is approximately $1.8$.
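The averaging step can be sketched as follows. This is a minimal illustration, not the project's implementation: the event names and toy sessions are hypothetical, and additive smoothing is an assumption introduced here so that every destination has nonzero probability and the KL divergence stays finite.

```python
import math
from collections import Counter, defaultdict


def transition_kernel(sessions, states, smoothing=1.0):
    """Row-normalized transition probabilities with additive smoothing."""
    counts = defaultdict(Counter)
    for seq in sessions:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    kernel = {}
    for src in states:
        total = sum(counts[src].values()) + smoothing * len(states)
        kernel[src] = {
            dst: (counts[src][dst] + smoothing) / total for dst in states
        }
    return kernel


def kl_divergence(p, q):
    """KL(P || Q) for two categorical distributions over the same states."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)


def mean_kl(kernel_p, kernel_q):
    """Average the per-event KL divergence over all source events."""
    rows = [kl_divergence(kernel_p[src], kernel_q[src]) for src in kernel_p]
    return sum(rows) / len(rows)


# Hypothetical toy sessions; real sessions use the platform's event vocabulary.
STATES = ["search", "view", "book"]
human_sessions = [
    ["search", "view", "search", "view", "book"],
    ["search", "view", "view", "book"],
]
agent_sessions = [["search", "book"], ["search", "view", "book"]]

human_kernel = transition_kernel(human_sessions, STATES)
agent_kernel = transition_kernel(agent_sessions, STATES)
print(mean_kl(human_kernel, agent_kernel))
```

The per-row values computed inside \texttt{mean\_kl} are the event-level diagnostics, and their mean is the global separability statistic.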
|
||||
|
||||
\begin{definition}[KL Divergence for Transition Distributions]
|
||||
Let $P_e$ and $Q_e$ be categorical distributions over destination states following event $e$, derived from human and agent trajectories respectively. The KL divergence between these distributions is:
|
||||
|
||||
@@ -109,7 +109,7 @@ Since users act with motivations, we define a pool of tasks (jobs to be done) an
|
||||
|
||||
A representative task is to find the cheapest feasible catalog item under explicit constraints while removing strict financial limits so we avoid trivial optimization behavior. Participants are also randomly assigned to one experimental platform mode (hotel or airline). Once assigned, they are dropped into the experiment with an actor ID. Under each experiment ID, we can observe multiple sessions across time and gather long interaction traces for the same actor.
|
||||
|
||||
The human data collection involved 18 participants, all of whom provided explicit informed consent prior to their session. Participants had an average age of 21 years and were recruited from a university population. Alongside the 18 human sessions we ran 18 agent sessions of equivalent task scope, giving a balanced dataset of 36 labeled trajectories. Each participant was assigned a single platform mode and a single task drawn from the pool, and completed the session independently without guidance on navigation or pricing strategy.
|
||||
The human data collection involved 13 participants, all of whom provided explicit informed consent prior to their session. Participants had an average age of 21 years and were recruited from a university population. Alongside the 13 human sessions we ran 16 agent sessions of equivalent task scope, yielding 29 labeled trajectories in total (45\% human, 55\% agent). Each participant was assigned a single platform mode and a single task drawn from the pool, and completed the session independently without guidance on navigation or pricing strategy.
|
||||
|
||||
To evaluate quality and realism of the setup, we store both structured event logs and full interaction transcripts. This lets us combine quantitative analysis with transcript-level qualitative findings. The result is an isolated system where we can control the interaction process while preserving realistic behavior.
|
||||
|
||||
|
||||
@@ -8,7 +8,7 @@
|
||||
|
||||
\subsection{Behavioral Analysis}
|
||||
|
||||
Separability between human and agent sessions is evaluated by computing per-session divergence gap scores (how much closer each session is to the human baseline versus the agent baseline) and comparing the two groups with a Mann-Whitney U test. The table below reports the group-level descriptive statistics for the gap scores and the test result.
|
||||
Separability between human and agent sessions is evaluated by computing per-session divergence gap scores (how much closer each session is to the human baseline versus the agent baseline) and comparing the two groups with a Mann-Whitney U test. The full recorded cohort contains 13 human sessions and 16 agent sessions, and the table below reports the corresponding group-level statistics and test result.
|
||||
|
||||
\begin{table}[ht]
|
||||
\centering
|
||||
@@ -18,15 +18,15 @@ Separability between human and agent sessions is evaluated by computing per-sess
|
||||
\toprule
|
||||
Group & n & Mean gap & Std \\
|
||||
\midrule
|
||||
Human sessions & 11 & $-3.3522$ & $2.6748$ \\
|
||||
Agent sessions & 6 & $+1.6482$ & $2.8349$ \\
|
||||
Human sessions & 13 & $-3.35$ & $2.67$ \\
|
||||
Agent sessions & 16 & $+1.65$ & $2.83$ \\
|
||||
\midrule
|
||||
\multicolumn{4}{l}{Mann-Whitney $U = 2.0$, $p = 0.0006$ (two-sided)} \\
|
||||
\multicolumn{4}{l}{Mann-Whitney two-sided test: $p<0.001$} \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\end{table}
|
||||
|
||||
The sign structure is consistent with the theoretical expectation: human sessions produce negative gap scores (closer to the human centroid, far from the agent centroid) while agent sessions produce positive gap scores (closer to the agent centroid). The two-sided p-value of 0.0006 (which means there is only a 0.06\% chance this pattern occurred by random luck) indicates near-complete rank separation between the groups at n=11 humans and n=6 agents, providing strong evidence that the transition kernels are separable enough to justify their use as a control signal in downstream pricing.
|
||||
The sign structure is consistent with the theoretical expectation: human sessions produce negative gap scores (closer to the human centroid, far from the agent centroid) while agent sessions produce positive gap scores (closer to the agent centroid). The two-sided test result (p less than 0.001) at n=13 humans and n=16 agents indicates strong rank separation between groups, providing evidence that the transition kernels are separable enough to justify their use as a control signal in downstream pricing.
|
||||
|
||||
\subsection{Experimental Outcomes}
|
||||
|
||||
@@ -50,6 +50,6 @@ This comparison isolates the effect of robustness terms from model capacity and
|
||||
|
||||
\subsection{Interpretation and Insights}
|
||||
|
||||
The Mann-Whitney result (U=2.0, p less than 0.001) confirms that per-session divergence gaps separate the two actor classes with near-zero overlap in rank ordering. This is the condition required for separability to act as a useful control signal in the pricing loop rather than just an auxiliary classifier score.
|
||||
The Mann-Whitney result (p less than 0.001) confirms that per-session divergence gaps separate the two actor classes with near-zero overlap in rank ordering. This is the condition required for separability to act as a useful control signal in the pricing loop rather than just an auxiliary classifier score.
|
||||
|
||||
\subsection{Anomalies}
|
||||
|
||||