
\section{Results}
\label{sec:results}

% The gap we target is not detection for its own sake but whether behavioral signals can support pricing decisions once agent traffic is present. This section follows the supporting questions in \cref{sec:research_questions}: we first establish session-level distinguishability (behavioral evidence and a rank test), then estimate how contamination shifts revenue in a controlled sweep, and finally compare robust and baseline policies under factorial training with COI and revenue readouts. The ordering is deliberate---each stage feeds the next so that separability, contamination effects, and policy outcomes form one connected line of evidence.

The gap we target is not detection for its own sake but whether behavioral signals can support pricing decisions once agent traffic is present. This section follows the path laid out in \cref{sec:research_questions}: we first establish session-level distinguishability (behavioral evidence and a rank test), then estimate how contamination shifts revenue in an adversarial environment, and finally compare robust and baseline pricing under factorial training.

\begin{figure}[ht]
    \centering
    \input{chapters/figures/supra/supra.tex}
    \caption{Evolution of price distributions over experiment steps. The heatmap shows the density of offered prices. This early baseline simulation illustrates supra-competitive price-setting by deep reinforcement learning agents such as Soft Actor-Critic, visible as the high density at the highest available price.}
    \label{fig:supra_heatmap}
\end{figure}


\subsection{Behavioral Analysis}

Distinguishability between human and agent sessions is evaluated by computing per-session divergence gap scores $\Delta_{H,s} - \Delta_{A,s}$ and comparing the two groups with a Mann-Whitney $U$ test. The full recorded cohort contains $n_H=13$ human sessions and $n_A=16$ agent sessions, and Table~\ref{tab:divergence_significance} reports the corresponding group-level statistics and test result.

\begin{table}[ht]
\centering
\caption{Per-session divergence gap ($\Delta_H - \Delta_A$) by actor class with Mann-Whitney $U$ test.}
\label{tab:divergence_significance}
\begin{tabular}{lccc}
\toprule
Group & $n$ & Mean gap & Std \\
\midrule
Human sessions & 13 & $-3.35$ & $2.67$ \\
Agent sessions & 16 & $+1.65$ & $2.83$ \\
\midrule
\multicolumn{4}{l}{Mann-Whitney two-sided test: $p<0.001$} \\
\bottomrule
\end{tabular}
\end{table}

The sign structure is consistent with the theoretical expectation: human sessions produce negative gap scores (closer to the human centroid, farther from the agent centroid), while agent sessions produce positive gap scores (closer to the agent centroid). The two-sided test result ($p<0.001$) at $n_H=13$, $n_A=16$ indicates strong rank separation between the groups, providing evidence that the transition kernels are distinguishable enough to justify their use as a control signal in downstream pricing.
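
For reference, the rank statistic behind this test can be written in its standard pairwise form. The shorthand $g_s = \Delta_{H,s} - \Delta_{A,s}$ and the index sets $\mathcal{H}$ and $\mathcal{A}$ for human and agent sessions are notation introduced here for illustration only:
\[
U_{\mathcal{A}} = \sum_{a \in \mathcal{A}} \sum_{h \in \mathcal{H}} \Bigl( \mathbf{1}[g_a > g_h] + \frac{1}{2}\,\mathbf{1}[g_a = g_h] \Bigr),
\qquad 0 \le U_{\mathcal{A}} \le n_H n_A = 13 \times 16 = 208 .
\]
A value of $U_{\mathcal{A}}$ near $208$ would mean that almost every agent session has a larger gap than almost every human session, while a value near $104$ would indicate no rank separation. Table~\ref{tab:divergence_significance} reports only the $p$-value, so this expression fixes the scale of the statistic rather than its observed value.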


\subsection{Experimental Outcomes}

To evaluate robustness contributions, we compare two policies on the same environment family: (i) robust pricing with COI-aware reward and adversarial contamination step, and (ii) a baseline policy with revenue-only reward.

We report two preliminary stages before the full factorial interpretation. First, we executed a short calibration run at $\alpha=0.3$ (2 evaluation episodes, 3000 training timesteps per tier) across \texttt{qtable}, \texttt{ppo}, \texttt{a2c}, and \texttt{dqn}. In that first run, \texttt{ppo} produced the highest objective score and revenue (objective $=3.76\times10^{5}$, revenue $=4.15\times10^{5}$), while the remaining tiers stayed lower in this small-budget regime. The corresponding price traces show a monotone escalation for \texttt{ppo} (mean price from $86.1$ to $149$), whereas \texttt{qtable}, \texttt{a2c}, and \texttt{dqn} remained nearly flat over the episode horizon. This confirms that the simulation loop can express policy-dependent pricing dynamics rather than collapsing into a single trajectory shape.


\subsubsection{The Impact of Contamination on Revenue}

The contamination--revenue slope is estimated on a controlled cohort (single sweep, baseline policy, $n_{\text{products}}=100$, $n=95$). In this setting, contamination $\alpha$ is set exogenously by the experiment, so the slope identifies the within-sweep causal effect of contamination on revenue under fixed policy and environment settings. These results address our second research question \hyperlink{sq2}{\textbf{SQ2}} (\textit{Theoretical Impact}) from \cref{sec:research_questions}.
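
A minimal way to write the specification behind this slope, assuming a simple linear model fitted by ordinary least squares (the symbols $R_i$, $\beta_0$, $\beta_1$, and $\varepsilon_i$ are introduced here for illustration and are not defined elsewhere in the text):
\[
R_i = \beta_0 + \beta_1 \alpha_i + \varepsilon_i, \qquad i = 1, \dots, 95,
\]
where $R_i$ denotes the revenue of observation $i$ and $\alpha_i$ its assigned contamination level. Under this reading, $\beta_0$ is the expected revenue at $\alpha=0$ and $\beta_1$ is the expected change in revenue per unit increase in $\alpha$.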

\begin{table}[ht]
\centering
\caption{Regression of revenue on contamination $\alpha$, with an HC1 heteroskedasticity-robust check of the slope.}
\label{tab:contamination_slope_table}
\begin{tabular}{@{}lrrrrr@{}}
\toprule
Term & Coef. & Std. Err. & $t$ & $p>|t|$ & 95\% CI \\
\midrule
Intercept & 348,823.41 & 784.29 & 444.77 & $<10^{-99}$ & $[347{,}264.96,\,350{,}381.86]$ \\
$\alpha$ & $-90{,}140.53$ & 1,466.90 & $-61.45$ & $4.27\times10^{-77}$ & $[-93{,}053.38,\,-87{,}227.68]$ \\
\midrule
HC1 robust check ($\alpha$) & $-90{,}140.53$ & 2,185.22 & $-41.25$ & $1.42\times10^{-61}$ & -- \\
\bottomrule
\end{tabular}
\end{table}

Interpreted on the contamination grid, a $+0.1$ increase in $\alpha$ corresponds to an average revenue decrease of about $9{,}014$ units, and the robust check preserves both direction and significance.
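
The arithmetic behind these statements can be checked directly from Table~\ref{tab:contamination_slope_table}. The symbols $\hat{\beta}_1$ (slope estimate) and $\Delta R$ (revenue change) are shorthand introduced here, and the $93$ degrees of freedom assume a two-parameter fit on $n=95$ observations:
\[
\Delta R = 0.1 \times \hat{\beta}_1 = 0.1 \times (-90{,}140.53) \approx -9{,}014,
\]
\[
t = \frac{-90{,}140.53}{1{,}466.90} \approx -61.45, \qquad
t_{\mathrm{HC1}} = \frac{-90{,}140.53}{2{,}185.22} \approx -41.25,
\]
\[
\hat{\beta}_1 \pm t_{0.975,\,93}\,\mathrm{SE}(\hat{\beta}_1) \approx -90{,}140.53 \pm 1.986 \times 1{,}466.90 \approx -90{,}140.53 \pm 2{,}913 .
\]
The last line agrees with the reported 95\% interval up to rounding. The HC1 standard error is roughly $1.5$ times the classical one, which lowers the magnitude of the test statistic but leaves the slope far from zero.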
% TODO: add a compact proposal note for re-running tests with statsmodels in the appendix methodology notes.

\subsubsection{Large Scale Factorial Training}

In our complete training runs we logged $\approx 180$ days of net compute time. The results we draw from this extensive training are:
\begin{enumerate*}[label=(\roman*)]
  \item the ability to extract COI is greater when robustness is present within the training loop;
  \item short-term revenue measurements suffer a loss of $\approx 3\%$, but the COI margin compensates for this loss in the long run;
  \item a larger catalog size contributes positively to COI preservation under higher contamination ratios; and
  \item supra-competitive pricing is a natural reward-hacking tendency that is drastically reduced by a balanced UX penalty.
\end{enumerate*}

\begin{figure}[ht]
    \centering
    \input{chapters/figures/results/includes/final_focus_revenue_by_alpha.tex}
    \caption{Revenue curves by contamination for the final cohort. The baseline remains above the defended curve in most cells, but the gap narrows in the high-contamination region.}
    \label{fig:final_focus_revenue_by_alpha}
\end{figure}

\begin{figure}[ht]
    \centering
    \input{chapters/figures/results/includes/final_focus_coi_by_alpha.tex}
    \caption{COI level curves by contamination for the final cohort. The shaded band marks the per-$\alpha$ gap between defended and baseline policies.}
    \label{fig:final_focus_coi_by_alpha}
\end{figure}

\begin{figure}[ht]
    \centering
    \input{chapters/figures/results/includes/final_focus_coi_preservation_grid.tex}
    \caption{COI preservation by product count at the contamination endpoints ($\alpha=0.0$ and $\alpha=1.0$). Bars report defended-minus-baseline mean COI level, with the zero line separating preservation from erosion.}
    \label{fig:final_focus_coi_preservation_grid}
\end{figure}
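
The quantity plotted in Figure~\ref{fig:final_focus_coi_preservation_grid} can be written compactly. The notation below is a hypothetical shorthand, with $\overline{\mathrm{COI}}$ denoting the mean COI level within a cell (assumed here to be an average over runs) and $n$ the product count:
\[
G(\alpha, n) = \overline{\mathrm{COI}}_{\mathrm{defended}}(\alpha, n) - \overline{\mathrm{COI}}_{\mathrm{baseline}}(\alpha, n), \qquad \alpha \in \{0.0,\, 1.0\}.
\]
In this notation, $G > 0$ corresponds to preservation and $G < 0$ to erosion relative to the baseline.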


\subsection{Interpretation and Insights}
The Mann-Whitney result ($p<0.001$) indicates that per-session divergence gaps distinguish the two actor classes with little overlap in rank ordering. This is the condition required for distinguishability to act as a useful control signal in the pricing loop rather than merely an auxiliary classifier score. The result bears directly on our first pillar \hyperlink{sq1}{\textbf{SQ1}} (\textit{Distinguishability}) from \cref{sec:research_questions}.

The first calibration and paired benchmark runs additionally confirm three practical points aligned with the thesis. First, the control loop is reproducible end-to-end (training, evaluation, artifact generation) across algorithms and contamination levels. Second, policy class materially changes price trajectories and resulting COI/revenue profiles under identical environment settings. Third, objective improvements from robustness are regime-dependent in the current baseline, which is consistent with the thesis claim that contamination-aware pricing needs explicit calibration rather than a one-size-fits-all penalty.

We also note that maximizing revenue in isolation can favor aggressive high-price behavior: even in our early runs, the non-robust aggregate shows slightly higher mean COI and margin. For this reason, all subsequent reporting in this thesis is interpreted on a multi-metric basis (objective, revenue, COI, and stability) rather than by revenue alone. This bears directly on our third pillar \hyperlink{sq3}{\textbf{SQ3}} (\textit{Robust Mitigation}) from \cref{sec:research_questions}.


\subsection{Anomalies}
In our initial runs, we observed an instability pocket in one completed run (A2C, robust, seed 11, $\alpha=0.30$), with a large performance drop relative to neighboring configurations. We retain this run in the preliminary summary to avoid survivorship bias and treat it as evidence that a robustness sensitivity analysis is necessary before drawing final conclusions.