
\section{Results}
\begin{figure}[ht]
    \centering
    \input{chapters/figures/supra/supra.tex}
    \caption{Evolution of price distributions over experiment steps. The heatmap shows the density of offered prices. This early baseline simulation demonstrates supra-competitive price-setting by deep learning agents such as SAC, visible as the high density at the highest available price.}
    \label{fig:supra_heatmap}
\end{figure}


\subsection{Behavioral Analysis}

Distinguishability between human and agent sessions is evaluated by computing per-session divergence gap scores $\Delta_{H,s} - \Delta_{A,s}$ and comparing the two groups with a Mann-Whitney $U$ test. The full recorded cohort contains $n_H=13$ human sessions and $n_A=16$ agent sessions, and Table~\ref{tab:divergence_significance} reports the corresponding group-level statistics and test result.

\begin{table}[ht]
\centering
\caption{Per-session divergence gap ($\Delta_{H,s} - \Delta_{A,s}$) by actor class, with the Mann-Whitney $U$ test result.}
\label{tab:divergence_significance}
\begin{tabular}{lccc}
\toprule
Group & $n$ & Mean gap & Std \\
\midrule
Human sessions & 13 & $-3.35$ & $2.67$ \\
Agent sessions & 16 & $+1.65$ & $2.83$ \\
\midrule
\multicolumn{4}{l}{Mann-Whitney two-sided test: $p<0.001$} \\
\bottomrule
\end{tabular}
\end{table}

The sign structure is consistent with the theoretical expectation: human sessions produce negative gap scores (closer to the human centroid than to the agent centroid), while agent sessions produce positive gap scores (closer to the agent centroid than to the human centroid). The two-sided test result ($p<0.001$) at $n_H=13$, $n_A=16$ indicates a strong rank separation between the groups, providing evidence that the transition kernels are distinguishable enough to justify their use as a control signal in downstream pricing.
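
For reference, the test statistic can be written out explicitly. The following is a sketch under the standard textbook definition, assuming no ties among gap scores and the large-sample normal approximation; the symbol $g_s$ is notation introduced here for illustration, and the analysis pipeline may instead use an exact or tie-corrected variant. Writing $g_s = \Delta_{H,s} - \Delta_{A,s}$ for the gap score of session $s$,
\[
  U = \sum_{i=1}^{n_H} \sum_{j=1}^{n_A} \mathbf{1}\!\left[\, g^{(H)}_i < g^{(A)}_j \,\right],
  \qquad 0 \le U \le n_H n_A = 208 .
\]
Under the null hypothesis of identically distributed gap scores, the moments at the reported sample sizes are
\[
  \mathrm{E}[U] = \frac{n_H n_A}{2} = \frac{13 \cdot 16}{2} = 104,
  \qquad
  \mathrm{Var}[U] = \frac{n_H n_A \,(n_H + n_A + 1)}{12} = \frac{13 \cdot 16 \cdot 30}{12} = 520,
\]
so the null standard deviation is $\sqrt{520} \approx 22.8$. Under this approximation, a two-sided $p<0.001$ corresponds to $|z| > 3.29$, that is, to a value of $U$ lying more than roughly $75$ away from $104$.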


\subsection{Experimental Outcomes}

To evaluate robustness contributions, we compare two policies on the same environment family: (i) robust pricing with a COI-aware reward and an adversarial contamination step, and (ii) a baseline policy with a revenue-only reward.

We report two preliminary stages before the full factorial interpretation. First, we executed a short calibration run at $\alpha=0.3$ (2 evaluation episodes, 3000 training timesteps per tier) across \texttt{qtable}, \texttt{ppo}, \texttt{a2c}, and \texttt{dqn}. In this run, \texttt{ppo} produced the highest objective score and revenue (objective $=3.76\times10^{5}$, revenue $=4.15\times10^{5}$), while the remaining tiers scored lower in this small-budget regime. The corresponding price traces show a monotone escalation for \texttt{ppo} (mean price rising from $86.1$ to $149$), whereas \texttt{qtable}, \texttt{a2c}, and \texttt{dqn} remained nearly flat over the episode horizon. This indicates that the simulation loop can express policy-dependent pricing dynamics rather than collapsing into a single trajectory shape.
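
As a quick reading of the reported escalation (simple arithmetic on the two mean prices quoted above, not an additional measurement), the relative increase in mean price for \texttt{ppo} is
\[
  \frac{149 - 86.1}{86.1} = \frac{62.9}{86.1} \approx 0.73,
\]
that is, an increase of roughly $73\%$ between the two ends of the reported price trace.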


\subsubsection{The Impact of Contamination on Revenue}

The contamination--revenue slope is estimated on a controlled cohort (single sweep, baseline policy, $n_{\text{products}}=100$, $n=95$). In this setting, contamination $\alpha$ is set exogenously by the experiment, so the slope identifies the within-sweep causal effect of contamination on revenue under fixed policy and environment settings.
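
A minimal formalization of this estimate is sketched below. It assumes a simple linear specification; the symbols $R_i$, $\beta_0$, $\beta_1$, $\varepsilon_i$, and $\hat{e}_i$ are notation introduced here for illustration and are not taken from the experiment code:
\[
  R_i = \beta_0 + \beta_1 \alpha_i + \varepsilon_i, \qquad i = 1, \dots, n,
\]
with ordinary least squares slope
\[
  \hat{\beta}_1 = \frac{\sum_{i} (\alpha_i - \bar{\alpha})(R_i - \bar{R})}{\sum_{i} (\alpha_i - \bar{\alpha})^2}.
\]
For a single-regressor model of this form, the HC1 heteroskedasticity-robust variance of the slope reduces to
\[
  \widehat{\mathrm{Var}}_{\mathrm{HC1}}(\hat{\beta}_1)
  = \frac{n}{n-2} \cdot
    \frac{\sum_{i} (\alpha_i - \bar{\alpha})^2 \, \hat{e}_i^{\,2}}
         {\left( \sum_{i} (\alpha_i - \bar{\alpha})^2 \right)^{2}},
  \qquad
  \hat{e}_i = R_i - \hat{\beta}_0 - \hat{\beta}_1 \alpha_i .
\]
The point estimate is the same under the classical and robust variants; only the standard error changes, which is why a robust check can alter significance but not the direction of the slope.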

\begin{table}[ht]
\centering
\caption{Slope verification table for contamination versus revenue (OLS-style report).}
\label{tab:contamination_slope_table}
\begin{tabular}{@{}lrrrrr@{}}
\toprule
Term & Coef. & Std. Err. & $t$ & $p>|t|$ & 95\% CI \\
\midrule
Intercept & 348,823.41 & 784.29 & 444.77 & $<10^{-99}$ & $[347{,}264.96,\,350{,}381.86]$ \\
$\alpha$ & $-90{,}140.53$ & 1,466.90 & $-61.45$ & $4.27\times10^{-77}$ & $[-93{,}053.38,\,-87{,}227.68]$ \\
\midrule
HC1 robust check ($\alpha$) & $-90{,}140.53$ & 2,185.22 & $-41.25$ & $1.42\times10^{-61}$ & -- \\
\bottomrule
\end{tabular}
\end{table}

Interpreted on the contamination grid, a $+0.1$ increase in $\alpha$ corresponds to an average revenue decrease of about $9{,}014$ units, and the robust check preserves both direction and significance.
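
The arithmetic behind this reading can be checked directly from Table~\ref{tab:contamination_slope_table}. The sketch below assumes that the fitted relationship is linear over the whole grid and, for the endpoint comparison, that the grid spans $\alpha \in [0,1]$; the hat notation is introduced here for illustration.
\[
  \Delta \hat{R} = \hat{\beta}_1 \, \Delta\alpha = -90{,}140.53 \times 0.1 \approx -9{,}014 .
\]
\[
  t = \frac{-90{,}140.53}{1{,}466.90} \approx -61.45,
  \qquad
  t_{\mathrm{HC1}} = \frac{-90{,}140.53}{2{,}185.22} \approx -41.25 .
\]
\[
  \hat{R}(1) = 348{,}823.41 - 90{,}140.53 = 258{,}682.88,
  \qquad
  \frac{|\hat{\beta}_1|}{\hat{\beta}_0} = \frac{90{,}140.53}{348{,}823.41} \approx 0.258 .
\]
Under these assumptions, moving from $\alpha=0$ to $\alpha=1$ corresponds to a fitted revenue loss of about $26\%$ relative to the intercept. The robust standard error is about $1.49$ times the classical one, which is compatible with some heteroskedasticity in the residuals, yet the $t$ ratio remains large in magnitude.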
% TODO: add a compact proposal note for re-running tests with statsmodels in the appendix methodology notes.

\subsubsection{Large Scale Factorial Training}

In our complete training runs we logged $\approx 180$ days of net compute time. The results we draw from this extensive training are:
\begin{enumerate*}[label=(\roman*)]
  \item the ability to extract COI is greater when robustness is present in the training loop;
  \item short-term revenue measurements suffer a loss of $\approx 3\%$, but the COI margin compensates for this loss in the long run;
  \item a larger catalog size contributes positively to COI preservation under higher contamination ratios; and
  \item supra-competitive pricing is a natural reward-hacking tendency, which is drastically reduced by a balanced UX penalty.
\end{enumerate*}
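
To make the comparisons in items (ii) and (iii) concrete, one way to write the defended-minus-baseline quantities is sketched below. The symbols are introduced here for illustration, and the averaging convention (a mean over seeds and evaluation episodes at each contamination level $\alpha$) is an assumption rather than a description of the evaluation code:
\[
  \delta_R(\alpha) = \bar{R}_{\text{def}}(\alpha) - \bar{R}_{\text{base}}(\alpha),
  \qquad
  \rho_R(\alpha) = \frac{\delta_R(\alpha)}{\bar{R}_{\text{base}}(\alpha)},
  \qquad
  \delta_C(\alpha) = \bar{C}_{\text{def}}(\alpha) - \bar{C}_{\text{base}}(\alpha),
\]
where $\bar{R}$ denotes mean revenue and $\bar{C}$ denotes mean COI level. Under this convention, a short-term revenue loss of about $3\%$ would correspond to $\rho_R \approx -0.03$, and COI preservation at a given $\alpha$ to $\delta_C(\alpha) > 0$.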

\begin{figure}[ht]
    \centering
    \input{chapters/figures/results/includes/final_focus_revenue_by_alpha.tex}
    \caption{Revenue curves by contamination for the final cohort. The baseline remains above the defended curve in most cells, but the gap narrows in the high-contamination region.}
    \label{fig:final_focus_revenue_by_alpha}
\end{figure}

\begin{figure}[ht]
    \centering
    \input{chapters/figures/results/includes/final_focus_coi_by_alpha.tex}
    \caption{COI level curves by contamination for the final cohort. The shaded band marks the per-$\alpha$ gap between defended and baseline policies.}
    \label{fig:final_focus_coi_by_alpha}
\end{figure}

\begin{figure}[ht]
    \centering
    \input{chapters/figures/results/includes/final_focus_coi_preservation_grid.tex}
    \caption{COI preservation by product count at the contamination endpoints ($\alpha=0.0$ and $\alpha=1.0$). Bars report defended-minus-baseline mean COI level, with the zero line separating preservation from erosion.}
    \label{fig:final_focus_coi_preservation_grid}
\end{figure}

\begin{figure}[ht]
    \centering
    \input{chapters/figures/results/includes/final_focus_revenue_delta.tex}
    \caption{Defended-minus-baseline revenue delta over contamination for the final cohort. The strongest high-contamination deviation begins at $\alpha=0.7$, followed by recovery toward near parity by $\alpha=1.0$.}
    \label{fig:final_focus_revenue_delta}
\end{figure}

\begin{figure}[ht]
    \centering
    \input{chapters/figures/results/includes/final_focus_risk_deltas.tex}
    \caption{Defended-minus-baseline leakage and volatility deltas for the final cohort. Leakage remains lower for the defended policy across the full contamination range.}
    \label{fig:final_focus_risk_deltas}
\end{figure}

\subsection{Interpretation and Insights}
The Mann-Whitney result ($p<0.001$) indicates that per-session divergence gaps separate the two actor classes in rank ordering, with limited overlap between the groups. This is the condition required for distinguishability to act as a useful control signal in the pricing loop rather than merely as an auxiliary classifier score.

The first calibration and paired benchmark runs additionally confirm three practical points aligned with the thesis. First, the control loop is reproducible end-to-end (training, evaluation, artifact generation) across algorithms and contamination levels. Second, policy class materially changes price trajectories and resulting COI/revenue profiles under identical environment settings. Third, objective improvements from robustness are regime-dependent in the current baseline, which is consistent with the thesis claim that contamination-aware pricing needs explicit calibration rather than a one-size-fits-all penalty.

We also note that maximizing revenue in isolation can favor aggressive high-price behavior; even in these early runs, the non-robust aggregate shows slightly higher mean COI and margin. For this reason, all subsequent reporting in this thesis is interpreted on a multi-metric basis (objective, revenue, COI, and stability), and not by revenue alone.


\subsection{Anomalies}
In our initial runs, we observed an instability pocket in one completed run (A2C, robust, seed 11, $\alpha=0.30$) with a large performance drop relative to neighboring configurations. We retain this run in the preliminary summary to avoid survivorship bias and treat it as evidence that robustness sensitivity analysis is necessary before final conclusions.