
\section{Conclusion}
\label{sec:conclusion}

This thesis examined reinforcement-learning policies for dynamic pricing when a fraction of traffic is orchestrated by non-human agents intent on extracting information before purchase. We introduced COI-oriented metrics, a behavioral distinguishability layer, and a distributionally robust training loop; empirical runs show where robustness helps and where it must be tuned.

\subsection{Summary of contributions}
The work rests on four interdependent components, and we outline the contribution of each. The theoretical component formalizes why agent-mediated reconnaissance erodes pricing power; the behavioral component establishes that such contamination is detectable from interaction traces alone; the control component translates that distinguishability into a robust pricing mechanism; and the systems component provides the controlled experimental environment required to observe, test, and reproduce these effects.

\begin{itemize}
    \item TPU-accelerated parallelization of the behavioral simulation and reinforcement learning pipeline, making large factorial sweeps tractable.
    \item Formalization of non-human transaction orchestration in e-commerce as a distinct source of contamination in dynamic pricing systems.
    \item Definition of the Cost of Information (COI) as a mechanism-level quantity for pricing power, together with a theorem on its erosion under increasing agent saturation.
    \item Design and implementation of a controlled e-commerce research platform on a hybrid Kappa--Lambda architecture for collecting and replaying high-fidelity interaction trajectories.
    \item Construction and empirical validation of a behavioral distinguishability framework that separates human and agent sessions from interaction signals alone using transition kernels and KL-based divergence.
    \item A generative contamination mechanism that injects learned agent behavior into the pricing environment for controlled robustness experiments.
    \item Translation of distinguishability scores into defensive pricing via distributionally robust reinforcement learning under non-stationary contamination.
    \item Evidence that contamination depresses revenue and that robustness gains are regime-dependent, so penalties and radii need calibration rather than a single default.
    \item Release of a public experimental artifact (code and dataset) for reproducing and extending work on agent-mediated traffic.
\end{itemize}

\subsection{Limitations and future work}

Several constraints are intentional and could be relaxed in later work. Action weights in the demand proxy are hand-set; learning them from data is an obvious next step. The Stackelberg interface assumes a clean alternation between platform move and market response; richer histories (multi-agent, multi-platform) would need a less rigid state definition. The simulator assumes non-perishable catalog supply, which widens the sim-to-real gap for inventory-constrained domains. Within-session contamination is modeled as stable; a time-varying $\alpha$ inside a session would better match some attack patterns.

Before any deployment of the methodology presented in this work, human baselines should grow beyond the convenience sample used here, and catalog scaling laws should be re-checked when transition matrices grow with SKU count.