rephrasing some things and updating language

This commit is contained in:
2026-04-09 09:30:23 +02:00
parent 47b07daa6c
commit ace52e8e14
11 changed files with 70 additions and 67 deletions

View File

@@ -1,26 +1,22 @@
\section{Conclusion}
Our research has explored how reinforcement learning works within pricing systems and environments which are substantially disrupted by an adversarial participant. Our findings include the optimization for our newly introduced metrics.
This thesis examined reinforcement-learning policies for dynamic pricing when a fraction of traffic is orchestrated by non-human agents intent on extracting information before purchase. We introduced COI-oriented metrics, a behavioral distinguishability layer, and a distributionally robust training loop; empirical runs show where robustness helps and where it must be tuned.
\subsection{Summary of contributions}
The contribution was not without the advice of many experienced experts in the field. We thank Marco Casalaina VP Products, Core AI and AI Futurist at Microsoft for the initial critical discussion on the topic of dynamic pricing systems and the spark which has lead to this work. Eugene Bykovets, PhD pointing out the parallels in blockchain systems and the complexity of anonymous interaction and understanding of intent. Importantly, the contributions of Alberto Martín Izquierdo, my academic advisor for the support over and for taking on the challenge of this ambitious work. Many breakthroughs were thanks to numerous discussions with my peers on the topics covered here.
A thanks to the head of innovation at Amadeus for insight into the industry split on the topic of collapsing margins. Finally we acknowledge the power and use of generative AI technologies for in depth research, rapid prototyping and surfacing of key topics and niches.
Now we very explicitly mention what we contribute in this paper:
\begin{itemize}
\item TPU-accelerated parallelization of the behavioral simulation and reinforcement learning pipeline, making large-scale factorial sweeps tractable.
\item TPU-accelerated parallelization of the behavioral simulation and reinforcement learning pipeline, making large factorial sweeps tractable.
\item Formalization of non-human transaction orchestration in e-commerce as a distinct source of contamination in dynamic pricing systems.
\item Definition of the Cost of Information (COI) as a mechanism-level quantity for pricing power, together with a theorem showing its erosion under increasing agent saturation.
\item Design and implementation of a controlled e-commerce research platform, built on a hybrid Kappa-Lambda architecture, for collecting and replaying high-fidelity interaction trajectories.
\item Construction and empirical validation of a behavioral distinguishability framework that distinguishes human and agent sessions from interaction signals alone using transition kernels and KL-based divergence.
\item Development of a generative contamination mechanism that injects learned agent behavior into the pricing environment for controlled robustness experiments.
\item Translation of behavioral distinguishability into a defensive pricing mechanism through a distributionally robust reinforcement learning formulation of pricing under non-stationary contamination.
\item Empirical evidence that agent contamination reduces revenue and that robustness is condition-dependent, requiring explicit calibration rather than a one-size-fits-all penalty.
\item Release of a reusable public experimental artifact for reproducing and extending research on dynamic pricing under agent-mediated traffic.
\item Definition of the Cost of Information (COI) as a mechanism-level quantity for pricing power, together with a theorem on its erosion under increasing agent saturation.
\item Design and implementation of a controlled e-commerce research platform on a hybrid Kappa--Lambda architecture for collecting and replaying high-fidelity interaction trajectories.
\item Construction and empirical validation of a behavioral distinguishability framework that separates human and agent sessions from interaction signals alone using transition kernels and KL-based divergence.
\item A generative contamination mechanism that injects learned agent behavior into the pricing environment for controlled robustness experiments.
\item Translation of distinguishability scores into defensive pricing via distributionally robust reinforcement learning under non-stationary contamination.
\item Evidence that contamination depresses revenue and that robustness gains are regime-dependent, so penalties and radii need calibration rather than a single default.
\item Release of a public experimental artifact (code and dataset) for reproducing and extending work on agent-mediated traffic.
\end{itemize}
\subsection{Future Works and Next Steps}
\subsection{Limitations and future work}
In our effort to tackle this work we initiated a set of constraints which we hope to relax in future iterations and hope that some of these will be addressed in industry. First of these constraints is the weighting of different actions within the demand estimation, which we would ideally find through learned methodology. Next, assumption of perfect alternating turns between the platform and the market calls for a fixed length non-strictly alternating state definition with a history of actions to possibly allow for the development of multi agentic or multi platform simulation. In our simulation we also make assumptions of non-perishable supply of items, which creates the biggest sim-to-real gap in our system. We also would like to further remove intra-session stationary nature of the contamination parameter to further create high-fidelity non-stationarity within a single evaluation window.
Several constraints are intentional and could be relaxed later. Action weights in the demand proxy are hand-set; learning them from data is an obvious next step. The Stackelberg interface assumes a clean alternation between platform move and market response; richer histories (multi-agent, multi-platform) would need a less rigid state definition. Non-perishable catalog supply in the simulator widens the sim-to-real gap for inventory-constrained domains. Within-session contamination is modeled as stable; time-varying $\alpha$ inside a session would better match some attack patterns.
For deployment of this it is advised to collect a higher sample size of human baselines and to complement this with the simulated agentic sessions and to mind the matrix scaling for very large catalog sizes.
Before any deployment, human baselines should grow beyond the convenience sample used here, and catalog scaling laws should be re-checked when transition matrices grow with SKU count. For the deployment of this methodology presented in our work.