\section{Methodology}

% Extra notes and clarifications: we observed some humans and get their transition probabilities between event types
% We modify behavioral profiles of transition matrices with price elasticity matrices generated by sample valuations of a distributing.

This section details the theoretical and practical framework developed to address dynamic pricing under the influence of non-human actors. We begin by formalizing the problem environment and the nature of the actors. We then derive the \textit{Cost of Information} (COI) theorem, proving the erosion of pricing power in the limit of agent saturation. Following this, we outline our generative contamination strategy using GOFAI-driven separability and transition probability learning. Finally, we formulate the robust control problem as a Stackelberg game solved via Distributionally Robust Reinforcement Learning (DR-RL) with constructed ambiguity sets.

\subsection{Problem Formalization}

We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $Y_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.

Each session produces a trajectory of observable events $\tau_s = (e_{s,1}, \ldots, e_{s,L_s})$. An event $e_{s,k}$ is a tuple defined as:
\begin{equation}
e_{s,k} = (a_{s,k}, i_{s,k}, t_{s,k})
\end{equation}
where:
\begin{itemize}
    \item $a_{s,k} \in \mathcal{A}$ is the action taken (e.g., \texttt{view\_item}, \texttt{add\_to\_cart}).
    \item $i_{s,k} \in \{1, \ldots, N\}$ is the target item index.
    \item $t_{s,k} \in \mathbb{R}_+$ is the continuous timestamp.
\end{itemize}

The platform does not directly observe the true underlying demand function $d(p)$. Instead, it observes a behavioral proxy $\hat{q}_t$, which is a composite signal derived from the mixture of actor types. We define the demand proxy for product $i$ at epoch $t$ as a weighted aggregation of events:
\begin{equation}
\label{eq:qhat}
\hat{q}_{t,i} = \sum_{s \in \mathcal{S}_t} \sum_{k=1}^{L_s} \omega(a_{s,k}) \cdot \mathbb{1}[i_{s,k} = i]
\end{equation}
where $\omega: \mathcal{A} \to \mathbb{R}_+$ assigns weights to actions based on their signal strength regarding willingness to pay.

In the current engine implementation, we use the normalized variant of this proxy for each step:
\begin{equation}
\tilde q_{t,i} = 100 \cdot \frac{\hat q_{t,i}}{\sum_{j=1}^{N}\hat q_{t,j} + \varepsilon}
\end{equation}
with fixed category-level weights (cart, dwell, nav, filter) following the same rank order from Table~\ref{tab:action_space}. This keeps the signal dense and directly usable in the simulator.

\subsubsection{Actor Types and Demand Curves}
We formalize the heterogeneity of actors by introducing a type space $\Theta$. An actor of class $Y_s$ is further parameterized by a type $\theta \sim \mathcal{D}_{Y}$. This type determines the actor's demand response function $d(p; \theta)$, sampled from a distribution of possible demand curves. The total observed demand is a stochastic process governed by the naively defined mixture:
\begin{equation}
\label{eq:mixture_demand}
Q(p) = (1-\alpha) \cdot \mathbb{E}_{\theta \sim \mathcal{D}_H}[d(p; \theta)] + \alpha \cdot \mathbb{E}_{\theta \sim \mathcal{D}_A}[d(p; \theta)] + \epsilon_t
\end{equation}
where $\alpha \in [0, 1]$ represents the contamination parameter (proportion of agents) and $\epsilon_t$ is non-stationary market noise.


\subsection{Cost of Information (COI) Framework}

The platform's pricing power comes from information asymmetry: users who express strong interest signals pay more than the base price. We quantify this markup as the \textit{Cost of Information} (COI), which represents the average premium extracted above marginal cost. COI measures the revenue at risk when information asymmetry collapses.

\begin{definition}[Cost of Information]
Let $\pi(\tau)$ be a pricing policy mapping interaction histories to prices. The COI is defined as:
\begin{equation}
\text{COI} = \mathbb{E}[P] - \underline{p}
\end{equation}
where $\mathbb{E}[P]$ is the expected price charged by the policy and $\underline{p}$ is the minimum viable price (marginal cost).
% Alternative survival function representation (used in proof):
% COI = \int_{\underline{p}}^{\bar{p}} (1 - F_\pi(p)) \, dp
% where F_\pi(p) is the CDF of prices generated by \pi
\end{definition}

\begin{figure}[ht]
    \centering
    \begin{tikzpicture}[scale=1.2]
        % Define the Gaussian function: centered at 2
        \def\bellcurve(#1){1.5 * exp(-0.5*((#1-2)/0.6)^2)}

        % Draw the main axis
        \draw[->, thick] (0, 0) -- (4.5, 0) node[right] {$p$};
        \draw[->, thick] (0, 0) -- (0, 2) node[above] {Density};

        \draw[thick, smooth, samples=100] plot[domain=0:4] (\x, {\bellcurve(\x)});
        \node at (3.2, 1.2) {$f_\pi(p)$};

        % Define p_min and E[p]
        \def\pmin{0.8}
        \def\mean{2}

        % Vertical lines
        \draw[dashed] (\pmin, 0) -- (\pmin, 2.0);
        \draw[dashed] (\mean, 0) -- (\mean, 2.0);

        % Labels on axis
        \node[below] at (\pmin, 0) {$\underline{p}$};
        \node[below] at (\mean, 0) {$\mathbb{E}[p]$};

        \draw[<->, thick, red] (\pmin, 2.0) -- (\mean, 2.0) node[midway, above] {COI};

    \end{tikzpicture}
    \caption{Illustration of the Cost of Information (COI). The COI is defined as the difference between the expected price $\mathbb{E}[p]$ realized by the policy and the minimum viable price $\underline{p}$.}
    \label{fig:coi_illustration}
\end{figure}

We now formally demonstrate that standard dynamic pricing mechanisms are not incentive-compatible with high-frequency agentic traffic. As the number of independent competitive agents $N$ querying the system grows, the platform's ability to sustain a COI vanishes.

  A fundamental assumption for our claim lays in the alignment of the AI agent through it's prompt which has been demonstrated by \cite{fish_algorithmic_2025} to cause strong collusive behavior under linguistic nudges. This assumption can be generalized to the human user asking the agent to research products with a minimizing objective.

\begin{theorem}[COI Erosion in the Limit]
Let $N$ be the number of independent, utility-maximizing agents querying the platform. Let $p_{(1)}$ be the first order statistic (minimum) of the prices offered to these agents. As $N \to \infty$, the Cost of Information converges to 0.
\end{theorem}


\begin{proof}
Consider $N$ independent agents querying the platform, each receiving a price sample $p_i$ drawn from the pricing policy's distribution $F(p)$ with support $[\underline{p}, \bar{p}]$. A strategic agent conducting reconnaissance will select the minimum observed price: $p_{(1)} = \min(p_1, \ldots, p_N)$.
% support here means that its the range of possible outputs.
The probability that the minimum price exceeds some threshold $t$ is:
\begin{equation}
P(p_{(1)} > t) = P(\text{all } p_i > t) = [1 - F(t)]^N
\end{equation}

For any price $t > \underline{p}$, the CDF satisfies $F(t) > 0$, so $1 - F(t) < 1$. As $N$ grows, this probability decays exponentially: $[1 - F(t)]^N \to 0$.

The expected minimum price can be written as:
\begin{equation}
\mathbb{E}[p_{(1)}] = \underline{p} + \int_{\underline{p}}^{\bar{p}} [1 - F(t)]^N \, dt
\end{equation}

Since the integrand vanishes as $N \to \infty$ for all $t > \underline{p}$, the integral converges to zero. Therefore:
\begin{equation}
\lim_{N \to \infty} \text{COI} = \lim_{N \to \infty} (\mathbb{E}[p_{(1)}] - \underline{p}) = 0
\end{equation}
\end{proof}


This result naively proves that standard pricing policies $\pi$ fail to extract surplus in the presence of large-scale agentic search, necessitating a robust counter-mechanism.

% The DRO objective creates a lower bound on COI extraction, effectively guaranteeing a minimum margin even in the presence of adversarial agents. we need to prove this and demonstrate that in a theorem.


%Mathematical demonstration and validation of the COI and citation backed evidence, and framework overview + show harm to user via other cost distortions. Maybe split into 3.2.1 (COI Theory) and 3.2.2 (Framework Design)

\subsection{System Architecture: Hybrid Kappa-Lambda Architecture}

In order for our research to have grounding in interactions we built a robust e-commerce web-platform. We initially conducted a survey of the leading platforms of airlines and hotel booking sites to identify the specific interface patterns that effectively manage complex travel data. Our analysis revealed a clear industry standard: while both sectors rely on tabbed service selection and left-sidebar filtering to streamline navigation, they diverge in result presentation: airlines utilize visual date-price bars and multi-step wizards to optimize for logistical transparency, whereas hotel platforms leverage image-led cards and scarcity triggers to drive emotional engagement and urgency. Our web framework defines a highly agnostic boilerplate which can be seeded with any data-modality with an easy-to-tailor pattern, which we leverage to define a \texttt{hotel} and \texttt{airline} mode. Both modes are then individually deployed via an environment level argument which adjusts the proxy routing with a custom middleware inside next.js to render only the desired mode. The purpose of this was to create a baseline adaptable to any use-case or desired commercial application.


The architecture of this platform begins with the deployed web-apps posting interaction data to our backend which processes them and stores each ingested interaction into a kafka cluster. This serves as our data reservoir tracking and associating each interaction with its session and importantly with which experiment it belongs to. Not only do we track the behavioral interactions, but our pricing provider micro-service, once called by the frontend reports the observed/queried price-product into kafka. This kafka cluster is subscribed to by our pipeline which is configured on a schedule in Airflow, with the possibility of manual trigger. The final stage of the pricing pipeline, submits computed dynamic pricing results into a redis database for quick updates which is then read by the pricing provider and displayed on the webapp. This is a very generic end-to-end mechanism which is applicable to a variety of different e-commerce tasks. We intentionally put emphasis on the development of this infrastructure to establish a reproducible framework for interaction and to minimize any noise.

We transition the Kappa like architecture of the data collection to a Lambda system for actual learning in a surrogate environment. This allows us to move faster on data which is provided and helps us create a feedback loop for production deployment.


\subsubsection{DevOps Principles}

Reproducible results are key to quality research platforms, this is taken into mind when deploying and working with our research platform. From a deployment standpoint the platform can be deployed across a large variety of providers and can be run locally. When developing a new interaction modality apart from the ones that come out of the box, a simple template pattern can be followed. The middleware of the framework is designed to properly render the chosen modality from environmental variables, thus deployment of different or parallel version of the software can be easily parametrized.

\subsubsection{Online Dynamic Pricing}

In order to collect data from actors under correct conditions we replicate a naive and simple dynamic pricing algorithm which runs in the background during the experiments.
The dynamic pricing done is handled by a pipeline which computes a demand estimate on a per-product basis of a specific window of the data, defined by the period $T$ which by default is 5 minutes. This dynamic pricing pipeline computes a demand estimate vector $\hat{q} \in \mathbb{R}^N$ by a weighted sum of interactions for each product, it additionally computes a price elasticity vector $\hat{\epsilon}$ in the same dimensions as our demand. The final features matrix is of the size $N \times 2$ which we translate to a new price vector $\hat{p} \in \mathbb{R}^N$. The transformation that governs this dynamic pricing is a very simple surge-based pricing (a special case of our later defined policy $\pi$):

\begin{equation}
\hat{p}_i = \begin{cases}
p_{0,i} \cdot \lambda_{\text{surge}} & \text{if } \hat{q}_i \geq \theta_{\text{high}} \\
p_{0,i} \cdot \lambda_{\text{disc}} & \text{if } \hat{q}_i \leq \theta_{\text{low}} \\
p_{0,i} & \text{otherwise}
\end{cases}
\quad \forall i \in \{1, \ldots, N\}
\end{equation}

where $p_0 \in \mathbb{R}^N$ is the base price vector (which is seeded into our database distinctly for each mode of the commerce platform), $\theta_{\text{high}}, \theta_{\text{low}} \in \mathbb{R}$ are demand thresholds defining surge and discount regions, and $\lambda_{\text{surge}}, \lambda_{\text{disc}} \in \mathbb{R}^+$ are multiplicative factors with typical values $\lambda_{\text{surge}} = 1.2$ and $\lambda_{\text{disc}} = 0.9$. This piecewise function enables rapid price adjustment in response to observed demand without requiring complex elasticity estimation or historical calibration, allowing us to expose actors within our experiments to a system with a dynamic component of pricing.

For our offline experimental setting, we generalize a master value function that can encompass different demand estimation and pricing strategies.

\begin{align}
V(\cdot) = \max_{p_t} \min_{Q \in \mathcal{U}(\hat{d})}{\mathbb{E}_{d\sim Q} [p_t \times d(p_t, x_t ; \theta) + \psi V_{t+1}(\cdot)]}
\end{align}

We evaluate different substitutions of this objective, which later serve as hyperparameters in the simulator.

\subsection{Experimental Design}

We start from a practical constraint: we do not have access to proprietary production data. Because of that, we design our own fictional platform that still represents how commercial platforms work in the real world. The design comes from a survey of hotel and airline websites, where we extracted common interface components and used them as a high-level template for dynamic pricing environments.

The interface is organized as a product catalog where each product belongs to a time-bounded price vector (for example, a daily pricing period). During each period we collect interaction data by instrumenting UI components and predefined action templates that are still customizable. This gives us control without losing realism.

Since users act with motivations, we define a pool of tasks (jobs to be done) and assign tasks randomly to participants. A representative task is to find the cheapest feasible catalog item under explicit constraints while removing strict financial limits so we avoid trivial optimization behavior. Participants are also randomly assigned to one experimental platform mode (hotel or airline). Once assigned, they are dropped into the experiment with an actor ID. Under each experiment ID, we can observe multiple sessions across time and gather long interaction traces for the same actor.

To evaluate quality and realism of the setup, we store both structured event logs and full interaction transcripts. This lets us combine quantitative analysis with transcript-level qualitative findings. The result is an isolated system where we can control the interaction process while preserving realistic behavior.

Operationally, goals and experiment runs are tracked in PostgreSQL (goal table, run table, and assignment mapping). This data-acquisition phase is the first half of the methodology and is intentionally a disconnected component that feeds the later contributions. The second half uses collected behavioral traces to separate classes $y \in \{A,H\}$ with session-conditioned probability estimates, then injects those estimates into the pricing learner.

Our process follows three stages: (1) observe and \textit{vectorize} behavioral interactions, (2) learn separability to characterize human versus agent patterns, and (3) use the learned signal to train a defensive policy in a controlled dynamic-pricing simulator.

\begin{figure}[ht]
  \resizebox{\columnwidth}{!}{%
    \input{chapters/loop_figure.tex}
  }
  \caption{Overview of the Dynamic Pricing Tasks.}
\end{figure}

Our web platform (developed in similar spirit to RecSim \parencite{ie_recsim_2019}) gives us a controlled environment where tasks are assigned to human and agentic actors and then executed. Each actor receives a browser-level experiment identifier that may persist across multiple session IDs. We then group by experiment and extract session trajectories using the schema below.

To speak to realism, user interviews reported that the platform architecture mirrored standard booking interfaces and reduced the cognitive load required to learn the system. One participant described the flow as ``intuitive'' and close to a ``normal'' transaction, suggesting observed behavior was primarily driven by pricing treatment rather than interface novelty.

The dynamic pricing mechanism elicited immediate behavioral adjustments. Participants were sensitive to price volatility: sudden boosts triggered urgency and faster booking attempts, while large listing-to-final discrepancies triggered deeper comparison behavior. This is comforting because the controlled setup still produces commercially relevant interaction data.


\subsubsection{Design of Training Factorial Study}

The simulator has multiple configurable factors, including valuation distributions, demand parametrization, contamination ratio, and policy settings. We therefore design a multi-factor study (current grid estimate: $4\times4\times3\times2\times2$). While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable and logged with services provided by weights and biases.

\subsubsection{Interaction Schema}

We extend the basic event tuple $e_{s,k}$ to capture the full observational signal available to the platform. An interaction event is defined as the extended tuple:
\begin{equation}
e_{s,k} = \left( a_{s,k}, \, i_{s,k}, \, t_{s,k}, \, \mu_{s,k}, \, \delta_{s,k} \right)
\end{equation}
where $\mu_{s,k} \in \mathcal{M}$ is a metadata record containing action-specific context (e.g., price observed, filter parameters, element text), and $\delta_{s,k} \in \mathbb{R}_+$ is the dwell time in milliseconds for attention-based actions.

A session $s$ is itself a structured record:
\begin{equation}
s = \left( \text{sid}, \, \text{eid}, \, t_0, \, \phi, \, \mathcal{U}, \, \tau_s \right)
\end{equation}
where $\text{sid}$ is a unique session identifier (UUID), $\text{eid}$ optionally links to an experiment, $t_0$ is the session start timestamp, $\phi \in \{\texttt{hotel}, \texttt{airline}\}$ denotes the platform mode, $\mathcal{U}$ is the user-agent string, and $\tau_s$ is the trajectory of events.

The action space $\mathcal{A}$ is partitioned into four semantic categories based on the behavioral signal each action conveys:

\begin{table}[ht]
\centering
\caption{Action space partition $\mathcal{A} = \mathcal{A}_{\text{nav}} \cup \mathcal{A}_{\text{cart}} \cup \mathcal{A}_{\text{filter}} \cup \mathcal{A}_{\text{dwell}}$ with signal interpretation.}
\label{tab:action_space}
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{Category} & \textbf{Actions} & \textbf{Signal} & $\boldsymbol{\omega}$ \\
\midrule
$\mathcal{A}_{\text{cart}}$ & \texttt{add\_item}, \texttt{remove}, \texttt{checkout}, \texttt{purchase} & Purchase intent & High \\
$\mathcal{A}_{\text{dwell}}$ & \texttt{hover\_title}, \texttt{hover\_paragraph}, \texttt{hover\_link} & Sustained attention & Medium \\
$\mathcal{A}_{\text{nav}}$ & \texttt{page\_view}, \texttt{view\_item}, \texttt{learn\_more} & Discovery & Low \\
$\mathcal{A}_{\text{filter}}$ & \texttt{search}, \texttt{filter\_date}, \texttt{filter\_price}, \texttt{sort} & Preference refinement & Lowest \\
\bottomrule
\end{tabular}
\end{table}

This partition enables the weight function $\omega$ from Eq.~\ref{eq:qhat} to assign category-specific signal strengths, with $\omega(\mathcal{A}_{\text{cart}}) > \omega(\mathcal{A}_{\text{dwell}}) > \omega(\mathcal{A}_{\text{nav}}) > \omega(\mathcal{A}_{\text{filter}})$ reflecting decreasing commitment.

In the simulator baseline this order is encoded with a compact fixed scale: cart $=4.0$, dwell $=2.0$, nav $=1.0$, filter $=0.5$. Unknown actions are mapped by prefix heuristics to the nearest category.

The metadata record $\mu$ varies by action type. For product views, $\mu$ contains the observed price $p_{\text{obs}}$ and product attributes. For dwell events, $\mu$ includes the element text and accumulated hover duration. This heterogeneous structure is captured via a schema-on-read approach in our Kafka ingestion pipeline, where events are validated against type-specific schemas before storage.

In addition to behavioral events, the platform logs price observations to a separate Kafka topic. Each price query generates a record $(i, p, \text{sid}, \phi, t)$ associating the product, displayed price, requesting session, platform mode, and timestamp. This dual-stream architecture enables joint analysis of price exposure and behavioral response.


\subsection{Generative Contamination and Separability}

To train a robust pricing learner, we need a simulator that can generate realistic interaction data under controlled contamination. We build this from Phantom data using a two-stage approach.


\subsubsection{GOFAI-Based Separability}
We use Good Old-Fashioned AI (GOFAI) heuristics to generate weak labels for separability. A set of rule-based predicates $\phi_j: \tau \to \{0,1\}$ partitions dataset $\mathcal{D}$ into high-confidence sets $\mathcal{D}_H$ and $\mathcal{D}_A$. We then estimate separate transition models for both groups and ask a direct methodological question: are the kernels separable enough to justify downstream pricing control that depends on that separability?

To answer this, we compute average KL divergence between transition probability matrices. This statistic gives global separability and event-level diagnostics at the same time. In our balanced dataset (50\% human, 50\% agent), the average divergence is approximately $1.8$.

\begin{definition}[Kullback-Leibler Divergence for Transition Distributions]
Let $P_e$ and $Q_e$ be categorical distributions over destination states following event $e$, derived from human and agent trajectories respectively. The KL divergence between these distributions is:
\begin{equation}
  D_{\mathrm{KL}}(P_e \parallel Q_e) = \sum_{k \in \mathcal{S}_e} P_e(k) \log \frac{P_e(k)}{Q_e(k)}
\end{equation}
where $\mathcal{S}_e$ denotes the set of destination events that follow $e$ in the human trajectories.
\end{definition}

To obtain this statistic, we aggregate transitions by triggering event $e$ and treat normalized outgoing probabilities as categorical distributions $P_e$ (human) and $Q_e$ (agent). We intersect shared event labels, then accumulate log-ratio contributions over shared destinations. Large contributions, including near-zero $Q_e(k)$ cases, identify transitions where one actor class is difficult to mimic.

With these divergence features we train a contrastive model to estimate a weak agent probability $f(\tau)\in[0,1]$, which we later use as a weighting and control signal.


\subsubsection{Transition Probability Estimation}
\label{sec:tpe}


For both subsets, we model session dynamics as an MDP and estimate transition kernel $\mathcal{T}$. For each actor type we estimate global kernels $\hat{\mathcal{T}}_A$ and $\hat{\mathcal{T}}_H$, then cluster into behavioral sub-kernels $\hat{\mathcal{T}}_y^i$ to avoid collapsing all behavior into one average profile. Transition probabilities are estimated by maximum likelihood:
\begin{equation}
    \hat{P}(s' \mid s) = \frac{N(s, s')}{\sum_{k \in \mathcal{S}} N(s, k)}
\end{equation}
where $N(s, s')$ is the observed transition count. This allows us to construct a \textit{Contamination Generator} $\mathcal{G}(\alpha)$. Given a clean trajectory dataset, $\mathcal{G}$ injects synthetic agent trajectories sampled from $\hat{\mathcal{T}}_A$ until the effective mixing ratio reaches $\alpha$.

To scale this to catalog-level pricing, we expand the base event transition matrix from $T\times T$ into product-specific transitions using the current demand condition. In practice, we normalize the demand vector across products and use it to weight how much transition mass each product pair receives. Concretely, each cell of the base matrix becomes an $N\times N$ block (for $N$ products), so the transition matrix grows from $T\times T$ to $(T\cdot N)\times(T\cdot N)$. Finally, we add $C$ generic states (homepage, login, checkout terminal states), which gives the full kernel size $(T\cdot N + C)\times(T\cdot N + C)$.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.8\textwidth]{chapters/mdp_human.pdf}
    \caption{Markov Decision Process visualization illustrating the behavioral transition dynamics for human actions.}
    \label{fig:human_mdp_viz}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.8\textwidth]{chapters/mdp_agent.pdf}
    \caption{Markov Decision Process visualization illustrating the behavioral transition dynamics for \textbf{agent} behavior profiles. The state space and transition probabilities are learned from observed session trajectories to enable generative contamination.}
    \label{fig:agent_mdp_viz}
  \end{figure}


\subsection{Second-Stage Classification}
After contamination, we run a second classification stage. We remap events into a semantically aligned feature space, apply richer feature engineering, and retrain to obtain cleaner label probabilities across the full dataset. This classifier is then used directly in the reinforcement-learning reward structure.

Now might be a good time to stand up and go for a quick walk before returning to the rest of this paper.


\subsection{Distributionally Robust Reinforcement Learning (DR-RL)}

We formulate pricing as a Stackelberg game: the platform (leader) sets prices $p_t$, and the population (follower) responds through trajectories and demand. A useful intuition is that the platform behaves like a distorted mirror at a 45-degree angle: what it mirrors is population demand into an estimated demand proxy, and that proxy drives revenue.

Because contamination level $\alpha$ and demand shift are non-stationary online, a simple error term is not enough. We therefore use a Distributionally Robust Optimization objective. Let $\tau'$ be a newly observed trajectory generated by an unknown actor profile (sampled from the behavioral models in Section~\ref{sec:tpe}). We need a demand mapping conditioned on price and trajectory, $\hat{Q}(p,\tau')$. For each $\tau'$, we compute $\hat{\mathcal{T}}'$ and compare it with controlled baselines $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$:

\begin{align}
  \label{eq:delta_H}
  \Delta_H &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_H) \\
  \label{eq:delta_A}
  \Delta_A &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_A)
\end{align}

This yields two centroid-like heuristics that guide contamination estimation at session granularity.

In implementation, we maintain an alternating game-history stack (our \textit{Limbo} stack) and execute it explicitly every epoch with exactly two transitions: first the platform publishes a price vector (leader move), then the market responds with trajectory-derived demand (follower move).

% Mention discretized action space and the clipping and over shotting in continuous action spaces

\subsubsection{Ambiguity Set Construction}
We define an ambiguity set $\mathcal{U}_\epsilon(\hat{P}_N)$ centered around our empirical reference distribution $\hat{P}_N$ (derived from the generator $\mathcal{G}$). We utilize the Wasserstein distance metric to define the set of plausible demand distributions the agent might face:
\begin{equation}
\mathcal{U}_\epsilon(\hat{P}_N) = \left\{ Q \in \mathcal{P}(\Xi) : W_p(Q, \hat{P}_N) \le \epsilon \right\}
\end{equation}
This set captures all distributions that are statistically close to our observed training data but allows for adversarial shifts.

For the current engine baseline, we use a compact inner-robust approximation by applying ambiguity over contamination in a local interval around nominal contamination $\alpha_0$:
\begin{equation}
\mathcal{A}_{\epsilon_\alpha}(\alpha_0)=\left\{\alpha\in[0,1]:\lvert\alpha-\alpha_0\rvert\le\epsilon_\alpha\right\}
\end{equation}
and we evaluate a small fixed grid in $\mathcal{A}_{\epsilon_\alpha}(\alpha_0)$ per step, selecting the worst-case candidate for the learner.

\subsubsection{The Min-Max Objective}
The robust policy $\pi^*$ is obtained by solving the maximin problem:
\begin{equation}
\label{eq:robust_policy}
\pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}_{\text{leak}}(p,\tau') \right]
\end{equation}
where $R(p, d)$ is the revenue function and $\lambda$ weighs the information-leakage penalty.

In practice, we parameterize this with a session-level leakage term:
\begin{equation}
\text{COI}_{\text{leak}}(p,\tau') = f(\tau')\cdot \text{InfoValue}(p,\tau')
\end{equation}
where $f(\tau')$ is the weak agent probability and $\text{InfoValue}$ is implemented either as a constant query-tax surrogate or as a revelation surrogate $-\log\pi(p\mid\tau')$.

For the baseline engine reported here, we intentionally use the constant query-tax surrogate to keep the mechanism minimal:
\begin{equation}
r_t = R(p_t,\tilde q_t) - \lambda\,f(\tau_t')\,c_{\text{info}}
\end{equation}
with fixed $c_{\text{info}}>0$.


Another possible extension is to adapt the ambiguity radius online, e.g., $\epsilon(\Delta_H)$, so the Wasserstein ball changes with live divergence. We keep this as future work and retain a fixed-radius setup because Wasserstein ambiguity already handles heavy-tail and ``black swan'' behavior without absolute continuity assumptions \parencite{kuhn_wasserstein_2024}.

\subsubsection{Actor Implementation}
In our simulation, the ``follower'' is implemented as a set of Actors. Each Actor is initialized with a type $\theta$ which samples a specific demand curve $d(p; \theta)$ from the latent distribution. This formalization ensures that our DR-RL agent does not overfit to a single deterministic demand function but learns a policy robust to the distributional uncertainty defined by $\mathcal{U}_\epsilon$.

Practical implementation of browser agents is a strongly evolving field with near-weekly releases of SOTA architectures. In this thesis implementation we abstract that layer into trajectory generators learned from observed human/agent transition kernels.


As part of reward engineering, we keep a UX factor ($UX\in[0,1]$) as an auxiliary evaluation axis. In the current baseline it is not injected into the core reward; it is tracked separately to compare policy trade-offs.

\begin{figure}[ht]
  \centering
  \resizebox{0.5\columnwidth}{!}{%
    \input{chapters/balance_figure.tex}
  }
  \caption{Introducing the UX index allows us to better distinguish the kind of impact different methods have and allows us to compare them on this Pareto-like scale.}
\end{figure}

We also consider taxation-like overlays for agent traffic under strategy-proof mechanism design (e.g., Vickrey-Clarke-Groves style rules). This remains an extension path and is not part of the main implementation in this thesis.

\subsubsection{Pricing Mechanism Summary}

We now present the complete pricing mechanism that integrates the behavioral separability, contamination estimation, and robust optimization components developed in the preceding sections. Algorithm~\ref{alg:phantom_loop_clean} formalizes the defensive pricing loop as a Stackelberg game where the platform (leader) sets prices and the aggregate demand (follower) responds through observed session trajectories.

\begin{algorithm}[t]
\caption{PHANTOM defensive pricing loop}
\label{alg:phantom_loop_clean}
\DontPrintSemicolon
\SetKwInOut{Input}{Input}\SetKwInOut{Output}{Output}

\Input{catalog size \(N\); costs \(c\); reference prices \(p^{ref}\); behavior models \(\bar T_H,\bar T_A\);
action weights \(\omega\); penalty \(\lambda\); nominal contamination \(\alpha_0\); ambiguity radius \(\epsilon_\alpha\);
candidate count \(K\); horizon \(T\); sessions per step \(M\)}
\Output{price/demand trajectory \(\{(p_t,\hat Q_t,\hat\alpha_t)\}_{t=0}^{T-1}\)}

Initialize contamination estimate \(\hat\alpha \leftarrow 0.2\)\;

\For{\(t \leftarrow 0\) \KwTo \(T-1\)}{

  set \(p_t \leftarrow \pi(\cdot) \) %c + (1 - \kappa \hat\alpha)\,(p^{ref}-c)\)\;
  and clip \(p_t\) to a feasible range (e.g., near cost up to a max margin)\;


  \(\hat Q_t \leftarrow 0\), \(\mathcal S_t \leftarrow \emptyset\); \tcp{Observe sessions and compute demand proxy (Eq.~2)}
  \For{\(m \leftarrow 1\) \KwTo \(M\)}{
    sample a session trajectory \(\tau_m\) using \(\bar T_H\) or \(\bar T_A\)\;
    \(\hat Q_t \leftarrow \hat Q_t + \sum_{k}\omega(a_{m,k})\)\;
    \(\mathcal S_t \leftarrow \mathcal S_t \cup \{\tau_m\}\)\;
  }

  \tcp{Estimate contamination from behavioral separability}
  compute \(\hat\alpha \leftarrow \frac{1}{M}\sum_{\tau\in\mathcal S_t} \Big[\sigma\big(\beta(\Delta_H(\tau)-\Delta_A(\tau))\big)\Big]\)\;

  \tcp{Inner robust step over local ambiguity interval}
  define \(\mathcal{A}_{\epsilon_\alpha}(\alpha_0)\) and sample \(K\) candidates\;
  pick \(\alpha_t^* \leftarrow \arg\min_{\alpha\in\mathcal{A}_{\epsilon_\alpha}(\alpha_0)} \Big[\text{Revenue}(p_t,\hat Q_t^{\alpha}) - \lambda\cdot \text{COI}_{\text{leak}}(p_t,\tau_t^{\alpha})\Big]\)\;

  compute \(J_t \leftarrow \text{Revenue}(p_t,\hat Q_t^{\alpha_t^*}) - \lambda\cdot \text{COI}_{\text{leak}}(p_t,\tau_t^{\alpha_t^*})\)\;
}
\end{algorithm}


The algorithm operates in discrete epochs indexed by $t$. At each epoch, the platform publishes prices (leader move), observes resulting session trajectories (follower response), and updates contamination estimates based on divergence from learned human and agent kernels $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$. The history buffer $\mathcal{L}$ (``Limbo'' in our implementation) enforces the alternating Stackelberg structure by preserving the temporal sequence of price publications and demand observations.

%The defensive price update in Line 24 implements contamination-aware margin shrinkage: as estimated contamination $\hat{\alpha}_t$ rises, the margin $(p^{\mathrm{ref}} - c)$ is reduced by factor $\kappa\in[0,1]$, with projection $\Pi_{\mathcal{P}}$ ensuring feasibility. In subsequent experiments this heuristic rule is replaced by DR-RL policy $\pi^*$ from Eq.~\ref{eq:robust_policy}.