\section{Methodology}

% Extra notes and clarifications: we observed some humans and get their transition probabilities between event types
% We modify behavioral profiles of transition matrices with price elasticity matrices generated by sample valuations of a distributing.

This section details the theoretical and practical framework developed to address dynamic pricing under the influence of non-human actors. We begin by formalizing the problem environment and the nature of the actors. We then derive the \textit{Cost of Information} (COI) theorem, proving the erosion of pricing power in the limit of agent saturation. Following this, we outline our generative contamination strategy using GOFAI-driven distinguishability and transition probability learning. Finally, we formulate the robust control problem as a Stackelberg game solved via Distributionally Robust Reinforcement Learning (DR-RL) with constructed ambiguity sets.

\subsection{Problem Formalization}

We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $Y_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.

Each session produces a trajectory of observable events $\tau_s = (e_{s,1}, \ldots, e_{s,L_s})$. An event $e_{s,k}$ is a tuple defined as:
\begin{equation}
e_{s,k} = (a_{s,k}, i_{s,k}, t_{s,k})
\end{equation}
where:
\begin{itemize}
    \item $a_{s,k} \in \mathcal{A}$ is the action taken (e.g., \texttt{view\_item}, \texttt{add\_to\_cart}).
    \item $i_{s,k} \in \{1, \ldots, N\}$ is the target item index.
    \item $t_{s,k} \in \mathbb{R}_+$ is the continuous timestamp.
\end{itemize}

The platform does not directly observe the true underlying demand function $d(p)$ where $d \in \mathbb{R}^{+}$ and our proxy $\hat{q} \in \mathbb{R}^{+}$. Instead, it observes a behavioral proxy $\hat{q}_t$, which is a composite signal derived from the mixture of actor types. We define the demand proxy for product $i$ at epoch $t$ as a weighted aggregation of events:
\begin{equation}
\label{eq:qhat}
\hat{q}_{t,i} = \sum_{s \in \mathcal{S}_t} \sum_{k=1}^{L_s} \omega(a_{s,k}) \cdot \mathbf{1}[i_{s,k} = i]
\end{equation}
where $\omega: \mathcal{A} \to \mathbb{R}_+$ assigns weights to actions based on their signal strength regarding willingness to pay.

In the current engine implementation, we use the normalized variant of this proxy for each step:
\begin{equation}
\tilde q_{t,i} = 100 \cdot \frac{\hat q_{t,i}}{\sum_{j=1}^{N}\hat q_{t,j} + \varepsilon}
\end{equation}
with fixed category-level weights (cart, dwell, nav, filter) following the same rank order from Table~\ref{tab:action_space}. This keeps the signal dense and directly usable in the simulator.

\subsubsection{Actor Types and Demand Curves}
We formalize the heterogeneity of actors by introducing a type space $\Theta$. An actor of class $Y_s$ is further parameterized by a type $\theta \sim \mathcal{D}_{Y_s}$. This type determines the actor's demand response function $d\!\left(p \mid Y_s,\theta\right)$, sampled from a distribution of possible demand curves. In compact form, demand remains price-dependent as $d(p\mid Y=y)$. The total observed demand is a stochastic process governed by the naively defined mixture:
\begin{equation}
\label{eq:mixture_demand}
Q(p) = (1-\alpha) \cdot \mathbb{E}_{\theta \sim \mathcal{D}_H}[d(p\mid Y=H,\theta)] + \alpha \cdot \mathbb{E}_{\theta \sim \mathcal{D}_A}[d(p\mid Y=A,\theta)] + \epsilon_t
\end{equation}
where $\alpha \in [0, 1]$ represents the contamination parameter (proportion of agents) and $\epsilon_t$ is non-stationary market noise.
We address that the composition of two non-stationary variables can cause difficulty distinguishing the sources of possible dynamic composition in online environments, whether from market noise or agents specifically.
Accounting for behavioral and market variation, we also treat $\epsilon_t$ as absorbing serving-path variability from LLM infrastructure (e.g., batch-size-dependent inference behavior under changing load), which appears stochastic at the request level even under greedy decoding \parencite{horace_he_and_thinking_machines_lab_defeating_2025}.


\subsection{Cost of Information (COI) Framework}

The platform's pricing power comes from information asymmetry: users who express strong interest signals pay more than the base price. We quantify this markup as the \textit{Cost of Information} (COI), which represents the average premium extracted above marginal cost. The intuition behind this being a cost comes from the perspective of the user who is interacting with the platform, where the user is the one incurring that ``cost.'' COI measures the revenue at risk when information asymmetry collapses.
A top-level view in the current AI discourse is that sufficiently large productivity gains can induce vertical deflation through cost compression and supply expansion \parencite{rachitsky_marc_2026}. Our contribution is narrower and mechanism-level: even under long-run deflation, platform revenue still depends on short-run information costs to the user. We formalize that rent as the Cost of Information (COI) and study how agentic reconnaissance accelerates its erosion.

\begin{definition}[Cost of Information]
Let $\pi(\tau)$ be a pricing policy mapping interaction histories to prices. The COI is defined as:
\begin{equation}
\text{COI} = \mathbb{E}[P] - \underline{p}
\end{equation}
where $\mathbb{E}[P]$ is the expected price charged by the policy and $\underline{p}$ is the minimum viable price (marginal cost).
% Alternative survival function representation (used in proof):
% COI = \int_{\underline{p}}^{\bar{p}} (1 - F_\pi(p)) \, dp
% where F_\pi(p) is the CDF of prices generated by \pi
\end{definition}

\begin{figure}[ht]
    \centering
    \begin{tikzpicture}[scale=1.2]
        % Define the Gaussian function: centered at 2
        \def\bellcurve(#1){1.5 * exp(-0.5*((#1-2)/0.6)^2)}

        % Draw the main axis
        \draw[->, thick] (0, 0) -- (4.5, 0) node[right] {$p$};
        \draw[->, thick] (0, 0) -- (0, 2) node[above] {Density};

        \draw[thick, smooth, samples=100] plot[domain=0:4] (\x, {\bellcurve(\x)});
        \node at (3.2, 1.2) {$f_\pi(p)$};

        % Define p_min and E[p]
        \def\pmin{0.8}
        \def\mean{2}

        % Vertical lines
        \draw[dashed] (\pmin, 0) -- (\pmin, 2.0);
        \draw[dashed] (\mean, 0) -- (\mean, 2.0);

        % Labels on axis
        \node[below] at (\pmin, 0) {$\underline{p}$};
        \node[below] at (\mean, 0) {$\mathbb{E}[p]$};

        \draw[<->, thick, red] (\pmin, 2.0) -- (\mean, 2.0) node[midway, above] {COI};

    \end{tikzpicture}
    \caption{Illustration of the Cost of Information (COI). The COI is defined as the difference between the expected price $\mathbb{E}[p]$ realized by the policy and the minimum viable price $\underline{p}$. The abstraction we assume is that the reservation price $\underline{p}$ already has some innate margin and would always result in at least a break-even transaction.}
    \label{fig:coi_illustration}
\end{figure}

We now formally demonstrate that standard dynamic pricing mechanisms are not incentive-compatible with high-frequency agentic traffic. As the number of independent competitive agents $N$ querying the system grows, the platform's ability to sustain a COI vanishes.

\paragraph{Assumption Scope}
The theorem and core experiments in this thesis assume a non-collusive independent-session setting: each agent queries prices independently and does not share sampled quotes across agents. Collusive coordination is outside the current proof scope and is treated as an extension scenario.

\begin{theorem}[COI Erosion in the Limit]
Let $N$ be the number of independent, utility-maximizing agents querying the platform. Let $p_{(1)}$ be the first order statistic (minimum) of the prices offered to these agents. As $N \to \infty$, the Cost of Information converges to 0.
\end{theorem}


\begin{proof}
Consider $N$ independent agents querying the platform, each receiving a price sample $p_i$ drawn from the pricing policy's distribution $F(p)$ bounded by $[\underline{p}, \bar{p}]$. A strategic agent conducting reconnaissance will select the minimum observed price: $p_{(1)} = \min(p_1, \ldots, p_N)$.
% support here means that its the range of possible outputs.
The probability that the minimum price exceeds some threshold $t$ is:
\begin{equation}
P(p_{(1)} > t) = P(\text{all } p_i > t) = [1 - F(t)]^N
\end{equation}

For any price $t > \underline{p}$, the CDF satisfies $F(t) > 0$, so $1 - F(t) < 1$. As $N$ grows, this probability decays exponentially: $[1 - F(t)]^N \to 0$.

The expected minimum price can be written as:
\begin{equation}
\mathbb{E}[p_{(1)}] = \underline{p} + \int_{\underline{p}}^{\bar{p}} [1 - F(t)]^N \, dt
\end{equation}

Since the integrand vanishes as $N \to \infty$ for all $t > \underline{p}$, the integral converges to zero. Therefore:
\begin{equation}
\lim_{N \to \infty} \text{COI} = \lim_{N \to \infty} (\mathbb{E}[p_{(1)}] - \underline{p}) = 0
\end{equation}
\end{proof}


This result implies that standard pricing policies $\pi$ cannot extract the same surplus under large-scale agentic search without additional structure, which motivates the robust control layer below.

% The DRO objective creates a lower bound on COI extraction, effectively guaranteeing a minimum margin even in the presence of adversarial agents. we need to prove this and demonstrate that in a theorem.


%Mathematical demonstration and validation of the COI and citation backed evidence, and framework overview + show harm to user via other cost distortions. Maybe split into 3.2.1 (COI Theory) and 3.2.2 (Framework Design)

\subsection{System Architecture: Hybrid Kappa-Lambda Architecture}

In order for our research to have grounding in interactions we built a robust e-commerce web-platform. We initially conducted a survey of the leading platforms of airlines and hotel booking sites to identify the specific interface patterns that effectively manage complex travel data. Our analysis revealed a clear industry standard: while both sectors rely on tabbed service selection and left-sidebar filtering to streamline navigation, they diverge in result presentation: airlines utilize visual date-price bars and multi-step wizards to optimize for logistical transparency, whereas hotel platforms leverage image-led cards and scarcity triggers to drive emotional engagement and urgency. Our web framework defines a highly agnostic boilerplate which can be seeded with any data-modality with an easy-to-tailor pattern, which we leverage to define a \texttt{hotel} and \texttt{airline} mode. Both modes are then individually deployed via an environment-level argument which adjusts the proxy routing with custom middleware in Next.js to render only the desired mode. The purpose of this was to create a baseline adaptable to any use-case or desired commercial application.

The architecture begins with deployed web applications posting interaction data to a backend that stores each record in Apache Kafka. Kafka acts as the reservoir linking sessions to experiments. Behavioral events and, separately, price observations from the pricing-provider microservice (invoked by the frontend) land in Kafka topics. A scheduled Airflow pipeline (with manual triggers) consumes the stream; the final pricing stage writes vectors to Redis for low-latency reads by the provider and display in the client. The pattern is deliberately standard---Kafka for durability and replay, Redis for serving---so the same skeleton applies across e-commerce settings. We invested in this stack to keep runs reproducible and to limit extraneous variance.

\paragraph{Public Web Artifact} We transition the Kappa like architecture of the data collection to a Lambda architecture for actual learning in a surrogate environment. This allows us to move faster on data which is provided and helps us create a feedback loop for production deployment. To support further research in this intersection of fields we release P4P \footnote{\url{https://github.com/velocitatem/p4p}} as a public repository providing the interaction layer of the PHANTOM framework. This provides a configurable storefront which can be tailored to any commercial setting with a standardized session-level event tracking. We document the API adapters and expected schemas for pricing providers and log ingestion services. The repository is intended for controlled experimentation and method replication rather than production commerce deployment.

\paragraph{Public Dataset} For reproducibility of the behavioral analysis and distinguishability experiments, we also release the interaction dataset used in this thesis as \textit{WhoClickedIt}. The dataset is hosted on Hugging Face \footnote{\url{https://huggingface.co/datasets/velocitatem/whoclickedit}} and is distributed as one flattened event sheet (\texttt{whoclicked.csv}) with explicit labels (\texttt{actor\_type}, \texttt{is\_agent}, and \texttt{record\_type}). The dataset card on that page documents the schema, collection process, and known limitations.


\subsubsection{DevOps Principles}

Reproducibility guided deployment choices: the stack runs locally or on common cloud providers. New interaction modalities follow a small template; middleware reads environment variables so parallel deployments (e.g.\ staging versus production-like experiments) differ only in configuration, not in forked codebases.

\subsubsection{Online Dynamic Pricing}

To expose participants to state-dependent prices without over-constraining the study, we run a transparent surge--discount heuristic in the background during data collection.
The dynamic pricing done is handled by a pipeline which computes a demand estimate on a per-product basis of a specific window of the data, defined by the period $T$ which by default is 5 minutes. This dynamic pricing pipeline computes a demand estimate vector $\hat{q} \in \mathbb{R}^N$ by a weighted sum of interactions for each product, it additionally computes a price elasticity vector $\hat{\epsilon}$ in the same dimensions as our demand. The final features matrix is of the size $N \times 2$ which we translate to a new price vector $\hat{p} \in \mathbb{R}^N$.


The transformation that governs this dynamic pricing is a very simple surge-based pricing (a special case of our later defined policy $\pi$):

\begin{equation}
\hat{p}_i = \begin{cases}
 p_{0,i} \cdot \lambda_{\text{surge}} & \text{if } \hat{q}_i \geq \varrho_{\text{high}} \\
 p_{0,i} \cdot \lambda_{\text{disc}} & \text{if } \hat{q}_i \leq \varrho_{\text{low}} \\
 p_{0,i} & \text{otherwise}
\end{cases}
\quad \forall i \in \{1, \ldots, N\}
\end{equation}

where $p_0 \in \mathbb{R}^N$ is the base price vector (which is seeded into our database distinctly for each mode of the commerce platform), $\varrho_{\text{high}}, \varrho_{\text{low}} \in \mathbb{R}$ are demand thresholds defining surge and discount regions, and $\lambda_{\text{surge}}, \lambda_{\text{disc}} \in \mathbb{R}^+$ are multiplicative factors with typical values $\lambda_{\text{surge}} = 1.2$ and $\lambda_{\text{disc}} = 0.9$. This piecewise function enables rapid price adjustment in response to observed demand without requiring complex elasticity estimation or historical calibration, allowing us to expose actors within our experiments to a system with a dynamic component of pricing.

% For our offline experimental setting, we generalize a master value function that can encompass different demand estimation and pricing strategies.
%
% \begin{align}
% V(\cdot) = \max_{p_t} \min_{Q \in \mathcal{U}(\hat{d})}{\mathbb{E}_{d\sim Q} [p_t \times d(p_t, x_t ; \theta) + \psi V_{t+1}(\cdot)]}
% \end{align}
%
% We evaluate different substitutions of this objective, which later serve as hyperparameters in the simulator.

\subsection{Experimental Design}

We start from a practical constraint: we do not have access to proprietary production data. Because of that, we design our own fictional platform that still represents how commercial platforms work in the real world. The design comes from a survey of hotel and airline websites, where we extracted common interface components and used them as a high-level template for dynamic pricing environments.

The interface is organized as a product catalog where each product belongs to a time-bounded price vector (for example, a daily pricing period). During each period we collect interaction data by instrumenting UI components and predefined action templates that are still customizable. That yields controlled variation while keeping the interface credible.

Since users act with motivations, we define a pool of tasks (jobs to be done) and assign tasks randomly to participants.
We discuss limitations and choices made in this experimental design in Section~\ref{sec:limitations_risks}.
The task pool is stored as a structured table with fields \texttt{id}, \texttt{created\_at}, \texttt{task\_name}, \texttt{task\_description}, and \texttt{task\_def\_of\_done}. We formulate the tasks as compact jobs-to-be-done rather than as strict click scripts, because the target is to elicit realistic browsing and comparison behavior which can capture nuance of different people. In hotel mode the assigned tasks include \textit{Cheapest Room}, \textit{Cheapest Room w/ View}, \textit{MultiStep Cheapest Room}, \textit{The Digital Nomad (Executive)}, and \textit{The 3-Way Tradeoff (Desk + Quiet + Flexible)}. These prompts deliberately require critical thought in search, inspection of room details, comparison of amenities or images, return visits to the listing page, and a final booking decision which create a degree of cognitive load. In airline mode we use \textit{Last-Minute One-Way Flight}, where the actor must urgently travel to LAX from either SEA or JFK within the next 1--3 days, inspect at least a small set of candidate itineraries, and then book a reasonable earliest departure.
A representative task is to find the cheapest feasible catalog item under explicit constraints while removing strict financial limits so we avoid trivial optimization behavior. Participants are also randomly assigned to one experimental platform mode (hotel or airline). Once assigned, they are dropped into the experiment with an actor ID. Under each experiment ID, we can observe multiple sessions across time and gather long interaction traces for the same actor.

The human data collection involved 13 participants, all of whom provided explicit informed consent prior to their session. Participants had an average age of 21 years and were recruited from a university population. Alongside the 13 human sessions we ran 16 agent sessions of equivalent task scope, yielding 29 labeled trajectories in total (45\% human, 55\% agent). Each participant was assigned a single platform mode and a single task drawn from the pool, and completed the session independently without guidance on navigation or pricing strategy.

To evaluate quality and realism of the setup, we store both structured event logs and full interaction transcripts. This lets us combine quantitative analysis with transcript-level qualitative findings. The result is an isolated system where we can control the interaction process while preserving realistic behavior.

Operationally, goals and experiment runs are tracked in PostgreSQL (goal table, run table, and assignment mapping). This data-acquisition phase is the first half of the methodology and is intentionally a disconnected component that feeds the later contributions. The second half uses collected behavioral traces to distinguish classes $Y \in \{A,H\}$ with session-conditioned probability estimates, then injects those estimates into the pricing learner.

Our process follows three stages: (1) observe and \textit{vectorize} behavioral interactions, (2) learn distinguishability to characterize human versus agent patterns, and (3) use the learned signal to train a defensive policy in a controlled dynamic-pricing simulator.

Figure~\ref{fig:phantom_unified_architecture} summarizes the full mechanism from online interaction capture to divergence-based contamination scoring and robust control of pricing decisions.

\begin{figure}[ht]
  \centering
  \resizebox{\textwidth}{!}{%
    \input{chapters/hero_architecture_figure.tex}
  }
  \caption{Unified PHANTOM defense architecture. (a) Online serving and logging with behavioral and price-query streams. (b) Distinguishability layer that estimates KL divergence to human/agent prototypes and derives session-level contamination scores. (c) Distributionally robust pricing control that optimizes under an ambiguity set while penalizing COI leakage and tracking UX cost.}
  \label{fig:phantom_unified_architecture}
\end{figure}

\begin{figure}[ht]
  \resizebox{\columnwidth}{!}{%
    \input{chapters/loop_figure.tex}
  }
  \caption{Overview of the Dynamic Pricing Tasks.}
\end{figure}

Our web platform (developed in similar spirit to RecSim \parencite{ie_recsim_2019}) gives us a controlled environment where tasks are assigned to human and agentic actors and then executed. Each actor receives a browser-level experiment identifier that may persist across multiple session IDs. We then group by experiment and extract session trajectories using the schema below.

To speak to realism, user interviews reported that the platform architecture mirrored standard booking interfaces and reduced the cognitive load required to learn the system. One participant described the flow as ``intuitive'' and close to a ``normal'' transaction, suggesting observed behavior was primarily driven by pricing treatment rather than interface novelty.

The dynamic pricing mechanism elicited immediate behavioral adjustments. Participants were sensitive to price volatility: sudden boosts triggered urgency and faster booking attempts, while large listing-to-final discrepancies triggered deeper comparison behavior. The responses match what one expects from live commerce: sharp reactions to volatility and to list--checkout gaps, which supports external validity despite the lab setting.


\subsubsection{Design of Training Factorial Study}

The simulator has multiple configurable factors. We design a multi-factor study across five axes derived from the sweep configurations: (1) RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}; 4 levels), (2) contamination ratio $\alpha$ sampled from $[0.1, 0.6]$ at four representative levels, (3) robustness radius $\epsilon_\alpha \in \{0.0, 0.15, 0.3\}$ (3 levels), (4) COI penalty weight $\lambda_\text{coi}$ at two reference levels, and (5) pricing action granularity (two discretization settings for \texttt{action\_levels}); giving a grid of $4\times4\times3\times2\times2 = 192$ configurations. Behavioral distinguishability is assessed with a two-sample Mann--Whitney test on per-session divergence gap scores at cohort sizes $n_H=13$ and $n_A=16$.
While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.

Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160\,PFLOPS of aggregate compute (derivation in Appendix~\ref{app:compute_budget}), which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.

\begin{table}[ht]
\centering
\caption{Compact comparison of TPU generations used in the training stack.}
\label{tab:tpu_specs}
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{Feature} & \textbf{TPU v4} & \textbf{TPU v5e} & \textbf{TPU v6e (Trillium)} \\
\midrule
Peak BF16 per chip (TFLOPS) & 275 & 197 & 918 \\
HBM capacity per chip (GB) & 32 & 16 & 32 \\
HBM bandwidth per chip (GB/s) & 1200 & 819 & 1600 \\
TensorCores per chip & 2 & 1 & 1 \\
Interconnect topology & 3D mesh/torus & 2D torus & 2D torus \\
Max pod size (chips) & 4096 & 256 & 256 \\
\bottomrule
\end{tabular}
\end{table}

\begin{table}[ht]
\centering
\caption{TPU allocation used for the factorial study.}
\label{tab:tpu_allocation}
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{TPU Type} & \textbf{Total Chips} & \textbf{Zone(s)} & \textbf{Provisioning} \\
\midrule
v6e & 128 (64 + 64) & europe-west4-a, us-east1-d & Spot \\
v5e & 128 (64 + 64) & us-central1-a, europe-west4-b & Spot \\
v4 & 64 (32 + 32) & us-central2-b & 32 Spot + 32 On-demand \\
\bottomrule
\end{tabular}
\end{table}

For connections from Madrid, we prioritize the europe-west4 allocation for latency-sensitive runs with the benefit of having the most grouped chips within a single region. This regional grouping is important for the deployment of our Kubernetes cluster which cannot span multiple regions. All sweep metadata, model checkpoints, and reward traces are logged in Weights \& Biases. % TODO: cite this (from bib)
Hardware specifications are from the official Google Cloud TPU documentation \parencite{noauthor_tpu_2026,noauthor_tpu_2025-1,noauthor_tpu_2025}.

Training images follow Docker layer caching: dependency layers are separate from the copy of application source so routine code edits do not invalidate the entire build; only changes to the training entrypoint or dependencies force a full rebuild.

TPU capacity is scarce and often preemptible, so we rely primarily on on-demand pods for workloads that must finish without interruption. A typical reservation is a 32-chip pod across four worker VMs; that layout already gives enough parallelism for our sweep driver without adding a separate cluster scheduler. We considered SLURM-style job arrays, but fluctuating provisioning times would have added operational overhead with little benefit for our workload, so orchestration stays in the container and Ray layer described below.

\subsubsection{Interaction Schema}

We extend the basic event tuple $e_{s,k}$ to capture the full observational signal available to the platform. An interaction event is defined as the extended tuple:
\begin{equation}
e_{s,k} = \left( a_{s,k}, \, i_{s,k}, \, t_{s,k}, \, \mu_{s,k}, \, \delta_{s,k} \right)
\end{equation}
where $\mu_{s,k} \in \mathcal{M}$ is a metadata record containing action-specific context (e.g., price observed, filter parameters, element text), and $\delta_{s,k} \in \mathbb{R}_+$ is the dwell time in milliseconds for attention-based actions.

A session $s$ is itself a structured record:
\begin{equation}
s = \left( \text{sid}, \, \text{eid}, \, t_0, \, \phi, \, \mathcal{U}, \, \tau_s \right)
\end{equation}
where $\text{sid}$ is a unique session identifier (UUID), $\text{eid}$ optionally links to an experiment, $t_0$ is the session start timestamp, $\phi \in \{\texttt{hotel}, \texttt{airline}\}$ denotes the platform mode, $\mathcal{U}$ is the user-agent string, and $\tau_s$ is the trajectory of events.

The action space $\mathcal{A}$ is partitioned into four semantic categories based on the behavioral signal each action conveys:

\begin{table}[ht]
\centering
\caption{Action space partition $\mathcal{A} = \mathcal{A}_{\text{nav}} \cup \mathcal{A}_{\text{cart}} \cup \mathcal{A}_{\text{filter}} \cup \mathcal{A}_{\text{dwell}}$ with signal interpretation.}
\label{tab:action_space}
\begin{tabular}{@{}llll@{}}
\toprule
\textbf{Category} & \textbf{Actions} & \textbf{Signal} & $\boldsymbol{\omega}$ \\
\midrule
$\mathcal{A}_{\text{cart}}$ & \texttt{add\_item}, \texttt{remove}, \texttt{checkout}, \texttt{purchase} & Purchase intent & High \\
$\mathcal{A}_{\text{dwell}}$ & \texttt{hover\_title}, \texttt{hover\_paragraph}, \texttt{hover\_link} & Sustained attention & Medium \\
$\mathcal{A}_{\text{nav}}$ & \texttt{page\_view}, \texttt{view\_item}, \texttt{learn\_more} & Discovery & Low \\
$\mathcal{A}_{\text{filter}}$ & \texttt{search}, \texttt{filter\_date}, \texttt{filter\_price}, \texttt{sort} & Preference refinement & Lowest \\
\bottomrule
\end{tabular}
\end{table}

This partition enables the weight function $\omega$ from Eq.~\ref{eq:qhat} to assign category-specific signal strengths, with $\omega(\mathcal{A}_{\text{cart}}) > \omega(\mathcal{A}_{\text{dwell}}) > \omega(\mathcal{A}_{\text{nav}}) > \omega(\mathcal{A}_{\text{filter}})$ reflecting decreasing commitment.
The ordering cart $>$ dwell $>$ nav $>$ filter is a deliberate simplification: we set it from early data by ranking categories by KL divergence between human and agent transition rows and then spacing weights in powers of two. The simulator encodes cart $=4.0$, dwell $=2.0$, nav $=1.0$, filter $=0.5$; unknown actions map by prefix to the nearest category.

The metadata record $\mu$ varies by action type. For product views, $\mu$ contains the observed price $p_{\text{obs}}$ and product attributes. For dwell events, $\mu$ includes the element text and accumulated hover duration. This heterogeneous structure is captured via a schema-on-read approach in our Kafka ingestion pipeline, where events are validated against type-specific schemas before storage.

In addition to behavioral events, the platform logs price observations to a separate Kafka topic. Each price query generates a record $(i, p, \text{sid}, \phi, t)$ associating the product, displayed price, requesting session, platform mode, and timestamp. This dual-stream architecture enables joint analysis of price exposure and behavioral response.


\subsection{Generative Contamination and Distinguishability}

To train a robust pricing learner, we need a simulator that can generate realistic interaction data under controlled contamination. We build this from Phantom data using a two-stage approach.


\subsubsection{Ground-Truth Distinguishability}
Because sessions are collected under controlled experimental conditions where each actor is assigned a known type at the start of the trial, labels $Y_s \in \{H, A\}$ are available as ground truth rather than as the output of a heuristic classifier. We therefore estimate separate transition kernels directly from each labeled partition $\mathcal{D}_H$ and $\mathcal{D}_A$, treating the resulting $\hat{\mathcal{T}}_H$ and $\hat{\mathcal{T}}_A$ as the ground-truth behavioral profiles for each class. We then ask a direct methodological question: are the kernels distinguishable enough to justify downstream pricing control that depends on that distinguishability?

For each session $s$ we fit a session-level transition kernel $\hat{\mathcal{T}}_s$, then average KL divergence to the human centroid ($\Delta_{H,s}$) and to the agent centroid ($\Delta_{A,s}$). The distinguishability score is the gap $\Delta_{H,s} - \Delta_{A,s}$ (negative $\approx$ human-like, positive $\approx$ agent-like). KL is used because it compares full categorical rows, not single features.

Gap scores are skewed and nonnegative, so we test cohort differences with a Mann--Whitney $U$ test \parencite{mann_test_1947} rather than a $t$-test. We report $U$, the two-sided $p$-value, and descriptive statistics for each group.

\begin{definition}[Kullback-Leibler Divergence for Transition Distributions]
Let $P_e$ and $Q_e$ be categorical distributions over destination states following event $e$, derived from human and agent trajectories respectively. The KL divergence between these distributions is:
\begin{equation}
  D_{\mathrm{KL}}(P_e \parallel Q_e) = \sum_{k \in \mathcal{S}_e} P_e(k) \log \frac{P_e(k)}{Q_e(k)}
\end{equation}
where $\mathcal{S}_e$ denotes the set of destination events that follow $e$ in the human trajectories.
\end{definition}
We exploit KL asymmetry so that ``distance from human-like'' is explicit in the score, not only distance from agents.

To obtain this statistic, we aggregate transitions by triggering event $e$ and treat normalized outgoing probabilities as categorical distributions $P_e$ (human) and $Q_e$ (agent). We intersect shared event labels, then accumulate log-ratio contributions over shared destinations. Large contributions, including near-zero $Q_e(k)$ cases, identify transitions where one actor class is difficult to mimic.

With these divergence features we compute a weak agent probability $f(\tau')\in[0,1]$ directly from divergence gaps, which we later use as a weighting and control signal.


\subsubsection{Transition Probability Estimation}
\label{sec:tpe}


For both subsets, we model session dynamics as an MDP and estimate transition kernel $\mathcal{T}$. For each actor type we estimate global kernels $\hat{\mathcal{T}}_A$ and $\hat{\mathcal{T}}_H$, then cluster into behavioral sub-kernels $\hat{\mathcal{T}}_y^i$ to avoid collapsing all behavior into one average profile. Transition probabilities are estimated by maximum likelihood:
\begin{equation}
    \hat{P}(s' \mid s) = \frac{N(s, s')}{\sum_{k \in \mathcal{S}} N(s, k)}
\end{equation}
where $N(s, s')$ is the observed transition count. This allows us to construct a \textit{Contamination Generator} $\mathcal{G}(\alpha)$. Given a clean trajectory dataset, $\mathcal{G}$ injects synthetic agent trajectories sampled from $\hat{\mathcal{T}}_A$ until the effective mixing ratio reaches $\alpha$. The properties of an MDP such as ... should be preserved by the operation described below.

To scale this to catalog-level pricing, we expand the base event transition matrix from $T\times T$ into product-specific transitions using the current demand condition. In practice, we normalize the demand vector across products and use it to weight how much transition mass each product pair receives. Concretely, each cell of the base matrix becomes an $N\times N$ block (for $N$ products), so the transition matrix grows from $T\times T$ to $(T\cdot N)\times(T\cdot N)$. Finally, we add $C$ generic states (homepage, login, checkout terminal states), which gives the full kernel size $(T\cdot N + C)\times(T\cdot N + C)$.
% The validity of this demand-weighted block expansion is still subject to formal proof: it needs to be shown that the resulting matrix retains row-stochasticity (rows summing to 1) and that the weighting by the demand vector preserves the Markov property for the expanded state space. In the engine source this is the target of ongoing validation before the expansion is relied on for behavioral generation at scale.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.8\textwidth]{chapters/mdp_human.pdf}
    \caption{Markov Decision Process visualization illustrating the behavioral transition dynamics for \textbf{human} actions.}
    \label{fig:human_mdp_viz}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.8\textwidth]{chapters/mdp_agent.pdf}
    \caption{Markov Decision Process visualization illustrating the behavioral transition dynamics for \textbf{agent} behavior profiles. The state space and transition probabilities are learned from observed session trajectories to enable generative contamination.}
    \label{fig:agent_mdp_viz}
  \end{figure}


\subsection{Distributionally Robust Reinforcement Learning (DR-RL)}

We formulate pricing as a Stackelberg game: the platform (leader) sets prices $p_t$, and the population (follower) responds through trajectories and demand. A useful intuition is that the platform behaves like a distorted mirror at a 45-degree angle: what it mirrors is population demand into an estimated demand proxy, and that proxy drives revenue.
% TODO: add canonical Stackelberg citation.

Because contamination level $\alpha$ and demand shift are non-stationary online, a simple error term is not enough. We therefore use a Distributionally Robust Optimization objective. Let $\tau'$ be a newly observed trajectory generated by an unknown actor profile (sampled from the behavioral models in Section~\ref{sec:tpe}). We need a demand mapping conditioned on price and trajectory, $\hat{Q}(p,\tau')$. For each $\tau'$, we compute $\hat{\mathcal{T}}'$ and compare it with controlled baselines $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$:

\begin{align}
  \label{eq:delta_H}
  \Delta_H &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_H) \\
  \label{eq:delta_A}
  \Delta_A &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_A)
\end{align}

From these two divergences we define the gap score:
\begin{equation}
g(\tau') = \Delta_H(\tau') - \Delta_A(\tau').
\end{equation}
Positive values indicate trajectories farther from the human centroid and closer to the agent centroid.

We map this gap to a weak agent probability using a temperature-controlled logistic map:
\begin{equation}
f(\tau') = P(Y=A\mid\tau') = \operatorname{softmax}(-\Delta_A,-\Delta_H)_A = \sigma\left(\frac{\Delta_H-\Delta_A}{T}\right), \quad T>0.
\end{equation}
The session-level control signal injected into pricing is then
\begin{equation}
\hat{\alpha}(\tau') = f(\tau').
\end{equation}

This turns distinguishability into an operational control input in the engine. On a per-customer or use-case basis, a similar data collection and fitting process should be repeated to obtain domain-specific behavior kernels.

In implementation we keep an alternating game-history buffer and advance it each epoch with two transitions: the platform publishes a price vector (leader move), then the environment returns trajectory-derived demand (follower move). The codebase names this structure \textit{Limbo}; the appendix lists it under the same label for readers who inspect the repository.

To avoid notation drift, we separate two COI objects used for different purposes:
\begin{align}
\text{COI}_{\text{level}}(\pi) &= \mathbb{E}[P]-\underline{p} \quad \text{(global reporting KPI)} \\
\text{COI}_{\text{leak}}(p,\tau') &= f(\tau')\cdot \text{InfoValue}(p,\tau') \quad \text{(local control penalty)}
\end{align}
where $\text{COI}_{\text{level}}$ is evaluated at policy level and $\text{COI}_{\text{leak}}$ is evaluated per observed quote during training. We connect local leakage to expected global erosion with the operational assumption
\begin{equation}
\mathbb{E}[\Delta\text{COI}_{\text{level},t} \mid \tau_t'] \approx -\kappa\,\text{COI}_{\text{leak}}(p_t,\tau_t') + \xi_t,
\end{equation}
where $\kappa>0$ and $\xi_t$ is residual noise. This keeps theorem-level COI erosion (global, asymptotic) distinct from training-time leakage control (local surrogate).

% Mention discretized action space and the clipping and over shotting in continuous action spaces
% Also talk about catastrophic economics, we add termination on bankrupcy or zero demand so market collaps

\subsubsection{Ambiguity Set Construction}
We define an ambiguity set $\mathcal{U}_\epsilon(\hat{P}_N)$ centered around our empirical reference distribution $\hat{P}_N$ (derived from the generator $\mathcal{G}$). We utilize the Wasserstein distance metric to define the set of plausible demand distributions the agent might face:
\begin{equation}
\mathcal{U}_\epsilon(\hat{P}_N) = \left\{ Q \in \mathcal{P}(\Xi) : W_p(Q, \hat{P}_N) \le \epsilon \right\}
\end{equation}
This set captures all distributions that are statistically close to our observed training data but allows for adversarial shifts.

For the current engine baseline, we use a compact inner-robust approximation by applying ambiguity over contamination in a local interval around nominal contamination $\alpha_0$:
\begin{equation}
\mathcal{A}_{\epsilon_\alpha}(\alpha_0)=\left\{\alpha\in[0,1]:\lvert\alpha-\alpha_0\rvert\le\epsilon_\alpha\right\}
\end{equation}
and we evaluate a small fixed grid in $\mathcal{A}_{\epsilon_\alpha}(\alpha_0)$ per step, selecting the worst-case candidate for the learner.
% A proper Wasserstein ball implementation over the full demand distribution (rather than a scalar alpha interval) would use the POT library (Python Optimal Transport): compute W_2 between the empirical reference P_hat and each candidate Q using ot.emd2() or ot.sliced_wasserstein_distance() for scalability, then accept only candidates within epsilon. In practice the inner minimization becomes: candidates = [G(alpha) for alpha in linspace]; dists = [ot.emd2(p_hat, q, M) for q in candidates]; worst = candidates[argmin(reward[dists <= epsilon])]. The current grid-on-alpha approximation is a computationally cheap substitute; moving to a true Wasserstein ball would tighten the worst-case guarantee but requires specifying the ground metric M over the demand space.


\subsubsection{Environment Setup for Dynamic Pricing}
The complete pricing-demand-trajectory loop is illustrated in Figure~\ref{fig:oracle_flow}. The Oracle maps historical price and demand state to a new price vector, which is exposed to a distribution of demand curves. Each product generates trajectories weighted by behavioral kernels $\tau_Y$, producing a full transition matrix $\tau'$ over sessions. Sampled trajectories $\{\tau_k\}$ are aggregated through the demand proxy function $Q(\cdot)$ to yield the next demand vector, which feeds back into the Oracle.

\begin{figure}[ht]
\centering
{\setlength{\arraycolsep}{4pt}%
\resizebox{0.85\linewidth}{!}{$
\begin{aligned}
&\text{Oracle}(\vec{p}_{t-1},\vec{\hat{q}})\to
\begin{pmatrix}
p_0\\
p_1\\
\cdots\\
p_N
\end{pmatrix}
\underrightarrow{d_i \sim \mathcal{N}_{\vec{p}}}
\begin{pmatrix}d_0\\ d_1\\ \cdots \\ d_N\end{pmatrix}
\underrightarrow{\vec{d}\otimes \tau_Y}
\begin{bmatrix}
0.01 & 0.02 & \cdots & 0.3 \\
0.41 & 0.24 & \cdots & 0.0 \\
\cdots & \cdots & \cdots & \cdots \\
0.51 & 0.09 & \cdots & 0.1 \\
\end{bmatrix}
\\
&\underrightarrow{\tau_k \sim \tau^\prime}
\{\tau_k\}_{k=0}^K \to \hat{Q}(\tau_k)
\to \begin{pmatrix}
\hat{q}_0 \\
\hat{q}_1 \\
\cdots \\
\hat{q}_N \\
\end{pmatrix}
\to \text{Oracle}(\cdot)
\end{aligned}
$}%
}
\caption{Oracle-based pricing loop: historical price and demand state map to a new price vector; each product samples demand curves from $\mathcal{N}_{\vec{p}}$; trajectories are generated via the Kronecker product $\vec{d}\otimes\tau_Y$ into transition matrix $\tau'$; sampled trajectories $\{\tau_k\}$ aggregate through proxy $Q(\cdot)$ to yield updated demand $\vec{\hat{q}}$, closing the feedback loop.}
\label{fig:oracle_flow}
\end{figure}

\subsubsection{The Min-Max Objective}
The robust policy $\pi^*$ is obtained by solving the maximin problem:
\begin{equation}
\label{eq:robust_policy}
\pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}_{\text{leak}}(p,\tau') \right]
\end{equation}
where $R(p, d)$ is the revenue function and $\lambda$ weighs the information-leakage penalty. We note that $p$ is directly dependent on $\pi$, which is the one deciding this as its action.
Looking at the reward structure, note that we are not subtracting COI but rather the leakage of COI, which is as defined below.


In practice, we parameterize this with a session-level leakage term:
\begin{equation}
\text{COI}_{\text{leak}}(p,\tau') = f(\tau')\cdot \text{InfoValue}(p,\tau')
\end{equation}
where $f(\tau')$ is the weak agent probability and $\text{InfoValue}$ is implemented either as a constant query-tax surrogate or as a revelation surrogate $-\log\pi(p\mid\tau')$.

The inner minimization selects the contamination candidate that makes the penalized reward smallest, so the outer policy update faces the worst plausible leakage scenario inside the ambiguity set rather than an average case.

For the baseline engine reported here, we intentionally use the constant query-tax surrogate to keep the mechanism minimal:
\begin{equation}
r_t = R(p_t,\tilde q_t) - \lambda\,f(\tau_t')\,c_{\text{info}}
\end{equation}
with fixed $c_{\text{info}}>0$.


Another possible extension is to adapt the ambiguity radius online, e.g., $\epsilon(\Delta_H)$, so the Wasserstein ball changes with live divergence. We keep this as future work and retain a fixed-radius setup because Wasserstein ambiguity already handles heavy-tail and ``black swan'' behavior without absolute continuity assumptions \parencite{kuhn_wasserstein_2024}.

\subsubsection{Actor Implementation}
In our simulation, the ``follower'' is implemented as a set of Actors. Each Actor is initialized with a class $Y$ and a latent type $\theta \sim \mathcal{D}_Y$, which samples a specific demand curve $d\!\left(p\mid Y,\theta\right)$ from the latent distribution. This formalization ensures that our DR-RL agent does not overfit to a single deterministic demand function but learns a policy robust to the distributional uncertainty defined by $\mathcal{U}_\epsilon$.

Practical implementation of browser agents is a strongly evolving field with near-weekly releases of SOTA architectures. In this thesis implementation we abstract that layer into trajectory generators learned from observed human/agent transition kernels.


As part of reward engineering, we keep a UX factor ($UX\in[0,1]$) as an auxiliary evaluation axis. In code, the UX index is implemented as a volatility penalty on relative price changes, with an extra upward-volatility component weighted by $0.5$ and scaled by $\eta_{\text{ux}}$ and an information-budget term. We also keep a separate supra-competitive penalty tied to persistent price excess above a competitive anchor, which punishes high-price behavior even when volatility is low.
We measure volatility as mean absolute relative price movement, $v_t=\frac{1}{N}\sum_{i=1}^N\bigl|(p_{t,i}-p_{t-1,i})/\max(p_{t-1,i},1)\bigr|$.

\begin{figure}[ht]
  \centering
  \resizebox{0.5\columnwidth}{!}{%
    \input{chapters/balance_figure.tex}
  }
  \caption{Introducing the UX index allows us to better distinguish the kind of impact different methods have and allows us to compare them on this Pareto-efficiency-like scale.}
\end{figure}

We also consider taxation-like overlays for agent traffic under strategy-proof mechanism design (e.g., Vickrey-Clarke-Groves style rules). This remains an extension path and is not part of the main implementation in this thesis.

\subsubsection{Pricing Mechanism Summary}

We now present the complete pricing mechanism that integrates the behavioral distinguishability, contamination estimation, and robust optimization components developed in the preceding sections. Algorithm~\ref{alg:phantom_loop_clean} formalizes the defensive pricing loop as a Stackelberg game where the platform (leader) sets prices and the aggregate demand (follower) responds through observed session trajectories.

\begin{algorithm}[t]
\caption{PHANTOM defensive pricing loop}
\label{alg:phantom_loop_clean}
\DontPrintSemicolon
\SetKwInput{Input}{Input}
\SetKwInput{Output}{Output}

\Input{catalog size \(N\); action scale grid \(\mathcal{S}_{act}\); nominal contamination \(\alpha_0\); ambiguity radius \(\epsilon_\alpha\); candidate count \(K\); horizon \(T\); sessions per step \(M\); behavior kernels \(\bar T_H,\bar T_A\); event weights \(\omega\); COI penalty \(\lambda\)}
\Output{trajectory \(\{(p_t,\hat Q_t,\alpha_t^*)\}_{t=0}^{T-1}\)}
\For{\(t \leftarrow 0\) \KwTo \(T-1\)}{
  observe \(o_t=[\hat Q_{t-1}, p_{t-1}]\)\;
  choose discrete action \(a_t \in \{1,\dots,|\mathcal{S}_{act}|\}\) from policy \(\pi\)\;
  set \(p_t \leftarrow \mathrm{clip}(p_{t-1} \cdot \mathcal{S}_{act}[a_t])\)\;

  define local ambiguity interval \(\mathcal{A}_{\epsilon_\alpha}(\alpha_0)=\{\alpha:\lvert\alpha-\alpha_0\rvert\le\epsilon_\alpha\}\)\;
  \For{\(k \leftarrow 1\) \KwTo \(K\)}{
    set \(\alpha_k \in \mathcal{A}_{\epsilon_\alpha}(\alpha_0)\) from a uniform grid\;
    sample \(M\) sessions from mixture \((1-\alpha_k)\bar T_H + \alpha_k \bar T_A\)\;
    compute demand proxy \(\hat Q_t^{(k)} = \sum_{m=1}^{M}\sum_j \omega(a_{m,j})\,\mathbf{1}[i_{m,j}=i]\)\;
    compute \((\Delta_H^{(k)},\Delta_A^{(k)})\) and session score \(f_t^{(k)}\) from KL divergence\;
    compute candidate reward \(r_t^{(k)} = R(p_t,\hat Q_t^{(k)}) - \lambda\,f_t^{(k)}\,c_{info}\)\;
  }
  choose \(k^* \leftarrow \arg\min_k r_t^{(k)}\), set \(\alpha_t^* \leftarrow \alpha_{k^*}\)\;
  set \(\hat Q_t \leftarrow \hat Q_t^{(k^*)}\), \(r_t \leftarrow r_t^{(k^*)}\)\;
}
\end{algorithm}


The algorithm operates in discrete epochs indexed by $t$. At each epoch, the platform applies one discrete multiplicative price action, the environment samples a batch of sessions, and demand is recomputed from weighted events. Robustness is implemented as an inner minimization over a small local grid of contamination candidates around nominal $\alpha_0$, matching the current engine implementation. The history buffer $\mathcal{L}$ enforces the alternating Stackelberg structure by preserving the temporal sequence of price publications and demand observations.

%The defensive price update in Line 24 implements contamination-aware margin shrinkage: as estimated contamination $\hat{\alpha}_t$ rises, the margin $(p^{\mathrm{ref}} - c)$ is reduced by factor $\kappa\in[0,1]$, with projection $\Pi_{\mathcal{P}}$ ensuring feasibility. In subsequent experiments this heuristic rule is replaced by DR-RL policy $\pi^*$ from Eq.~\ref{eq:robust_policy}.

\subsection{Parallelization Strategy}

To reduce mid-job preemption we standardize on a TPU v4 allocation with 40 chips and five workers. A head process launches Ray \parencite{moritz_ray_2018} and attaches workers across the remaining hosts.

\subsubsection{Computational Cost Analysis of the Simulation Step}
The per-step cost of Algorithm~\ref{alg:phantom_loop_clean} is not uniform across its components. To inform hardware provisioning and to identify where algorithmic improvements are most impactful, we profile the hot path of the engine using Python's \texttt{cProfile} instrumentation over 20 environment steps under two configurations: a baseline with the robustness inner loop disabled ($K=1$, $\epsilon_\alpha=0$) and a standard robust setting ($K=5$, $\epsilon_\alpha=0.2$). Both runs use $M=10$ sessions per market call and $N=3$ products.

The baseline achieves approximately 26 steps per second. Enabling the robustness inner loop with $K=5$ candidates drops throughput to 7.2 steps per second, a $3.6\times$ slowdown that is directly proportional to $K$, consistent with the $O(K)$ scaling of the adversarial alpha selection in the implementation.

\begin{table}[ht]
\centering
\caption{Per-step profiling results (20 steps, $M=10$ sessions, $N=3$ products). Self-time measures time spent inside the function excluding callees; cumulative time includes the full call subtree.}
\label{tab:profile_results}
\begingroup
\small
\setlength{\tabcolsep}{4pt}
\begin{tabular}{@{}lrrrr@{}}
\toprule
\textbf{Function} & \textbf{Calls} & \textbf{Self (ms)} & \textbf{Cum. (ms)} & \textbf{Cum. \%} \\
\midrule
\multicolumn{5}{l}{\textit{Baseline ($K=1$, 0.77\,s total, 26 steps/s)}} \\
\texttt{sample\_behavior\_from\_transitions} & 420 & 131 & 658 & 86\% \\
\texttt{DataFrame.xs} & 4,820 & 30 & 201 & 26\% \\
\texttt{numpy.nan\_to\_num} & 4,904 & 43 & 97 & 13\% \\
\texttt{adjust\_behavior\_to\_condition} & 84 & 3 & 54 & 7\% \\
\midrule
\multicolumn{5}{l}{\textit{Robust ($K=5$, 2.79\,s total, 7.2 steps/s)}} \\
\texttt{sample\_behavior\_from\_transitions} & 1,220 & 519 & 2,447 & 88\% \\
\texttt{DataFrame.xs} & 16,668 & 108 & 729 & 26\% \\
\texttt{numpy.nan\_to\_num} & 16,912 & 164 & 363 & 13\% \\
\texttt{adjust\_behavior\_to\_condition} & 244 & 11 & 108 & 4\% \\
\bottomrule
\end{tabular}
\endgroup
\end{table}

Across both configurations, \texttt{sample\_behavior\_from\_transitions} accounts for 86--88\% of total wall time. The function implements the Markov chain sampler described in Section~\ref{sec:tpe}: at each transition it retrieves the current-state row from the expanded transition \texttt{DataFrame} via label-based indexing, which internally dispatches through the pandas \texttt{xs} and \texttt{fast\_xs} code paths. For $M$ sessions each running up to $L_{\max}=40$ transitions, a single \texttt{market.act()} call issues up to $M \cdot L_{\max}$ individual row lookups. With $K=5$ robustness candidates per outer step this accumulates to $5 \times 10 \times 40 = 2{,}000$ row accesses per outer step, producing the 16k \texttt{xs} invocations observed in Table~\ref{tab:profile_results}.

The \texttt{numpy.nan\_to\_num} calls, accounting for 13\% of self-time, occur once per row lookup to sanitize sampled probability vectors before normalization; their call count therefore tracks the \texttt{xs} count exactly.

\texttt{adjust\_behavior\_to\_condition} expands the base $E \times E$ event transition matrix to a $(E \cdot N) \times (E \cdot N)$ product-specific matrix via a Kronecker product. At $N=3$ this is inexpensive, but the cost scales as $O(E^2 N^2)$, so at the $N=10$ default it becomes a more significant contributor. The result is not cached across the $K$ robustness candidates inside a single outer step, meaning the Kronecker expansion is recomputed $2K$ times per step (once for the human kernel and once for the agent kernel at each candidate $\alpha_k$).

The dominant bottleneck therefore has a clear structural cause: the expanded transition matrix is a string-keyed \texttt{DataFrame}, and pandas object-level indexing carries substantial per-call overhead relative to the arithmetic being performed. Converting the expanded matrix to a \texttt{numpy} array with an accompanying integer state-to-index map, computed once per \texttt{market.act()} call and cached for the duration of the robustness inner loop, eliminates the entire pandas dispatch chain. We leverage this bottleneck identified as an opportunity to squeeze the gap which is left by the computational needs of the pricing learner. We make use of JAX to parallelize on the TPU, and surprisingly we open up a large speedup even on CPU-only compute, improving throughput from 26 to 220 steps/s in the baseline configuration and from 7.2 to 136 steps/s under the full robust inner loop, an 8.5$\times$ and 19$\times$ speedup respectively.