|
|
|
|
@@ -7,7 +7,7 @@ This section details the theoretical and practical framework developed to addres
|
|
|
|
|
|
|
|
|
|
\subsection{Problem Formalization}
|
|
|
|
|
|
|
|
|
|
We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $\theta_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.
|
|
|
|
|
We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $Y_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.
|
|
|
|
|
|
|
|
|
|
Each session produces a trajectory of observable events $\tau_s = (e_{s,1}, \ldots, e_{s,L_s})$. An event $e_{s,k}$ is a tuple defined as:
|
|
|
|
|
\begin{equation}
|
|
|
|
|
@@ -20,7 +20,7 @@ where:
|
|
|
|
|
\item $t_{s,k} \in \mathbb{R}_+$ is the continuous timestamp.
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
The platform does not directly observe the true underlying demand function $d(p)$. Instead, it observes a behavioral proxy $\hat{q}_t$, which is a composite signal derived from the mixture of actor types. We define the demand proxy for product $i$ at epoch $t$ as a weighted aggregation of events:
|
|
|
|
|
The platform does not directly observe the true underlying demand function $d(p)$, with $d(p) \in \mathbb{R}^{+}$. Instead, it observes a behavioral proxy $\hat{q}_t \in \mathbb{R}^{+}$, which is a composite signal derived from the mixture of actor types. We define the demand proxy for product $i$ at epoch $t$ as a weighted aggregation of events:
|
|
|
|
|
\begin{equation}
|
|
|
|
|
\label{eq:qhat}
|
|
|
|
|
\hat{q}_{t,i} = \sum_{s \in \mathcal{S}_t} \sum_{k=1}^{L_s} \omega(a_{s,k}) \cdot \mathbf{1}[i_{s,k} = i]
|
|
|
|
|
@@ -34,19 +34,20 @@ In the current engine implementation, we use the normalized variant of this prox
|
|
|
|
|
with fixed category-level weights (cart, dwell, nav, filter) following the same rank order from Table~\ref{tab:action_space}. This keeps the signal dense and directly usable in the simulator.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Actor Types and Demand Curves}
|
|
|
|
|
We formalize the heterogeneity of actors by introducing a type space $\Theta$. An actor of class $Y_s$ is further parameterized by a type $\theta \sim \mathcal{D}_{Y}$. This type determines the actor's demand response function $d(p; \theta)$, sampled from a distribution of possible demand curves. The total observed demand is a stochastic process governed by the naively defined mixture:
|
|
|
|
|
We formalize the heterogeneity of actors by introducing a type space $\Theta$. An actor of class $Y_s$ is further parameterized by a type $\theta \sim \mathcal{D}_{Y_s}$. This type determines the actor's demand response function $d\!\left(p \mid Y_s,\theta\right)$, sampled from a distribution of possible demand curves. In compact form, demand remains price-dependent as $d(p\mid Y=y)$. The total observed demand is a stochastic process governed by the naively defined mixture:
|
|
|
|
|
\begin{equation}
|
|
|
|
|
\label{eq:mixture_demand}
|
|
|
|
|
Q(p) = (1-\alpha) \cdot \mathbb{E}_{\theta \sim \mathcal{D}_H}[d(p; \theta)] + \alpha \cdot \mathbb{E}_{\theta \sim \mathcal{D}_A}[d(p; \theta)] + \epsilon_t
|
|
|
|
|
Q(p) = (1-\alpha) \cdot \mathbb{E}_{\theta \sim \mathcal{D}_H}[d(p\mid Y=H,\theta)] + \alpha \cdot \mathbb{E}_{\theta \sim \mathcal{D}_A}[d(p\mid Y=A,\theta)] + \epsilon_t
|
|
|
|
|
\end{equation}
|
|
|
|
|
where $\alpha \in [0, 1]$ represents the contamination parameter (proportion of agents) and $\epsilon_t$ is non-stationary market noise.
|
|
|
|
|
We note that composing two non-stationary quantities makes it difficult, in online environments, to attribute a shift in the observed mixture to its source: the shift may come from market noise or from agents specifically.
|
|
|
|
|
Accounting for behavioral and market variation, we also treat $\epsilon_t$ as absorbing serving-path variability from LLM infrastructure (e.g., batch-size-dependent inference behavior under changing load), which appears stochastic at the request level even under greedy decoding \parencite{horace_he_and_thinking_machines_lab_defeating_2025}.
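As a purely illustrative reading of Eq.~\ref{eq:mixture_demand}, suppose (hypothetical values, not taken from our data) a contamination level $\alpha = 0.25$, an expected human demand of $40$ units, and an expected agent demand of $10$ units at some price $p$. Ignoring the noise term $\epsilon_t$, the mixture gives
\[
Q(p) = (1 - 0.25) \cdot 40 + 0.25 \cdot 10 = 30 + 2.5 = 32.5 .
\]
Raising $\alpha$ to $0.5$ with the same class-level expectations would give $Q(p) = 25$, which shows how a change in contamination alone can shift observed demand even when neither class changes its behavior.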
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Cost of Information (COI) Framework}
|
|
|
|
|
|
|
|
|
|
The platform's pricing power comes from information asymmetry: users who express strong interest signals pay more than the base price. We quantify this markup as the \textit{Cost of Information} (COI), which represents the average premium extracted above marginal cost. COI measures the revenue at risk when information asymmetry collapses.
|
|
|
|
|
The platform's pricing power comes from information asymmetry: users who express strong interest signals pay more than the base price. We quantify this markup as the \textit{Cost of Information} (COI), which represents the average premium extracted above marginal cost. We call it a cost because we take the perspective of the user interacting with the platform, who is the one incurring it. COI measures the revenue at risk when information asymmetry collapses.
|
|
|
|
|
A top-level view in the current AI discourse is that sufficiently large productivity gains can induce vertical deflation through cost compression and supply expansion \parencite{rachitsky_marc_2026}. Our contribution is narrower and mechanism-level: even under long-run deflation, platform revenue still depends on short-run information costs to the user. We formalize that rent as the Cost of Information (COI) and study how agentic reconnaissance accelerates its erosion.
|
|
|
|
|
|
|
|
|
|
\begin{definition}[Cost of Information]
|
|
|
|
|
@@ -88,7 +89,7 @@ where $\mathbb{E}[P]$ is the expected price charged by the policy and $\underlin
|
|
|
|
|
\draw[<->, thick, red] (\pmin, 2.0) -- (\mean, 2.0) node[midway, above] {COI};
|
|
|
|
|
|
|
|
|
|
\end{tikzpicture}
|
|
|
|
|
\caption{Illustration of the Cost of Information (COI). The COI is defined as the difference between the expected price $\mathbb{E}[p]$ realized by the policy and the minimum viable price $\underline{p}$.}
|
|
|
|
|
\caption{Illustration of the Cost of Information (COI). The COI is defined as the difference between the expected price $\mathbb{E}[p]$ realized by the policy and the minimum viable price $\underline{p}$. The abstraction we assume is that the reservation price $\underline{p}$ already has some innate margin and would always result in at least a break-even transaction.}
|
|
|
|
|
\label{fig:coi_illustration}
|
|
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
|
|
@@ -159,14 +160,14 @@ The transformation that governs this dynamic pricing is a very simple surge-base
|
|
|
|
|
|
|
|
|
|
\begin{equation}
|
|
|
|
|
\hat{p}_i = \begin{cases}
|
|
|
|
|
p_{0,i} \cdot \lambda_{\text{surge}} & \text{if } \hat{q}_i \geq \theta_{\text{high}} \\
|
|
|
|
|
p_{0,i} \cdot \lambda_{\text{disc}} & \text{if } \hat{q}_i \leq \theta_{\text{low}} \\
|
|
|
|
|
p_{0,i} \cdot \lambda_{\text{surge}} & \text{if } \hat{q}_i \geq \varrho_{\text{high}} \\
|
|
|
|
|
p_{0,i} \cdot \lambda_{\text{disc}} & \text{if } \hat{q}_i \leq \varrho_{\text{low}} \\
|
|
|
|
|
p_{0,i} & \text{otherwise}
|
|
|
|
|
\end{cases}
|
|
|
|
|
\quad \forall i \in \{1, \ldots, N\}
|
|
|
|
|
\end{equation}
|
|
|
|
|
|
|
|
|
|
where $p_0 \in \mathbb{R}^N$ is the base price vector (which is seeded into our database distinctly for each mode of the commerce platform), $\theta_{\text{high}}, \theta_{\text{low}} \in \mathbb{R}$ are demand thresholds defining surge and discount regions, and $\lambda_{\text{surge}}, \lambda_{\text{disc}} \in \mathbb{R}^+$ are multiplicative factors with typical values $\lambda_{\text{surge}} = 1.2$ and $\lambda_{\text{disc}} = 0.9$. This piecewise function enables rapid price adjustment in response to observed demand without requiring complex elasticity estimation or historical calibration, allowing us to expose actors within our experiments to a system with a dynamic component of pricing.
|
|
|
|
|
where $p_0 \in \mathbb{R}^N$ is the base price vector (which is seeded into our database distinctly for each mode of the commerce platform), $\varrho_{\text{high}}, \varrho_{\text{low}} \in \mathbb{R}$ are demand thresholds defining surge and discount regions, and $\lambda_{\text{surge}}, \lambda_{\text{disc}} \in \mathbb{R}^+$ are multiplicative factors with typical values $\lambda_{\text{surge}} = 1.2$ and $\lambda_{\text{disc}} = 0.9$. This piecewise function enables rapid price adjustment in response to observed demand without requiring complex elasticity estimation or historical calibration, allowing us to expose actors within our experiments to a system with a dynamic component of pricing.
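To make the rule concrete, consider a single product with base price $p_{0,i} = 100$ and the typical multipliers above. The thresholds $\varrho_{\text{high}} = 10$ and $\varrho_{\text{low}} = 2$ and the three proxy values used here are hypothetical, chosen only for illustration:
\[
\hat{p}_i =
\begin{cases}
100 \cdot 1.2 = 120 & \text{if } \hat{q}_i = 12 \geq 10 \\
100 \cdot 0.9 = 90 & \text{if } \hat{q}_i = 1 \leq 2 \\
100 & \text{if } \hat{q}_i = 5
\end{cases}
\]
Under these assumed values the published price moves within a band of $90$ to $120$ around the base price, and any proxy value strictly between the two thresholds leaves the price unchanged.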
|
|
|
|
|
|
|
|
|
|
% For our offline experimental setting, we generalize a master value function that can encompass different demand estimation and pricing strategies.
|
|
|
|
|
%
|
|
|
|
|
@@ -183,6 +184,7 @@ We start from a practical constraint: we do not have access to proprietary produ
|
|
|
|
|
The interface is organized as a product catalog where each product belongs to a time-bounded price vector (for example, a daily pricing period). During each period we collect interaction data by instrumenting UI components and predefined action templates that are still customizable. This gives us control without losing realism.
|
|
|
|
|
|
|
|
|
|
Since users act with motivations, we define a pool of tasks (jobs to be done) and assign tasks randomly to participants.
|
|
|
|
|
We discuss limitations and choices made in this experimental design in Section~\ref{sec:limitations_risks}.
|
|
|
|
|
The task pool is stored as a structured table with fields \texttt{id}, \texttt{created\_at}, \texttt{task\_name}, \texttt{task\_description}, and \texttt{task\_def\_of\_done}. We formulate the tasks as compact jobs-to-be-done rather than as strict click scripts, because the goal is to elicit realistic browsing and comparison behavior that captures the nuances of different people. In hotel mode the assigned tasks include \textit{Cheapest Room}, \textit{Cheapest Room w/ View}, \textit{MultiStep Cheapest Room}, \textit{The Digital Nomad (Executive)}, and \textit{The 3-Way Tradeoff (Desk + Quiet + Flexible)}. These prompts deliberately require critical thought in search, inspection of room details, comparison of amenities or images, return visits to the listing page, and a final booking decision, which together create a degree of cognitive load. In airline mode we use \textit{Last-Minute One-Way Flight}, where the actor must urgently travel to LAX from either SEA or JFK within the next 1--3 days, inspect at least a small set of candidate itineraries, and then book a reasonable earliest departure.
|
|
|
|
|
A representative task is to find the cheapest feasible catalog item under explicit constraints while removing strict financial limits so we avoid trivial optimization behavior. Participants are also randomly assigned to one experimental platform mode (hotel or airline). Once assigned, they are dropped into the experiment with an actor ID. Under each experiment ID, we can observe multiple sessions across time and gather long interaction traces for the same actor.
|
|
|
|
|
|
|
|
|
|
@@ -190,7 +192,7 @@ The human data collection involved 13 participants, all of whom provided explici
|
|
|
|
|
|
|
|
|
|
To evaluate quality and realism of the setup, we store both structured event logs and full interaction transcripts. This lets us combine quantitative analysis with transcript-level qualitative findings. The result is an isolated system where we can control the interaction process while preserving realistic behavior.
|
|
|
|
|
|
|
|
|
|
Operationally, goals and experiment runs are tracked in PostgreSQL (goal table, run table, and assignment mapping). This data-acquisition phase is the first half of the methodology and is intentionally a disconnected component that feeds the later contributions. The second half uses collected behavioral traces to distinguish classes $\theta \in \{A,H\}$ with session-conditioned probability estimates, then injects those estimates into the pricing learner.
|
|
|
|
|
Operationally, goals and experiment runs are tracked in PostgreSQL (goal table, run table, and assignment mapping). This data-acquisition phase is the first half of the methodology and is intentionally a disconnected component that feeds the later contributions. The second half uses collected behavioral traces to distinguish classes $Y \in \{A,H\}$ with session-conditioned probability estimates, then injects those estimates into the pricing learner.
|
|
|
|
|
|
|
|
|
|
Our process follows three stages: (1) observe and \textit{vectorize} behavioral interactions, (2) learn distinguishability to characterize human versus agent patterns, and (3) use the learned signal to train a defensive policy in a controlled dynamic-pricing simulator.
|
|
|
|
|
|
|
|
|
|
@@ -263,7 +265,7 @@ v4 & 64 (32 + 32) & us-central2-b & 32 Spot + 32 On-demand \\
|
|
|
|
|
For connections from Madrid, we prioritize the europe-west4 allocation for latency-sensitive runs with the benefit of having the most grouped chips within a single region. This regional grouping is important for the deployment of our Kubernetes cluster which cannot span multiple regions. All sweep metadata, model checkpoints, and reward traces are logged in Weights \& Biases. % TODO: cite this (from bib)
|
|
|
|
|
Hardware specifications are from the official Google Cloud TPU documentation \parencite{noauthor_tpu_2026,noauthor_tpu_2025-1,noauthor_tpu_2025}.
|
|
|
|
|
|
|
|
|
|
Design of training processes: we build docker image with the fact in mind of different caching over layers in order to most speed up docker re-building and such we place the most volatile steps towards the end of the image building. What is means in practice is that any dependency installations are isolated so edits to source code do no trigger rebuilds. Only if we update our entry point of training a sweep, Docker will also rebuild the source-code copy stage.
|
|
|
|
|
Design of training processes: we structure the Docker image around layer caching to speed up rebuilds, placing the most volatile steps towards the end of the image build. In practice, dependency installations are isolated in their own layers, so edits to source code do not trigger dependency rebuilds. Only when we update the entry point for training a sweep does Docker also rebuild the source-code copy stage. % TODO: cite Docker best practices on cache-efficient Dockerfile layering.
|
|
|
|
|
|
|
|
|
|
Due to the preemptible nature of TPU capacity under current demand, we settle for our on-demand allocation as the primary source of compute. The on-demand TPU pod of 32 chips spread across 4 virtual hosts creates a relatively unique parallelization setup. Despite our desire to use a traditional clustering approach, perhaps deploying our sweep agent as SLURM jobs, the lack of predictability in provisioning each compute resource makes this a high-friction layer we do not want to add.
|
|
|
|
|
|
|
|
|
|
@@ -300,8 +302,9 @@ $\mathcal{A}_{\text{filter}}$ & \texttt{search}, \texttt{filter\_date}, \texttt{
|
|
|
|
|
\end{table}
|
|
|
|
|
|
|
|
|
|
This partition enables the weight function $\omega$ from Eq.~\ref{eq:qhat} to assign category-specific signal strengths, with $\omega(\mathcal{A}_{\text{cart}}) > \omega(\mathcal{A}_{\text{dwell}}) > \omega(\mathcal{A}_{\text{nav}}) > \omega(\mathcal{A}_{\text{filter}})$ reflecting decreasing commitment.
|
|
|
|
|
Its important to acknowledge that this creates a very blatant assumption in the weighting, we do motivate the scale of each weight by the per-category observed divergence between each behavioral profile.
|
|
|
|
|
It is important to acknowledge that this weighting embeds a strong assumption; we motivate the scale of each weight by the per-category divergence observed between the behavioral profiles.
|
|
|
|
|
In the simulator baseline this order is encoded with a compact fixed scale: cart $=4.0$, dwell $=2.0$, nav $=1.0$, filter $=0.5$. Unknown actions are mapped by prefix heuristics to the nearest category.
|
|
|
|
|
Concretely, each weight was assigned by observing a small initial dataset and computing the KL divergence for each interaction type; the types with the highest divergence receive a proportionately higher weight in our demand estimation.
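As a small worked instance of Eq.~\ref{eq:qhat} under the fixed scale above, assume a hypothetical epoch in which product $i$ receives one cart event, two dwell events, three navigation events, and one filter event. The unnormalized proxy is then
\[
\hat{q}_{t,i} = 1 \cdot 4.0 + 2 \cdot 2.0 + 3 \cdot 1.0 + 1 \cdot 0.5 = 11.5 .
\]
The single cart event contributes as much as four navigation events, which is the sense in which the ordering encodes decreasing commitment. The event counts are illustrative only, and the engine works with the normalized variant of this proxy rather than the raw sum.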
|
|
|
|
|
|
|
|
|
|
The metadata record $\mu$ varies by action type. For product views, $\mu$ contains the observed price $p_{\text{obs}}$ and product attributes. For dwell events, $\mu$ includes the element text and accumulated hover duration. This heterogeneous structure is captured via a schema-on-read approach in our Kafka ingestion pipeline, where events are validated against type-specific schemas before storage.
|
|
|
|
|
|
|
|
|
|
@@ -316,7 +319,7 @@ To train a robust pricing learner, we need a simulator that can generate realist
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Ground-Truth Distinguishability}
|
|
|
|
|
Because sessions are collected under controlled experimental conditions where each actor is assigned a known type at the start of the trial, labels $\theta_s \in \{H, A\}$ are available as ground truth rather than as the output of a heuristic classifier. We therefore estimate separate transition kernels directly from each labeled partition $\mathcal{D}_H$ and $\mathcal{D}_A$, treating the resulting $\hat{\mathcal{T}}_H$ and $\hat{\mathcal{T}}_A$ as the ground-truth behavioral profiles for each class. We then ask a direct methodological question: are the kernels distinguishable enough to justify downstream pricing control that depends on that distinguishability?
|
|
|
|
|
Because sessions are collected under controlled experimental conditions where each actor is assigned a known type at the start of the trial, labels $Y_s \in \{H, A\}$ are available as ground truth rather than as the output of a heuristic classifier. We therefore estimate separate transition kernels directly from each labeled partition $\mathcal{D}_H$ and $\mathcal{D}_A$, treating the resulting $\hat{\mathcal{T}}_H$ and $\hat{\mathcal{T}}_A$ as the ground-truth behavioral profiles for each class. We then ask a direct methodological question: are the kernels distinguishable enough to justify downstream pricing control that depends on that distinguishability?
|
|
|
|
|
|
|
|
|
|
To answer this, we compute per-session KL divergence scores against both class-level centroids. For each session $s$ in either partition, we fit a session-level event transition kernel $\hat{\mathcal{T}}_s$ from that session's trajectory alone, then compute its average KL divergence to the human centroid ($\Delta_{H,s}$) and to the agent centroid ($\Delta_{A,s}$). The per-session distinguishability score is the gap $\Delta_{H,s} - \Delta_{A,s}$: a negative value indicates proximity to human behavior, a positive value indicates proximity to agent behavior. We use KL divergence for profile analysis because it is defined directly on probability distributions, which is the form our transition kernels take.
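For example, with hypothetical scores $\Delta_{H,s} = 0.8$ and $\Delta_{A,s} = 0.2$, the gap is
\[
\Delta_{H,s} - \Delta_{A,s} = 0.8 - 0.2 = 0.6 > 0,
\]
so the session lies closer to the agent centroid. Swapping the two values gives $-0.6$, which indicates proximity to human behavior.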
|
|
|
|
|
|
|
|
|
|
@@ -329,6 +332,7 @@ Let $P_e$ and $Q_e$ be categorical distributions over destination states followi
|
|
|
|
|
\end{equation}
|
|
|
|
|
where $\mathcal{S}_e$ denotes the set of destination events that follow $e$ in the human trajectories.
|
|
|
|
|
\end{definition}
|
|
|
|
|
We deliberately exploit the asymmetry of KL divergence: taking the human distribution as the reference, the statistic measures divergence from human behavior and therefore signals how dissimilar a profile is from human-like interaction.
|
|
|
|
|
|
|
|
|
|
To obtain this statistic, we aggregate transitions by triggering event $e$ and treat normalized outgoing probabilities as categorical distributions $P_e$ (human) and $Q_e$ (agent). We intersect shared event labels, then accumulate log-ratio contributions over shared destinations. Large contributions, including near-zero $Q_e(k)$ cases, identify transitions where one actor class is difficult to mimic.
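As a numerical illustration of the asymmetry of KL divergence, assume the standard form $D_{\mathrm{KL}}(P_e \,\|\, Q_e) = \sum_{k} P_e(k) \ln \frac{P_e(k)}{Q_e(k)}$ and take hypothetical two-destination distributions $P_e = (0.5, 0.5)$ (human) and $Q_e = (0.9, 0.1)$ (agent); these numbers are not estimated from our data:
\[
\begin{aligned}
D_{\mathrm{KL}}(P_e \,\|\, Q_e) &= 0.5 \ln\frac{0.5}{0.9} + 0.5 \ln\frac{0.5}{0.1} \approx -0.294 + 0.805 = 0.511, \\
D_{\mathrm{KL}}(Q_e \,\|\, P_e) &= 0.9 \ln\frac{0.9}{0.5} + 0.1 \ln\frac{0.1}{0.5} \approx 0.529 - 0.161 = 0.368 .
\end{aligned}
\]
The human-referenced direction is larger because the destination that agents rarely take, with $Q_e(2) = 0.1$, is weighted by the human probability $0.5$. This is the near-zero $Q_e(k)$ effect in miniature.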
|
|
|
|
|
|
|
|
|
|
@@ -366,6 +370,7 @@ To scale this to catalog-level pricing, we expand the base event transition matr
|
|
|
|
|
\subsection{Distributionally Robust Reinforcement Learning (DR-RL)}
|
|
|
|
|
|
|
|
|
|
We formulate pricing as a Stackelberg game: the platform (leader) sets prices $p_t$, and the population (follower) responds through trajectories and demand. A useful intuition is that the platform behaves like a distorted mirror at a 45-degree angle: what it mirrors is population demand into an estimated demand proxy, and that proxy drives revenue.
|
|
|
|
|
% TODO: add canonical Stackelberg citation.
|
|
|
|
|
|
|
|
|
|
Because contamination level $\alpha$ and demand shift are non-stationary online, a simple error term is not enough. We therefore use a Distributionally Robust Optimization objective. Let $\tau'$ be a newly observed trajectory generated by an unknown actor profile (sampled from the behavioral models in Section~\ref{sec:tpe}). We need a demand mapping conditioned on price and trajectory, $\hat{Q}(p,\tau')$. For each $\tau'$, we compute $\hat{\mathcal{T}}'$ and compare it with controlled baselines $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$:
|
|
|
|
|
|
|
|
|
|
@@ -425,7 +430,7 @@ and we evaluate a small fixed grid in $\mathcal{A}_{\epsilon_\alpha}(\alpha_0)$
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Environment Setup for Dynamic Pricing}
|
|
|
|
|
The complete pricing-demand-trajectory loop is illustrated in Figure~\ref{fig:oracle_flow}. The Oracle maps historical price and demand state to a new price vector, which is exposed to a distribution of demand curves. Each product generates trajectories weighted by behavioral kernels $\tau_\theta$, producing a full transition matrix $\tau'$ over sessions. Sampled trajectories $\{\tau_k\}$ are aggregated through the demand proxy function $Q(\cdot)$ to yield the next demand vector, which feeds back into the Oracle.
|
|
|
|
|
The complete pricing-demand-trajectory loop is illustrated in Figure~\ref{fig:oracle_flow}. The Oracle maps historical price and demand state to a new price vector, which is exposed to a distribution of demand curves. Each product generates trajectories weighted by behavioral kernels $\tau_Y$, producing a full transition matrix $\tau'$ over sessions. Sampled trajectories $\{\tau_k\}$ are aggregated through the demand proxy function $Q(\cdot)$ to yield the next demand vector, which feeds back into the Oracle.
|
|
|
|
|
|
|
|
|
|
\begin{figure}[ht]
|
|
|
|
|
\centering
|
|
|
|
|
@@ -441,7 +446,7 @@ p_N
|
|
|
|
|
\end{pmatrix}
|
|
|
|
|
\underrightarrow{d_i \sim \mathcal{N}_{\vec{p}}}
|
|
|
|
|
\begin{pmatrix}d_0\\ d_1\\ \cdots \\ d_N\end{pmatrix}
|
|
|
|
|
\underrightarrow{\vec{d}\otimes \tau_\theta}
|
|
|
|
|
\underrightarrow{\vec{d}\otimes \tau_Y}
|
|
|
|
|
\begin{bmatrix}
|
|
|
|
|
0.01 & 0.02 & \cdots & 0.3 \\
|
|
|
|
|
0.41 & 0.24 & \cdots & 0.0 \\
|
|
|
|
|
@@ -461,7 +466,7 @@ p_N
|
|
|
|
|
\end{aligned}
|
|
|
|
|
$}%
|
|
|
|
|
}
|
|
|
|
|
\caption{Oracle-based pricing loop: historical price and demand state map to a new price vector; each product samples demand curves from $\mathcal{N}_{\vec{p}}$; trajectories are generated via the Kronecker product $\vec{d}\otimes\tau_\theta$ into transition matrix $\tau'$; sampled trajectories $\{\tau_k\}$ aggregate through proxy $Q(\cdot)$ to yield updated demand $\vec{\hat{q}}$, closing the feedback loop.}
|
|
|
|
|
\caption{Oracle-based pricing loop: historical price and demand state map to a new price vector; each product samples demand curves from $\mathcal{N}_{\vec{p}}$; trajectories are generated via the Kronecker product $\vec{d}\otimes\tau_Y$ into transition matrix $\tau'$; sampled trajectories $\{\tau_k\}$ aggregate through proxy $Q(\cdot)$ to yield updated demand $\vec{\hat{q}}$, closing the feedback loop.}
|
|
|
|
|
\label{fig:oracle_flow}
|
|
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
|
|
@@ -471,7 +476,8 @@ The robust policy $\pi^*$ is obtained by solving the maximin problem:
|
|
|
|
|
\label{eq:robust_policy}
|
|
|
|
|
\pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}_{\text{leak}}(p,\tau') \right]
|
|
|
|
|
\end{equation}
|
|
|
|
|
where $R(p, d)$ is the revenue function and $\lambda$ weighs the information-leakage penalty. We note that $p$ is directly dependent on $\pi$ which is the one deicing that as its action.
|
|
|
|
|
where $R(p, d)$ is the revenue function and $\lambda$ weighs the information-leakage penalty. We note that $p$ depends directly on $\pi$, which selects the price as its action.
|
|
|
|
|
Note that the reward does not subtract COI itself but rather the leakage of COI, as defined below.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
In practice, we parameterize this with a session-level leakage term:
|
|
|
|
|
@@ -492,12 +498,12 @@ with fixed $c_{\text{info}}>0$.
|
|
|
|
|
Another possible extension is to adapt the ambiguity radius online, e.g., $\epsilon(\Delta_H)$, so the Wasserstein ball changes with live divergence. We keep this as future work and retain a fixed-radius setup because Wasserstein ambiguity already handles heavy-tail and ``black swan'' behavior without absolute continuity assumptions \parencite{kuhn_wasserstein_2024}.
|
|
|
|
|
|
|
|
|
|
\subsubsection{Actor Implementation}
|
|
|
|
|
In our simulation, the ``follower'' is implemented as a set of Actors. Each Actor is initialized with a type $\theta$ which samples a specific demand curve $d(p; \theta)$ from the latent distribution. This formalization ensures that our DR-RL agent does not overfit to a single deterministic demand function but learns a policy robust to the distributional uncertainty defined by $\mathcal{U}_\epsilon$.
|
|
|
|
|
In our simulation, the ``follower'' is implemented as a set of Actors. Each Actor is initialized with a class $Y$ and a latent type $\theta \sim \mathcal{D}_Y$, which samples a specific demand curve $d\!\left(p\mid Y,\theta\right)$ from the latent distribution. This formalization ensures that our DR-RL agent does not overfit to a single deterministic demand function but learns a policy robust to the distributional uncertainty defined by $\mathcal{U}_\epsilon$.
|
|
|
|
|
|
|
|
|
|
Practical implementation of browser agents is a rapidly evolving field with near-weekly releases of state-of-the-art architectures. In this thesis we abstract that layer into trajectory generators learned from observed human/agent transition kernels.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
As part of reward engineering, we keep a UX factor ($UX\in[0,1]$) as an auxiliary evaluation axis. In the current baseline it is not injected into the core reward; it is tracked separately to compare policy trade-offs.
|
|
|
|
|
As part of reward engineering, we keep a UX factor ($UX\in[0,1]$) as an auxiliary evaluation axis. In code, the UX index is implemented as a volatility penalty on relative price changes, with an extra upward-volatility component weighted by $0.5$ and scaled by $\eta_{\text{ux}}$ and an information-budget term. We also keep a separate supra-competitive penalty tied to persistent price excess above a competitive anchor, which punishes high-price behavior even when volatility is low.
|
|
|
|
|
|
|
|
|
|
\begin{figure}[ht]
|
|
|
|
|
\centering
|
|
|
|
|
@@ -541,7 +547,7 @@ We now present the complete pricing mechanism that integrates the behavioral dis
|
|
|
|
|
\end{algorithm}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The algorithm operates in discrete epochs indexed by $t$. At each epoch, the platform applies one discrete multiplicative price action, the environment samples a batch of sessions, and demand is recomputed from weighted events. Robustness is implemented as an inner minimization over a small local grid of contamination candidates around nominal $\alpha_0$, matching the current engine implementation. The history buffer $\mathcal{L}$ (``Limbo'' in our implementation) enforces the alternating Stackelberg structure by preserving the temporal sequence of price publications and demand observations.
|
|
|
|
|
The algorithm operates in discrete epochs indexed by $t$. At each epoch, the platform applies one discrete multiplicative price action, the environment samples a batch of sessions, and demand is recomputed from weighted events. Robustness is implemented as an inner minimization over a small local grid of contamination candidates around nominal $\alpha_0$, matching the current engine implementation. The history buffer $\mathcal{L}$ (which we call the ``Limbo'' stack in our implementation) enforces the alternating Stackelberg structure by preserving the temporal sequence of price publications and demand observations.
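The inner minimization can be illustrated with a small hypothetical case. Suppose two candidate price actions $a_1$ and $a_2$ are evaluated on the grid $\{0.15, 0.20, 0.25\}$ around $\alpha_0 = 0.20$; the grid spacing and the reward values below are assumptions made for illustration, not engine outputs:
\[
\begin{array}{c|ccc|c}
 & \alpha = 0.15 & \alpha = 0.20 & \alpha = 0.25 & \min_{\alpha} \\ \hline
a_1 & 10 & 9 & 5 & 5 \\
a_2 & 9 & 8.5 & 7 & 7
\end{array}
\]
The robust choice is $a_2$, whose worst case is $7$, even though $a_1$ earns more at the nominal contamination level ($9$ versus $8.5$). This is the trade-off the maximin objective is designed to make: a small loss at $\alpha_0$ in exchange for protection against a higher contamination level.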
|
|
|
|
|
|
|
|
|
|
%The defensive price update in Line 24 implements contamination-aware margin shrinkage: as estimated contamination $\hat{\alpha}_t$ rises, the margin $(p^{\mathrm{ref}} - c)$ is reduced by factor $\kappa\in[0,1]$, with projection $\Pi_{\mathcal{P}}$ ensuring feasibility. In subsequent experiments this heuristic rule is replaced by DR-RL policy $\pi^*$ from Eq.~\ref{eq:robust_policy}.
|
|
|
|
|
|
|
|
|
|
|