chore: updating methodology

2026-04-09 16:55:46 +02:00
parent e694d38bce
commit c0c375548c
3 changed files with 36 additions and 27 deletions


@@ -3,7 +3,7 @@
% Extra notes and clarifications: we observed some humans and get their transition probabilities between event types % Extra notes and clarifications: we observed some humans and get their transition probabilities between event types
% We modify behavioral profiles of transition matrices with price elasticity matrices generated from valuations sampled from a distribution. % We modify behavioral profiles of transition matrices with price elasticity matrices generated from valuations sampled from a distribution.
This section details the theoretical and practical framework developed to address dynamic pricing under the influence of non-human actors. We begin by formalizing the problem environment and the nature of the actors. We then derive the \textit{Cost of Information} (COI) theorem, proving the erosion of pricing power in the limit of agent saturation. Following this, we outline our generative contamination strategy using GOFAI-driven distinguishability and transition probability learning. Finally, we formulate the robust control problem as a Stackelberg game solved via Distributionally Robust Reinforcement Learning (DR-RL) with constructed ambiguity sets. This section presents the theoretical and practical framework developed to address dynamic pricing under the influence of non-human actors. We begin by formalizing the problem environment and the nature of the actors. We then derive the \textit{Cost of Information} (COI) theorem, proving the erosion of pricing power in the limit of agent saturation. Following this, we outline our generative contamination strategy using GOFAI-driven distinguishability and transition probability learning. Finally, we formulate the robust control problem as a Stackelberg game solved via Distributionally Robust Reinforcement Learning (DR-RL) with constructed ambiguity sets.
\subsection{Problem Formalization} \subsection{Problem Formalization}
@@ -25,7 +25,7 @@ The platform does not directly observe the true underlying demand function $d(p)
\label{eq:qhat} \label{eq:qhat}
\hat{q}_{t,i} = \sum_{s \in \mathcal{S}_t} \sum_{k=1}^{L_s} \omega(a_{s,k}) \cdot \mathbf{1}[i_{s,k} = i] \hat{q}_{t,i} = \sum_{s \in \mathcal{S}_t} \sum_{k=1}^{L_s} \omega(a_{s,k}) \cdot \mathbf{1}[i_{s,k} = i]
\end{equation} \end{equation}
where $\omega: \mathcal{A} \to \mathbb{R}_+$ assigns weights to actions based on their signal strength regarding willingness to pay. where $\omega: \mathcal{A} \to \mathbb{R}^+$ assigns weights to actions based on their signal strength regarding willingness to pay.
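To make the proxy in Eq.~\ref{eq:qhat} concrete, the following is a minimal Python sketch of the double sum over sessions and events. The action names and the session data are hypothetical and chosen only for illustration; the weight values mirror the category weights that the simulator is described as using later in this section, applied here directly to individual actions as a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical action names; weights follow the cart/dwell/nav/filter values
# quoted for the simulator, assigned per action purely for illustration.
OMEGA = {"add_to_cart": 4.0, "dwell": 2.0, "view": 1.0, "filter": 0.5}


def demand_proxy(sessions, omega):
    """Compute q_hat[i] = sum_s sum_k omega(a_{s,k}) * 1[i_{s,k} == i]."""
    q_hat = defaultdict(float)
    for events in sessions:          # outer sum over sessions s in S_t
        for action, item in events:  # inner sum over events k = 1..L_s
            q_hat[item] += omega[action]
    return dict(q_hat)


# Two toy sessions, each a list of (action, item) pairs.
sessions = [
    [("view", "room_a"), ("dwell", "room_a"), ("add_to_cart", "room_a")],
    [("filter", "room_b"), ("view", "room_b"), ("view", "room_a")],
]
q_hat = demand_proxy(sessions, OMEGA)
```

The indicator $\mathbf{1}[i_{s,k} = i]$ is realized implicitly by keying the accumulator on the item, so each event contributes its weight only to the item it touches.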
In the current engine implementation, we use the normalized variant of this proxy for each step: In the current engine implementation, we use the normalized variant of this proxy for each step:
\begin{equation} \begin{equation}
@@ -48,7 +48,7 @@ Accounting for behavioral and market variation, we also treat $\epsilon_t$ as ab
\subsection{Cost of Information (COI) Framework} \subsection{Cost of Information (COI) Framework}
The platform's pricing power comes from information asymmetry: users who express strong interest signals pay more than the base price. We quantify this markup as the \textit{Cost of Information} (COI), which represents the average premium extracted above marginal cost. The intuition behind this being a cost comes from the perspective of the user who is interacting with the platform, where the user is the one incurring that ``cost.'' COI measures the revenue at risk when information asymmetry collapses. The platform's pricing power comes from information asymmetry: users who express strong interest signals pay more than the base price. We quantify this markup as the \textit{Cost of Information} (COI), which represents the average premium extracted above marginal cost. The intuition behind this being a cost comes from the perspective of the user who is interacting with the platform, where the user is the one incurring that ``cost.'' COI measures the revenue at risk when information asymmetry collapses.
A top-level view in the current AI discourse is that sufficiently large productivity gains can induce vertical deflation through cost compression and supply expansion \parencite{rachitsky_marc_2026}. Our contribution is narrower and mechanism-level: even under long-run deflation, platform revenue still depends on short-run information costs to the user. We formalize that rent as the Cost of Information (COI) and study how agentic reconnaissance accelerates its erosion. A top-level view in the current AI discourse is that sufficiently large productivity gains can induce vertical deflation (a price decrease along the vertical supply chain) through cost compression and supply expansion \parencite{rachitsky_marc_2026}. Our contribution is narrower and mechanism-level: even under long-run deflation, platform revenue still depends on short-run information costs to the user. We formalize that rent as the Cost of Information (COI) and study how agentic reconnaissance accelerates its erosion.
\begin{definition}[Cost of Information] \begin{definition}[Cost of Information]
Let $\pi(\tau)$ be a pricing policy mapping interaction histories to prices. The COI is defined as: Let $\pi(\tau)$ be a pricing policy mapping interaction histories to prices. The COI is defined as:
@@ -137,9 +137,9 @@ This result implies that standard pricing policies $\pi$ cannot extract the same
\subsection{System Architecture: Hybrid Kappa-Lambda Architecture} \subsection{System Architecture: Hybrid Kappa-Lambda Architecture}
In order for our research to have grounding in interactions we built a robust e-commerce web-platform. We initially conducted a survey of the leading platforms of airlines and hotel booking sites to identify the specific interface patterns that effectively manage complex travel data. Our analysis revealed a clear industry standard: while both sectors rely on tabbed service selection and left-sidebar filtering to streamline navigation, they diverge in result presentation: airlines utilize visual date-price bars and multi-step wizards to optimize for logistical transparency, whereas hotel platforms leverage image-led cards and scarcity triggers to drive emotional engagement and urgency. Our web framework defines a highly agnostic boilerplate which can be seeded with any data-modality with an easy-to-tailor pattern, which we leverage to define a \texttt{hotel} and \texttt{airline} mode. Both modes are then individually deployed via an environment-level argument which adjusts the proxy routing with custom middleware in Next.js to render only the desired mode. The purpose of this was to create a baseline adaptable to any use-case or desired commercial application. To ground our research in real interactions, we built a robust e-commerce web platform. In this framing, Kappa refers to stream processing and Lambda to batch-oriented processing, following big-data terminology. We initially surveyed the leading airline and hotel booking platforms to identify the interface patterns that effectively manage complex travel data. To better understand the landscape, we collected design artifacts across various airlines and hotels.
While both sectors rely on tabbed service selection and left-sidebar filtering to streamline navigation, they diverge in result presentation: airlines utilize visual date-price bars and multi-step wizards to optimize for logistical transparency, whereas hotel platforms leverage image-led cards and scarcity triggers to drive emotional engagement and urgency. Our web framework defines a highly agnostic boilerplate which can be seeded with any data-modality with an easy-to-tailor pattern, which we leverage to define a \texttt{hotel} and \texttt{airline} mode. Both modes are then individually deployed via an environment-level argument which adjusts the proxy routing with custom middleware in Next.js to render only the desired mode. The purpose of this was to create a baseline adaptable to any use-case or desired commercial application.
The architecture begins with deployed web applications posting interaction data to a backend that stores each record in Apache Kafka. Kafka acts as the reservoir linking sessions to experiments. Behavioral events and, separately, price observations from the pricing-provider microservice (invoked by the frontend) land in Kafka topics. A scheduled Airflow pipeline (with manual triggers) consumes the stream; the final pricing stage writes vectors to Redis for low-latency reads by the provider and display in the client. The pattern is deliberately standard---Kafka for durability and replay, Redis for serving---so the same skeleton applies across e-commerce settings. We invested in this stack to keep runs reproducible and to limit extraneous variance. The architecture begins with deployed web applications posting interaction data to a backend that stores each record in Apache Kafka. Kafka acts as the reservoir linking sessions to experiments. Behavioral events and, separately, price observations from the pricing-provider microservice (invoked by the frontend) land in Kafka topics. A scheduled Airflow pipeline (with manual triggers) consumes the stream, and the final pricing stage writes vectors to Redis for low-latency reads by the provider and display in the client. This design pattern generalizes to other commercial settings: Kafka provides durability and replay, and Redis provides serving and quick queries. We invested in this stack to keep runs reproducible and to limit extraneous variance, so that the same skeleton applies across e-commerce settings.
\paragraph{Public Web Artifact} We transition the Kappa like architecture of the data collection to a Lambda architecture for actual learning in a surrogate environment. This allows us to move faster on data which is provided and helps us create a feedback loop for production deployment. To support further research in this intersection of fields we release P4P \footnote{\url{https://github.com/velocitatem/p4p}} as a public repository providing the interaction layer of the PHANTOM framework. This provides a configurable storefront which can be tailored to any commercial setting with a standardized session-level event tracking. We document the API adapters and expected schemas for pricing providers and log ingestion services. The repository is intended for controlled experimentation and method replication rather than production commerce deployment. \paragraph{Public Web Artifact} We transition the Kappa like architecture of the data collection to a Lambda architecture for actual learning in a surrogate environment. This allows us to move faster on data which is provided and helps us create a feedback loop for production deployment. To support further research in this intersection of fields we release P4P \footnote{\url{https://github.com/velocitatem/p4p}} as a public repository providing the interaction layer of the PHANTOM framework. This provides a configurable storefront which can be tailored to any commercial setting with a standardized session-level event tracking. We document the API adapters and expected schemas for pricing providers and log ingestion services. The repository is intended for controlled experimentation and method replication rather than production commerce deployment.
@@ -167,7 +167,7 @@ The transformation that governs this dynamic pricing is a very simple surge-base
\quad \forall i \in \{1, \ldots, N\} \quad \forall i \in \{1, \ldots, N\}
\end{equation} \end{equation}
where $p_0 \in \mathbb{R}^N$ is the base price vector (which is seeded into our database distinctly for each mode of the commerce platform), $\varrho_{\text{high}}, \varrho_{\text{low}} \in \mathbb{R}$ are demand thresholds defining surge and discount regions, and $\lambda_{\text{surge}}, \lambda_{\text{disc}} \in \mathbb{R}^+$ are multiplicative factors with typical values $\lambda_{\text{surge}} = 1.2$ and $\lambda_{\text{disc}} = 0.9$. This piecewise function enables rapid price adjustment in response to observed demand without requiring complex elasticity estimation or historical calibration, allowing us to expose actors within our experiments to a system with a dynamic component of pricing. where $p_0 \in \mathbb{R}^N$ is the base price vector (which is seeded into our database distinctly for each mode of the commerce platform), $\varrho_{\text{high}}, \varrho_{\text{low}} \in \mathbb{R}$ are demand thresholds defining surge and discount regions, and $\lambda_{\text{surge}}, \lambda_{\text{disc}} \in \mathbb{R}^+$ are multiplicative factors with typical values $\lambda_{\text{surge}} = 1.2$ and $\lambda_{\text{disc}} = 0.9$. This piecewise function enables rapid price adjustment in response to observed demand without requiring complex elasticity estimation or historical calibration, allowing the actors within our experiments to interact with a system that has a dynamic pricing component.
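As an illustration of the surge-based rule described by these parameters, the sketch below applies a three-region piecewise multiplier per item. The exact branch conditions are not visible in this excerpt, so the strict inequalities, the function name \texttt{surge\_price}, and the example thresholds and demand values are all assumptions; only the default multipliers $1.2$ and $0.9$ come from the text.

```python
def surge_price(p0, q_hat, rho_high, rho_low, lam_surge=1.2, lam_disc=0.9):
    """Assumed piecewise rule: surge above rho_high, discount below rho_low,
    base price otherwise. Applied independently to each item i."""
    prices = []
    for base, demand in zip(p0, q_hat):
        if demand > rho_high:      # surge region (assumed strict inequality)
            prices.append(base * lam_surge)
        elif demand < rho_low:     # discount region (assumed strict inequality)
            prices.append(base * lam_disc)
        else:                      # neutral region keeps the base price
            prices.append(base)
    return prices


# Hypothetical base prices and normalized demand proxies for three items.
prices = surge_price(
    p0=[100.0, 200.0, 50.0],
    q_hat=[0.9, 0.5, 0.1],
    rho_high=0.8,
    rho_low=0.2,
)
```

In this toy call the first item falls in the surge region, the second stays at its base price, and the third is discounted.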
% For our offline experimental setting, we generalize a master value function that can encompass different demand estimation and pricing strategies. % For our offline experimental setting, we generalize a master value function that can encompass different demand estimation and pricing strategies.
% %
@@ -179,14 +179,16 @@ where $p_0 \in \mathbb{R}^N$ is the base price vector (which is seeded into our
\subsection{Experimental Design} \subsection{Experimental Design}
We start from a practical constraint: we do not have access to proprietary production data. Because of that, we design our own fictional platform that still represents how commercial platforms work in the real world. The design comes from a survey of hotel and airline websites, where we extracted common interface components and used them as a high-level template for dynamic pricing environments. % We start from a practical constraint: we do not have access to proprietary production data. Because of that, we design our own fictional platform that still represents how commercial platforms work in the real world. The design comes from a survey of hotel and airline websites, where we extracted common interface components and used them as a high-level template for dynamic pricing environments.
For the platform we develop for our experiments, we use the surveyed websites to create an \textit{average} representation of the most commonly expected interfaces, extracting common components and designing a high-level template for dynamic pricing environments.
The interface is organized as a product catalog where each product belongs to a time-bounded price vector (for example, a daily pricing period). During each period we collect interaction data by instrumenting UI components and predefined action templates that are still customizable. That yields controlled variation while keeping the interface credible.
The interface is organized as a product catalog where each product belongs to a time-bounded price vector (for example, a daily pricing period). During each period we collect interaction data by instrumenting UI components and predefined action templates that are still customizable. That yields controlled variation while holding the interface itself constant.
Since users act with motivations, we define a pool of tasks (jobs to be done) and assign tasks randomly to participants. Since users act with motivations, we define a pool of tasks (jobs to be done) and assign tasks randomly to participants.
We discuss limitations and choices made in this experimental design in Section~\ref{sec:limitations_risks}. We discuss limitations and choices made in this experimental design in Section~\ref{sec:limitations_risks}.
The task pool is stored as a structured table with fields \texttt{id}, \texttt{created\_at}, \texttt{task\_name}, \texttt{task\_description}, and \texttt{task\_def\_of\_done}. We formulate the tasks as compact jobs-to-be-done rather than as strict click scripts, because the target is to elicit realistic browsing and comparison behavior which can capture nuance of different people. In hotel mode the assigned tasks include \textit{Cheapest Room}, \textit{Cheapest Room w/ View}, \textit{MultiStep Cheapest Room}, \textit{The Digital Nomad (Executive)}, and \textit{The 3-Way Tradeoff (Desk + Quiet + Flexible)}. These prompts deliberately require critical thought in search, inspection of room details, comparison of amenities or images, return visits to the listing page, and a final booking decision which create a degree of cognitive load. In airline mode we use \textit{Last-Minute One-Way Flight}, where the actor must urgently travel to LAX from either SEA or JFK within the next 1--3 days, inspect at least a small set of candidate itineraries, and then book a reasonable earliest departure. The task pool is stored as a structured table with fields \texttt{id}, \texttt{created\_at}, \texttt{task\_name}, \texttt{task\_description}, and \texttt{task\_def\_of\_done}. We formulate the tasks as compact jobs-to-be-done rather than as rigid instructions, because the target is to elicit realistic browsing and comparison behavior which can capture nuance of different people. In hotel mode the assigned tasks include \textit{Cheapest Room}, \textit{Cheapest Room w/ View}, \textit{MultiStep Cheapest Room}, \textit{The Digital Nomad (Executive)}, and \textit{The 3-Way Tradeoff (Desk + Quiet + Flexible)}. These prompts deliberately require critical thought in search, inspection of room details, comparison of amenities or images, return visits to the listing page, and a final booking decision which create a degree of cognitive load. 
In airline mode we use \textit{Last-Minute One-Way Flight} or \textit{Family/Work Emergency Travel}, where the actor must urgently travel to LAX from either SEA or JFK within the next 1 to 3 days, inspect at least a small set of candidate itineraries, and then book a reasonable earliest departure.
A representative task is to find the cheapest feasible catalog item under explicit constraints while removing strict financial limits so we avoid trivial optimization behavior. Participants are also randomly assigned to one experimental platform mode (hotel or airline). Once assigned, they are dropped into the experiment with an actor ID. Under each experiment ID, we can observe multiple sessions across time and gather long interaction traces for the same actor. A representative task is to find the cheapest feasible catalog item under explicit constraints while removing strict financial limits so we avoid trivial optimization behavior. Participants are also randomly assigned to one experimental platform mode (hotel or airline). Once assigned, they are dropped into the experiment with an actor ID. Under each experiment ID, we can observe multiple sessions across time and gather long interaction traces for the same actor. This mitigates our small participant sample by collecting extensive interaction data from each individual.
The human data collection involved 13 participants, all of whom provided explicit informed consent prior to their session. Participants had an average age of 21 years and were recruited from a university population. Alongside the 13 human sessions we ran 16 agent sessions of equivalent task scope, yielding 29 labeled trajectories in total (45\% human, 55\% agent). Each participant was assigned a single platform mode and a single task drawn from the pool, and completed the session independently without guidance on navigation or pricing strategy. The human data collection involved 13 participants, all of whom provided explicit informed consent prior to their session. Participants had an average age of 21 years and were recruited from a university population. Alongside the 13 human sessions we ran 16 agent sessions of equivalent task scope, yielding 29 labeled trajectories in total (45\% human, 55\% agent). Each participant was assigned a single platform mode and a single task drawn from the pool, and completed the session independently without guidance on navigation or pricing strategy.
@@ -218,15 +220,15 @@ Our web platform (developed in similar spirit to RecSim \parencite{ie_recsim_201
To speak to realism, user interviews reported that the platform architecture mirrored standard booking interfaces and reduced the cognitive load required to learn the system. One participant described the flow as ``intuitive'' and close to a ``normal'' transaction, suggesting observed behavior was primarily driven by pricing treatment rather than interface novelty. To speak to realism, user interviews reported that the platform architecture mirrored standard booking interfaces and reduced the cognitive load required to learn the system. One participant described the flow as ``intuitive'' and close to a ``normal'' transaction, suggesting observed behavior was primarily driven by pricing treatment rather than interface novelty.
The dynamic pricing mechanism elicited immediate behavioral adjustments. Participants were sensitive to price volatility: sudden boosts triggered urgency and faster booking attempts, while large listing-to-final discrepancies triggered deeper comparison behavior. The responses match what one expects from live commerce: sharp reactions to volatility and to list--checkout gaps, which supports external validity despite the lab setting. The dynamic pricing mechanism elicited immediate behavioral adjustments. Participants were sensitive to price volatility: sudden boosts triggered urgency and faster booking attempts, while large listing-to-final discrepancies triggered deeper comparison behavior. The responses match what one expects from live e-commerce experiences, such as reactions to volatility, which supports external validity despite the lab setting.
\subsubsection{Design of Training Sweeps} \subsubsection{Design of Training Sweeps}
The simulator has multiple configurable factors. Training runs are driven by Weights \& Biases sweep definitions versioned with the codebase, mixing random and grid schedules rather than a single full factorial. For the contamination ratio $\alpha$, exploratory sweeps draw $\alpha$ uniformly on $[0.1,0.6]$; some sweeps use the narrower interval $[0.1,0.5]$. Grid sweeps fix explicit level sets, for example $\alpha\in\{0.1,0.2,0.3,0.4,0.6,0.8\}$ (six levels, including $0.8$ beyond the typical exploratory upper endpoint) or five levels $\{0.1,0.2,0.3,0.4,0.6\}$. Auxiliary schedules also include $\alpha=0$ alongside positive values. Robustness radius $\epsilon_\alpha$, COI penalty $\lambda_\text{coi}$, RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}), and the discretization of the price action grid vary by sweep. Broad random search may use uniform $\epsilon_\alpha\in[0,0.3]$ and $\lambda_\text{coi}\in[0.05,0.6]$; tighter grids may fix $\epsilon_\alpha=0.2$ and restrict $\lambda_\text{coi}$ to $\{0.15,0.30\}$. Behavioral distinguishability is assessed with a two-sample Mann--Whitney test on per-session divergence gap scores at cohort sizes $n_H=13$ and $n_A=16$. The simulator has multiple configurable factors. Training runs are driven by Weights \& Biases sweep definitions versioned with the codebase, mixing random and grid schedules rather than a single full factorial. For the contamination ratio $\alpha$, exploratory sweeps draw $\alpha$ uniformly on $[0.1,0.6]$ and then some sweeps use the narrower interval $[0.1,0.5]$. Grid sweeps fix explicit level sets, for example $\alpha\in\{0.1,0.2,0.3,0.4,0.6,0.8\}$ (six levels, including $0.8$ beyond the typical exploratory upper endpoint) or five levels $\{0.1,0.2,0.3,0.4,0.6\}$. Auxiliary schedules also include $\alpha=0$ alongside positive values. 
Robustness radius $\epsilon_\alpha$, COI penalty $\lambda_\text{coi}$, RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}), and the discretization of the price action grid vary by sweep. Broad random search may use uniform $\epsilon_\alpha\in[0,0.3]$ and $\lambda_\text{coi}\in[0.05,0.6]$; tighter grids may fix $\epsilon_\alpha=0.2$ and restrict $\lambda_\text{coi}$ to $\{0.15,0.30\}$. Behavioral distinguishability is assessed with a two-sample Mann--Whitney test on per-session divergence gap scores at cohort sizes $n_H=13$ and $n_A=16$.
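The difference between the grid and random schedules can be sketched in a few lines of Python. This is not the sweep definition used in the codebase: the dictionary layout, the key names, and the choice to combine the six-level $\alpha$ grid with the tighter $\epsilon_\alpha$ and $\lambda_\text{coi}$ values in a single sweep are assumptions made for illustration; only the numeric ranges and level sets are taken from the text.

```python
import random
from itertools import product

# Grid schedule: explicit level sets quoted in the text, combined here
# into one hypothetical sweep.
GRID = {
    "alpha": [0.1, 0.2, 0.3, 0.4, 0.6, 0.8],
    "epsilon_alpha": [0.2],
    "lambda_coi": [0.15, 0.30],
}


def expand_grid(grid):
    """Enumerate every combination of the listed levels (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def sample_random(rng):
    """Draw one exploratory configuration from the broad uniform ranges."""
    return {
        "alpha": rng.uniform(0.1, 0.6),
        "epsilon_alpha": rng.uniform(0.0, 0.3),
        "lambda_coi": rng.uniform(0.05, 0.6),
    }


grid_runs = expand_grid(GRID)
random_run = sample_random(random.Random(0))
```

Under these assumptions the grid yields $6 \times 1 \times 2 = 12$ configurations, whereas the random schedule can draw as many configurations as the compute budget allows.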
While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable. While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.
Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160\,PFLOPS of aggregate compute (derivation in Appendix~\ref{app:compute_budget}), which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted. Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160\,PFLOPS of aggregate compute (derivation in Appendix~\ref{app:compute_budget}), which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration, and reserve on-demand v4 capacity for runs that should not be interrupted.
\begin{table}[ht] \begin{table}[ht]
\centering \centering
@@ -265,9 +267,9 @@ For connections from Madrid, we prioritize the europe-west4 allocation for the s
% TODO: cite this (from bib) % TODO: cite this (from bib)
Training images follow Docker layer caching: dependency layers are separate from the copy of application source so routine code edits do not invalidate the entire build; only changes to the training entrypoint or dependencies force a full rebuild. Training images follow Docker layer caching principles, with the most stable content cached in the lowest layers. Dependency layers are separate from the copy of application source, so routine code edits do not invalidate the entire build; only changes to the training entrypoint or dependencies force a full rebuild.
TPU capacity is scarce and often preemptible, so we rely primarily on on-demand pods for workloads that must finish without interruption. A typical reservation is a 32-chip pod across four worker VMs; that layout already gives enough parallelism for our sweep driver without adding a separate cluster scheduler. We considered SLURM-style job arrays, but fluctuating provisioning times would have added operational overhead with little benefit for our workload, so orchestration stays in the container and Ray layer described below. TPU capacity is scarce and often preemptible, so we rely primarily on on-demand pods for workloads that must finish without interruption. A typical reservation is a 32-chip pod across four worker VMs. That layout already gives enough parallelism for our sweep driver without adding a separate cluster scheduler. We considered SLURM-style job arrays, but fluctuating provisioning times would have added operational overhead with little benefit for our workload, so orchestration stays in the container and Ray layer described below.
\subsubsection{Interaction Schema} \subsubsection{Interaction Schema}
@@ -275,7 +277,7 @@ We extend the basic event tuple $e_{s,k}$ to capture the full observational sign
\begin{equation} \begin{equation}
e_{s,k} = \left( a_{s,k}, \, i_{s,k}, \, t_{s,k}, \, \mu_{s,k}, \, \delta_{s,k} \right) e_{s,k} = \left( a_{s,k}, \, i_{s,k}, \, t_{s,k}, \, \mu_{s,k}, \, \delta_{s,k} \right)
\end{equation} \end{equation}
where $\mu_{s,k} \in \mathcal{M}$ is a metadata record containing action-specific context (e.g., price observed, filter parameters, element text), and $\delta_{s,k} \in \mathbb{R}_+$ is the dwell time in milliseconds for attention-based actions. where $\mu_{s,k} \in \mathcal{M}$ is a metadata record containing action-specific context (e.g., price observed, filter parameters, element text), and $\delta_{s,k} \in \mathbb{R}^+$ is the dwell time in milliseconds for attention-based actions.
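One way to picture the extended tuple $e_{s,k}$ is as a small typed record. The sketch below is illustrative only: the class name \texttt{Event}, the field names, and the choice of milliseconds for the timestamp are assumptions, while the five components and the non-negativity of the dwell time follow the definition above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical record mirroring e_{s,k} = (a, i, t, mu, delta)."""

    action: str                 # a_{s,k}: action type
    item: Optional[str]         # i_{s,k}: catalog item, if the action has one
    timestamp_ms: int           # t_{s,k}: event time (unit is an assumption)
    metadata: Dict[str, Any] = field(default_factory=dict)  # mu_{s,k}
    dwell_ms: float = 0.0       # delta_{s,k}: dwell time in milliseconds

    def __post_init__(self):
        # delta is defined on the non-negative reals, so reject negatives.
        if self.dwell_ms < 0:
            raise ValueError("dwell time must be non-negative")


# Example: a product view whose metadata carries the observed price.
event = Event("view", "room_a", 0, {"price_observed": 120.0}, 350.0)
```

Keeping the metadata as a free-form mapping reflects that $\mu_{s,k}$ is action-specific, so its keys differ between, for example, product views and dwell events.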
A session $s$ is itself a structured record: A session $s$ is itself a structured record:
\begin{equation} \begin{equation}
@@ -302,7 +304,7 @@ $\mathcal{A}_{\text{filter}}$ & \texttt{search}, \texttt{filter\_date}, \texttt{
\end{table} \end{table}
This partition enables the weight function $\omega$ from Eq.~\ref{eq:qhat} to assign category-specific signal strengths, with $\omega(\mathcal{A}_{\text{cart}}) > \omega(\mathcal{A}_{\text{dwell}}) > \omega(\mathcal{A}_{\text{nav}}) > \omega(\mathcal{A}_{\text{filter}})$ reflecting decreasing commitment. This partition enables the weight function $\omega$ from Eq.~\ref{eq:qhat} to assign category-specific signal strengths, with $\omega(\mathcal{A}_{\text{cart}}) > \omega(\mathcal{A}_{\text{dwell}}) > \omega(\mathcal{A}_{\text{nav}}) > \omega(\mathcal{A}_{\text{filter}})$ reflecting decreasing commitment.
The ordering cart $>$ dwell $>$ nav $>$ filter is a deliberate simplification: we set it from early data by ranking categories by KL divergence between human and agent transition rows and then spacing weights in powers of two. The simulator encodes cart $=4.0$, dwell $=2.0$, nav $=1.0$, filter $=0.5$; unknown actions map by prefix to the nearest category. The ordering cart $>$ dwell $>$ nav $>$ filter is a deliberate simplification: we set it from early data by ranking categories by KL divergence between human and agent transition rows and then spacing weights in powers of two. The simulator encodes cart $=4.0$, dwell $=2.0$, nav $=1.0$, filter $=0.5$. Unknown actions map by prefix to the nearest category or are discarded.
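The category weights and the prefix fallback can be sketched as a small lookup. The four weights are those stated above and \texttt{search} and \texttt{filter\_date} are named in the action table; every other action name, the prefix table, and the convention of returning $0$ for discarded actions are assumptions for illustration, not the simulator's actual mapping.

```python
CATEGORY_WEIGHTS = {"cart": 4.0, "dwell": 2.0, "nav": 1.0, "filter": 0.5}

# Known actions and their categories. Only "search" and "filter_date" appear
# in the source table; the remaining names are hypothetical.
KNOWN_ACTIONS = {
    "search": "filter",
    "filter_date": "filter",
    "add_to_cart": "cart",
    "hover": "dwell",
    "view_product": "nav",
}

# Hypothetical prefix table used as the fallback for unknown actions.
PREFIX_TO_CATEGORY = {"cart": "cart", "dwell": "dwell", "nav": "nav", "filter": "filter"}


def omega(action):
    """Return the signal weight of an action, or 0.0 if it is discarded."""
    if action in KNOWN_ACTIONS:
        return CATEGORY_WEIGHTS[KNOWN_ACTIONS[action]]
    for prefix, category in PREFIX_TO_CATEGORY.items():
        if action.startswith(prefix):   # prefix fallback for unknown actions
            return CATEGORY_WEIGHTS[category]
    return 0.0                          # no matching prefix: discard


weights = [omega(a) for a in ("add_to_cart", "filter_price", "cart_remove", "scroll")]
```

Here \texttt{filter\_price} and \texttt{cart\_remove} are unknown actions resolved by prefix, while \texttt{scroll} matches no prefix and contributes nothing to the demand proxy.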
The metadata record $\mu$ varies by action type. For product views, $\mu$ contains the observed price $p_{\text{obs}}$ and product attributes. For dwell events, $\mu$ includes the element text and accumulated hover duration. This heterogeneous structure is captured via a schema-on-read approach in our Kafka ingestion pipeline, where events are validated against type-specific schemas before storage. The metadata record $\mu$ varies by action type. For product views, $\mu$ contains the observed price $p_{\text{obs}}$ and product attributes. For dwell events, $\mu$ includes the element text and accumulated hover duration. This heterogeneous structure is captured via a schema-on-read approach in our Kafka ingestion pipeline, where events are validated against type-specific schemas before storage.
@@ -367,7 +369,7 @@ To scale this to catalog-level pricing, we expand the base event transition matr
\subsection{Distributionally Robust Reinforcement Learning (DR-RL)} \subsection{Distributionally Robust Reinforcement Learning (DR-RL)}
We formulate pricing as a Stackelberg game: the platform (leader) sets prices $p_t$, and the population (follower) responds through trajectories and demand. A useful intuition is that the platform behaves like a distorted mirror at a 45-degree angle: what it mirrors is population demand into an estimated demand proxy, and that proxy drives revenue. We formulate pricing as a Stackelberg game in which the platform (leader) sets prices $p_t$, and the population (follower) responds through trajectories and demand. A useful intuition is that the platform behaves like a distorted mirror at a 45-degree angle: what it mirrors is population demand into an estimated demand proxy, and that proxy drives revenue.
% TODO: add canonical Stackelberg citation. % TODO: add canonical Stackelberg citation.
Because contamination level $\alpha$ and demand shift are non-stationary online, a simple error term is not enough. We therefore use a Distributionally Robust Optimization objective. Let $\tau'$ be a newly observed trajectory generated by an unknown actor profile (sampled from the behavioral models in Section~\ref{sec:tpe}). We need a demand mapping conditioned on price and trajectory, $\hat{Q}(p,\tau')$. For each $\tau'$, we compute $\hat{\mathcal{T}}'$ and compare it with controlled baselines $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$: Because contamination level $\alpha$ and demand shift are non-stationary online, a simple error term is not enough. We therefore use a Distributionally Robust Optimization objective. Let $\tau'$ be a newly observed trajectory generated by an unknown actor profile (sampled from the behavioral models in Section~\ref{sec:tpe}). We need a demand mapping conditioned on price and trajectory, $\hat{Q}(p,\tau')$. For each $\tau'$, we compute $\hat{\mathcal{T}}'$ and compare it with controlled baselines $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$:
@@ -379,7 +381,7 @@ Because contamination level $\alpha$ and demand shift are non-stationary online,
\Delta_A &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_A) \Delta_A &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_A)
\end{align} \end{align}
From these two divergences we define the gap score: From these two divergences we define the gap score, following the divergence intuition highlighted earlier:
\begin{equation} \begin{equation}
g(\tau') = \Delta_H(\tau') - \Delta_A(\tau'). g(\tau') = \Delta_H(\tau') - \Delta_A(\tau').
\end{equation} \end{equation}
@@ -394,6 +396,13 @@ The session-level control signal injected into pricing is then
\hat{\alpha}(\tau') = f(\tau'). \hat{\alpha}(\tau') = f(\tau').
\end{equation} \end{equation}
\begin{figure}[ht]
\centering
\input{chapters/figures/sigmoid_softmax_gap.tex}
\caption{Logistic mapping from the gap $\Delta_H-\Delta_A$ to the weak agent probability $f(\tau')$. Markers indicate the contrasts $\Delta_H<\Delta_A$ and $\Delta_H>\Delta_A$.}
\label{fig:sigmoid_softmax_gap}
\end{figure}
This turns distinguishability into an operational control input in the engine. On a per-customer or use-case basis, a similar data collection and fitting process should be repeated to obtain domain-specific behavior kernels. This turns distinguishability into an operational control input in the engine. On a per-customer or use-case basis, a similar data collection and fitting process should be repeated to obtain domain-specific behavior kernels.
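As a hedged numerical illustration of the mapping from divergences to the control signal, assume the logistic form $f(\tau') = 1/(1+e^{-\kappa g(\tau')})$ with slope $\kappa=1$; the slope $\kappa$ and the divergence values below are illustrative assumptions, not fitted quantities. For a trajectory with $\Delta_H=0.9$ and $\Delta_A=0.3$:
\begin{align*}
g(\tau') &= 0.9 - 0.3 = 0.6, \\
f(\tau') &= \frac{1}{1+e^{-0.6}} \approx 0.646.
\end{align*}
Swapping the two divergences gives $g(\tau')=-0.6$ and $f(\tau')\approx 0.354$, so a session closer to the human baseline receives a weak agent probability below one half under this assumed form.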
In implementation we keep an alternating game-history buffer and advance it each epoch with two transitions: the platform publishes a price vector (leader move), then the environment returns trajectory-derived demand (follower move). We call this the \textit{Limbo}. In implementation we keep an alternating game-history buffer and advance it each epoch with two transitions: the platform publishes a price vector (leader move), then the environment returns trajectory-derived demand (follower move). We call this the \textit{Limbo}.
@@ -401,7 +410,7 @@ In implementation we keep an alternating game-history buffer and advance it each
To avoid notation drift, we separate two COI objects used for different purposes: To avoid notation drift, we separate two COI objects used for different purposes:
\begin{align} \begin{align}
\text{COI}_{\text{level}}(\pi) &= \mathbb{E}[P]-\underline{p}\\ \text{COI}_{\text{level}}(\pi) &= \mathbb{E}[P]-\underline{p}\\
\text{COI}_{\text{leak}}(p,\tau') &= f(\tau')\cdot \text{InfoValue}(p,\tau') \quad \text{(local control penalty)} \text{COI}_{\text{leak}}(p,\tau') &= f(\tau')\cdot \text{InfoValue}(p,\tau')
\end{align} \end{align}
where $\text{COI}_{\text{level}}$ is evaluated at policy level and $\text{COI}_{\text{leak}}$ is evaluated per observed quote during training. The information-value term is discussed later, together with the reward structure. where $\text{COI}_{\text{level}}$ is evaluated at policy level and $\text{COI}_{\text{leak}}$ is evaluated per observed quote during training. The information-value term is discussed later, together with the reward structure.
@@ -468,9 +477,9 @@ $}%
The robust policy $\pi^*$ is obtained by solving the maximin problem: The robust policy $\pi^*$ is obtained by solving the maximin problem:
\begin{equation} \begin{equation}
\label{eq:robust_policy} \label{eq:robust_policy}
\pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}_{\text{leak}}(p,\tau') \right] \pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}_{\text{leak}}(p,\tau') - \eta_{\text{ux}} \cdot \text{UX}(\tau', p) \right]
\end{equation} \end{equation}
where $R(p, d)$ is the revenue function and $\lambda$ weighs the information-leakage penalty. Note that $p$ depends directly on $\pi$, since the price is the action the policy selects. where $R(p, d)$ is the revenue function, $\lambda$ weighs the information-leakage penalty, $\eta_{\text{ux}}$ weighs the user-experience penalty, and $\text{UX}(\tau', p)\in[0,1]$. Note that $p$ depends directly on $\pi$, since the price is the action the policy selects.
Looking at the reward structure, note that we subtract not COI itself but the leakage of COI, which is defined below. Looking at the reward structure, note that we subtract not COI itself but the leakage of COI, which is defined below.
@@ -484,7 +493,7 @@ The inner minimization selects the contamination candidate that makes the penali
For the baseline engine reported here, we intentionally use the constant query-tax surrogate to keep the mechanism minimal: For the baseline engine reported here, we intentionally use the constant query-tax surrogate to keep the mechanism minimal:
\begin{equation} \begin{equation}
r_t = R(p_t,\tilde q_t) - \lambda\,f(\tau_t')\,c_{\text{info}} r_t = \ldots - \lambda\,f(\tau_t')\,c_{\text{info}}
\end{equation} \end{equation}
with fixed $c_{\text{info}}>0$. with fixed $c_{\text{info}}>0$.
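A worked instance of this baseline reward may help fix the scale of the penalty. The values below are illustrative assumptions rather than reported settings ($p_t=10$, $\tilde q_t=3$, $\lambda=0.5$, $f(\tau_t')=0.65$, $c_{\text{info}}=2$), and the sketch assumes the leading term is the revenue $R(p_t,\tilde q_t)$ with $R(p,d)=p\cdot d$ and no further terms:
\begin{align*}
R(p_t,\tilde q_t) &= 10\cdot 3 = 30, \\
\lambda\,f(\tau_t')\,c_{\text{info}} &= 0.5\cdot 0.65\cdot 2 = 0.65, \\
r_t &= 30 - 0.65 = 29.35.
\end{align*}
Because $c_{\text{info}}$ is constant, the penalty in this surrogate varies only with the session-level probability $f(\tau_t')$.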
@@ -557,7 +566,7 @@ The baseline achieves approximately 26 steps per second. Enabling the robustness
\begin{table}[ht] \begin{table}[ht]
\centering \centering
\caption{Per-step profiling results (20 steps, $M=10$ sessions, $N=3$ products). Self-time measures time spent inside the function excluding callees; cumulative time includes the full call subtree.} \caption{Per-step profiling results (20 steps, $M=10$ sessions, $N=3$ products). Self-time measures time spent inside the function excluding callees; cumulative time includes the full call subtree.}
\label{tab:profile_results} \label{tab:profile_results}
\begingroup \begingroup
\small \small

@@ -300,9 +300,9 @@ where $W_p$ is the $p$-Wasserstein distance and $\epsilon > 0$ is the ambiguity
The platform seeks a policy $\pi^*$ that maximizes worst-case revenue over the ambiguity set while penalizing information leakage to suspected agents: The platform seeks a policy $\pi^*$ that maximizes worst-case revenue over the ambiguity set while penalizing information leakage to suspected agents:
\begin{equation} \begin{equation}
\label{eq:robust_policy} \label{eq:robust_policy}
\pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \; \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}_{\text{leak}}(p, \tau') - \eta \cdot \text{UX}(\tau', p) \right] \pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \; \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}_{\text{leak}}(p, \tau') - \eta_{\text{ux}} \cdot \text{UX}(\tau', p) \right]
\end{equation} \end{equation}
where $R(p, d) = p \cdot d$ is the revenue function. where $R(p, d) = p \cdot d$ is the revenue function, $\lambda$ scales COI leakage, and $\eta_{\text{ux}}$ scales the UX penalty with $\text{UX}(\tau', p)\in[0,1]$.
\begin{definition}[COI Leakage] \begin{definition}[COI Leakage]
The per-query information leakage cost is: The per-query information leakage cost is:

@@ -80,7 +80,7 @@ Because contamination level $\alpha$ and demand shift are non-stationary online,
We therefore use a Distributionally Robust Optimization objective. We therefore use a Distributionally Robust Optimization objective.
We define an ambiguity set $\mathcal{U}_\epsilon(\hat{P}_N)$ centered around our empirical reference distribution $\hat{P}_N$ (derived from the generator $\mathcal{G}$). We define an ambiguity set $\mathcal{U}_\epsilon(\hat{P}_N)$ centered around our empirical reference distribution $\hat{P}_N$ (derived from the generator $\mathcal{G}$).
We utilize the Wasserstein distance metric to define the set of plausible demand distributions the agent might face. We utilize the Wasserstein distance metric to define the set of plausible demand distributions the agent might face.
The robust policy $\pi^*$ is obtained by solving the maximin problem $\pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}_{\text{leak}}(p,\tau') \right]$ where $R(p, d)$ is the revenue function and $\lambda$ weighs the information-leakage penalty. The robust policy $\pi^*$ is obtained by solving the maximin problem $\pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}_{\text{leak}}(p,\tau') - \eta_{\text{ux}} \cdot \text{UX}(\tau', p) \right]$ where $R(p, d)$ is the revenue function, $\lambda$ weighs the information-leakage penalty, and $\eta_{\text{ux}}$ weighs the UX term.
In practice, we parameterize this with a session-level leakage term $\text{COI}_{\text{leak}}(p,\tau') = f(\tau')\cdot \text{InfoValue}(p,\tau')$ where $f(\tau')$ is the weak agent probability. In practice, we parameterize this with a session-level leakage term $\text{COI}_{\text{leak}}(p,\tau') = f(\tau')\cdot \text{InfoValue}(p,\tau')$ where $f(\tau')$ is the weak agent probability.
As part of reward engineering, we keep a UX factor ($\text{UX}\in[0,1]$) as an auxiliary evaluation axis. As part of reward engineering, we keep a UX factor ($\text{UX}\in[0,1]$) as an auxiliary evaluation axis.
Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve.