|
|
|
|
@@ -7,7 +7,7 @@ This section details the theoretical and practical framework developed to addres
|
|
|
|
|
|
|
|
|
|
\subsection{Problem Formalization}
|
|
|
|
|
|
|
|
|
|
We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $Y_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.
|
|
|
|
|
We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $\theta_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.
|
|
|
|
|
|
|
|
|
|
Each session produces a trajectory of observable events $\tau_s = (e_{s,1}, \ldots, e_{s,L_s})$. An event $e_{s,k}$ is a tuple defined as:
|
|
|
|
|
\begin{equation}
|
|
|
|
|
@@ -18,7 +18,7 @@ where:
|
|
|
|
|
\item $a_{s,k} \in \mathcal{A}$ is the action taken (e.g., \texttt{view\_item}, \texttt{add\_to\_cart}).
|
|
|
|
|
\item $i_{s,k} \in \{1, \ldots, N\}$ is the target item index.
|
|
|
|
|
\item $t_{s,k} \in \mathbb{R}_+$ is the continuous timestamp.
|
|
|
|
|
\end{itemize}
|
|
|
|
|
\end{itemize}}
|
|
|
|
|
|
|
|
|
|
The platform does not directly observe the true underlying demand function $d(p)$. Instead, it observes a behavioral proxy $\hat{q}_t$, which is a composite signal derived from the mixture of actor types. We define the demand proxy for product $i$ at epoch $t$ as a weighted aggregation of events:
|
|
|
|
|
\begin{equation}
|
|
|
|
|
@@ -148,7 +148,10 @@ Reproducible results are key to quality research platforms, this is taken into m
|
|
|
|
|
\subsubsection{Online Dynamic Pricing}
|
|
|
|
|
|
|
|
|
|
To collect data from actors under the correct conditions, we replicate a simple, naive dynamic pricing algorithm that runs in the background during the experiments.
|
|
|
|
|
Dynamic pricing is handled by a pipeline that computes a per-product demand estimate over a window of the data defined by the period $T$, which defaults to 5 minutes. The pipeline computes a demand estimate vector $\hat{q} \in \mathbb{R}^N$ as a weighted sum of interactions for each product, and additionally computes a price elasticity vector $\hat{\epsilon}$ of the same dimension as the demand vector. The resulting feature matrix has size $N \times 2$, which we map to a new price vector $\hat{p} \in \mathbb{R}^N$. The transformation that governs this dynamic pricing is a simple surge-based pricing rule (a special case of the policy $\pi$ defined later):
|
|
|
|
|
Dynamic pricing is handled by a pipeline that computes a per-product demand estimate over a window of the data defined by the period $T$, which defaults to 5 minutes. The pipeline computes a demand estimate vector $\hat{q} \in \mathbb{R}^N$ as a weighted sum of interactions for each product, and additionally computes a price elasticity vector $\hat{\epsilon}$ of the same dimension as the demand vector. The resulting feature matrix has size $N \times 2$, which we map to a new price vector $\hat{p} \in \mathbb{R}^N$.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The transformation that governs this dynamic pricing is a simple surge-based pricing rule (a special case of the policy $\pi$ defined later):
|
|
|
|
|
|
|
|
|
|
\begin{equation}
|
|
|
|
|
\hat{p}_i = \begin{cases}
|
|
|
|
|
@@ -183,7 +186,7 @@ The human data collection involved 18 participants, all of whom provided explici
|
|
|
|
|
|
|
|
|
|
To evaluate the quality and realism of the setup, we store both structured event logs and full interaction transcripts. This lets us combine quantitative analysis with transcript-level qualitative findings. The result is an isolated system where we can control the interaction process while preserving realistic behavior.
|
|
|
|
|
|
|
|
|
|
Operationally, goals and experiment runs are tracked in PostgreSQL (goal table, run table, and assignment mapping). This data-acquisition phase is the first half of the methodology and is intentionally a disconnected component that feeds the later contributions. The second half uses collected behavioral traces to separate classes $y \in \{A,H\}$ with session-conditioned probability estimates, then injects those estimates into the pricing learner.
|
|
|
|
|
Operationally, goals and experiment runs are tracked in PostgreSQL (goal table, run table, and assignment mapping). This data-acquisition phase is the first half of the methodology and is intentionally a disconnected component that feeds the later contributions. The second half uses collected behavioral traces to separate classes $\theta \in \{A,H\}$ with session-conditioned probability estimates, then injects those estimates into the pricing learner.
|
|
|
|
|
|
|
|
|
|
Our process follows three stages: (1) observe and \textit{vectorize} behavioral interactions, (2) learn separability to characterize human versus agent patterns, and (3) use the learned signal to train a defensive policy in a controlled dynamic-pricing simulator.
|
|
|
|
|
|
|
|
|
|
@@ -207,6 +210,7 @@ The simulator has multiple configurable factors. We design a multi-factor study
|
|
|
|
|
% Power analysis plan: apply a two-sample Mann-Whitney U (or permutation test) on per-session (delta_H - delta_A) divergence scores comparing the human and agent groups. Compute minimum detectable effect size at alpha=0.05, power=0.8, given n=18 per group. Bootstrap confidence intervals on mean KL are a cleaner complement given the non-normality of divergence distributions.
|
|
|
|
|
While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.
|
|
|
|
|
|
|
|
|
|
% TODO: cite in the appendix the math to get to 160 petaflops of compute
|
|
|
|
|
Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160 PFLOPS of aggregate compute, which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.
|
|
|
|
|
|
|
|
|
|
\begin{table}[ht]
|
|
|
|
|
@@ -281,7 +285,7 @@ $\mathcal{A}_{\text{filter}}$ & \texttt{search}, \texttt{filter\_date}, \texttt{
|
|
|
|
|
\end{table}
|
|
|
|
|
|
|
|
|
|
This partition enables the weight function $\omega$ from Eq.~\ref{eq:qhat} to assign category-specific signal strengths, with $\omega(\mathcal{A}_{\text{cart}}) > \omega(\mathcal{A}_{\text{dwell}}) > \omega(\mathcal{A}_{\text{nav}}) > \omega(\mathcal{A}_{\text{filter}})$ reflecting decreasing commitment.
|
|
|
|
|
|
|
|
|
|
It is important to acknowledge that this ordering is a strong assumption in the weighting. We motivate the scale of each weight by the observed per-category divergence between the two behavioral profiles.
|
|
|
|
|
In the simulator baseline this order is encoded with a compact fixed scale: cart $=4.0$, dwell $=2.0$, nav $=1.0$, filter $=0.5$. Unknown actions are mapped by prefix heuristics to the nearest category.
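
As a sketch of how this scale acts, assume (hypothetically, for illustration only) that the proxy in Eq.~\ref{eq:qhat} reduces to an unnormalized sum of category weights over the events that target product $i$ within one window, with no time decay or other terms. A window containing one cart event, two dwell events, three navigation events, and two filter events for that product would then give
\[
\hat{q}_i = 1 \cdot 4.0 + 2 \cdot 2.0 + 3 \cdot 1.0 + 2 \cdot 0.5 = 12.0,
\]
so the single cart event contributes as much as the two dwell events combined, and as much as the three navigation and two filter events together. The event counts here are invented to make the arithmetic easy to follow and are not taken from the collected data.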
|
|
|
|
|
|
|
|
|
|
The metadata record $\mu$ varies by action type. For product views, $\mu$ contains the observed price $p_{\text{obs}}$ and product attributes. For dwell events, $\mu$ includes the element text and accumulated hover duration. This heterogeneous structure is captured via a schema-on-read approach in our Kafka ingestion pipeline, where events are validated against type-specific schemas before storage.
|
|
|
|
|
@@ -289,13 +293,15 @@ The metadata record $\mu$ varies by action type. For product views, $\mu$ contai
|
|
|
|
|
In addition to behavioral events, the platform logs price observations to a separate Kafka topic. Each price query generates a record $(i, p, \text{sid}, \phi, t)$ associating the product, displayed price, requesting session, platform mode, and timestamp. This dual-stream architecture enables joint analysis of price exposure and behavioral response.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Generative Contamination and Separability}
|
|
|
|
|
|
|
|
|
|
To train a robust pricing learner, we need a simulator that can generate realistic interaction data under controlled contamination. We build this from Phantom data using a two-stage approach.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Ground-Truth Separability}
|
|
|
|
|
Because sessions are collected under controlled experimental conditions where each actor is assigned a known type at the start of the trial, labels $y_s \in \{H, A\}$ are available as ground truth rather than as the output of a heuristic classifier. We therefore estimate separate transition kernels directly from each labeled partition $\mathcal{D}_H$ and $\mathcal{D}_A$, treating the resulting $\hat{\mathcal{T}}_H$ and $\hat{\mathcal{T}}_A$ as the ground-truth behavioral profiles for each class. We then ask a direct methodological question: are the kernels separable enough to justify downstream pricing control that depends on that separability?
|
|
|
|
|
Because sessions are collected under controlled experimental conditions where each actor is assigned a known type at the start of the trial, labels $\theta_s \in \{H, A\}$ are available as ground truth rather than as the output of a heuristic classifier. We therefore estimate separate transition kernels directly from each labeled partition $\mathcal{D}_H$ and $\mathcal{D}_A$, treating the resulting $\hat{\mathcal{T}}_H$ and $\hat{\mathcal{T}}_A$ as the ground-truth behavioral profiles for each class. We then ask a direct methodological question: are the kernels separable enough to justify downstream pricing control that depends on that separability?
|
|
|
|
|
|
|
|
|
|
To answer this, we compute average KL divergence between transition probability matrices. This statistic gives global separability and event-level diagnostics at the same time. To test whether the observed between-class value exceeds finite-sample estimation noise, we compute an intra-class bootstrap baseline by repeatedly splitting $\mathcal{D}_H$ and $\mathcal{D}_A$ into two random halves, fitting a transition kernel on each half, and re-computing the same average KL statistic for each split.
|
|
|
|
|
|
|
|
|
|
@@ -303,7 +309,7 @@ Formally, for $B$ bootstrap splits per class we obtain reference samples $\{d_{H
|
|
|
|
|
\begin{equation}
|
|
|
|
|
\hat p = \frac{1 + \sum_{j=1}^{2B}\mathbf{1}\{d_j^{\text{intra}} \ge d^{\text{inter}}\}}{2B + 1},
|
|
|
|
|
\end{equation}
|
|
|
|
|
which gives a direct significance check for separability before using divergence-derived control signals in pricing.
|
|
|
|
|
which gives a direct significance check for separability before using divergence-derived centroid control signals in pricing.
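
For illustration only, with hypothetical numbers that are not drawn from our experiments: suppose $B = 500$ splits per class, so that $2B = 1000$ intra-class reference values are available. If none of them reaches $d^{\text{inter}}$, the estimate attains its smallest possible value; if $12$ of them do, it is roughly an order of magnitude larger:
\[
\hat p_{\min} = \frac{1 + 0}{1000 + 1} \approx 0.0010,
\qquad
\hat p = \frac{1 + 12}{1000 + 1} \approx 0.0130 .
\]
The $+1$ terms keep $\hat p$ strictly positive, so the resolution of the check is bounded below by $1/(2B+1)$, and $B$ should be chosen with the target significance level in mind.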
|
|
|
|
|
|
|
|
|
|
\begin{definition}[Kullback-Leibler Divergence for Transition Distributions]
|
|
|
|
|
Let $P_e$ and $Q_e$ be categorical distributions over destination states following event $e$, derived from human and agent trajectories respectively. The KL divergence between these distributions is:
|
|
|
|
|
@@ -346,9 +352,6 @@ To scale this to catalog-level pricing, we expand the base event transition matr
|
|
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Second-Stage Classification}
|
|
|
|
|
After contamination, we run a second classification stage. We remap events into a semantically aligned feature space, apply richer feature engineering, and retrain to obtain cleaner label probabilities across the full dataset. This classifier is then used directly in the reinforcement-learning reward structure.
|
|
|
|
|
|
|
|
|
|
\subsection{Distributionally Robust Reinforcement Learning (DR-RL)}
|
|
|
|
|
|
|
|
|
|
We formulate pricing as a Stackelberg game: the platform (leader) sets prices $p_t$, and the population (follower) responds through trajectories and demand. A useful intuition is that the platform behaves like a distorted mirror at a 45-degree angle: it reflects population demand into an estimated demand proxy, and that proxy drives revenue.
|
|
|
|
|
@@ -383,6 +386,44 @@ For the current engine baseline, we use a compact inner-robust approximation by
|
|
|
|
|
and we evaluate a small fixed grid in $\mathcal{A}_{\epsilon_\alpha}(\alpha_0)$ per step, selecting the worst-case candidate for the learner.
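
One way to write this selection step explicitly is sketched below. The notation is ours and is an assumption rather than part of the engine specification: $\mathcal{G} = \{\alpha^{(1)}, \ldots, \alpha^{(J)}\} \subset \mathcal{A}_{\epsilon_\alpha}(\alpha_0)$ denotes the fixed grid of $J$ candidates, and $r_t(p_t, \alpha)$ denotes the learner's per-step reward when the environment is driven by candidate $\alpha$. Under these assumptions the worst-case candidate is
\[
\alpha_t^{\text{wc}} \in \arg\min_{\alpha \in \mathcal{G}} \; r_t(p_t, \alpha),
\qquad
\min_{\alpha \in \mathcal{G}} r_t(p_t, \alpha) \;\ge\; \inf_{\alpha \in \mathcal{A}_{\epsilon_\alpha}(\alpha_0)} r_t(p_t, \alpha).
\]
The inequality holds because $\mathcal{G}$ is a finite subset of the ambiguity set. The grid can therefore understate the true worst case, which is the cost of the cheaper approximation.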
|
|
|
|
|
% A proper Wasserstein ball implementation over the full demand distribution (rather than a scalar alpha interval) would use the POT library (Python Optimal Transport): compute W_2 between the empirical reference P_hat and each candidate Q using ot.emd2() or ot.sliced_wasserstein_distance() for scalability, then accept only candidates within epsilon. In practice the inner minimization becomes: candidates = [G(alpha) for alpha in linspace]; dists = [ot.emd2(p_hat, q, M) for q in candidates]; worst = candidates[argmin(reward[dists <= epsilon])]. The current grid-on-alpha approximation is a computationally cheap substitute; moving to a true Wasserstein ball would tighten the worst-case guarantee but requires specifying the ground metric M over the demand space.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Environment Setup for Dynamic Pricing}
|
|
|
|
|
The complete pricing-demand-trajectory loop is illustrated in Figure~\ref{fig:oracle_flow}. The Oracle maps historical price and demand state to a new price vector, which is exposed to a distribution of demand curves. Each product generates trajectories weighted by behavioral kernels $\tau_\theta$, producing a full transition matrix $\tau'$ over sessions. Sampled trajectories $\{\tau_k\}$ are aggregated through the demand proxy function $Q(\cdot)$ to yield the next demand vector, which feeds back into the Oracle.
|
|
|
|
|
|
|
|
|
|
\begin{figure}[ht]
|
|
|
|
|
\centering
|
|
|
|
|
\[
\begin{aligned}
\text{Oracle}(\vec{p}_{t-1},\vec{\hat{q}})
&\to
\begin{pmatrix}
p_1\\
p_2\\
\vdots\\
p_N
\end{pmatrix}
\xrightarrow{d_i \sim \mathcal{N}_{\vec{p}}}
\begin{pmatrix}d_1\\ d_2\\ \vdots \\ d_N\end{pmatrix}
\xrightarrow{\vec{d}\times \tau_\theta \to \tau^\prime}
\begin{bmatrix}
0.01 & 0.02 & \cdots & 0.3 \\
0.41 & 0.24 & \cdots & 0.0 \\
\vdots & \vdots & \ddots & \vdots \\
0.51 & 0.09 & \cdots & 0.1
\end{bmatrix}
\xrightarrow{\tau_k \sim \tau^\prime}
\{\tau_k\}_{k=1}^{K} \to Q(\tau_k)
\\
&\to
\begin{pmatrix}
\hat{q}_1 \\
\hat{q}_2 \\
\vdots \\
\hat{q}_N
\end{pmatrix}
\to \text{Oracle}(\cdot)
\end{aligned}
\]
|
|
|
|
|
\caption{Oracle-based pricing loop: historical price and demand state map to a new price vector; each product samples demand curves from $\mathcal{N}_{\vec{p}}$; trajectories are generated by mixing demand with behavioral kernels $\tau_\theta$ into transition matrix $\tau'$; sampled trajectories $\{\tau_k\}$ aggregate through proxy $Q(\cdot)$ to yield updated demand $\vec{\hat{q}}$, closing the feedback loop.}
|
|
|
|
|
\label{fig:oracle_flow}
|
|
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
|
|
\subsubsection{The Min-Max Objective}
|
|
|
|
|
The robust policy $\pi^*$ is obtained by solving the maximin problem:
|
|
|
|
|
\begin{equation}
|
|
|
|
|
|