mirror of
https://github.com/velocitatem/PHANTOM.git
synced 2026-05-31 08:33:36 +00:00
366 lines
30 KiB
TeX
366 lines
30 KiB
TeX
\section{Methodology}
|
|
|
|
This section details the theoretical and practical framework developed to address dynamic pricing under the influence of non-human actors. We begin by formalizing the problem environment and the nature of the actors. We then derive the \textit{Cost of Information} (COI) theorem, proving the erosion of pricing power in the limit of agent saturation. Following this, we outline our generative contamination strategy using GOFAI-driven separability and transition probability learning. Finally, we formulate the robust control problem as a Stackelberg game solved via Distributionally Robust Reinforcement Learning (DR-RL) with constructed ambiguity sets.
|
|
|
|
\subsection{Problem Formalization}
|
|
|
|
We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $Y_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.
|
|
|
|
Each session produces a trajectory of observable events $\tau_s = (e_{s,1}, \ldots, e_{s,L_s})$. An event $e_{s,k}$ is a tuple defined as:
|
|
\begin{equation}
|
|
e_{s,k} = (a_{s,k}, i_{s,k}, t_{s,k})
|
|
\end{equation}
|
|
where:
|
|
\begin{itemize}
|
|
\item $a_{s,k} \in \mathcal{A}$ is the action taken (e.g., \texttt{view\_item}, \texttt{add\_to\_cart}).
|
|
\item $i_{s,k} \in \{1, \ldots, N\}$ is the target item index.
|
|
\item $t_{s,k} \in \mathbb{R}_+$ is the continuous timestamp.
|
|
\end{itemize}
|
|
|
|
The platform does not directly observe the true underlying demand function $d(p)$. Instead, it observes a behavioral proxy $\hat{q}_t$, which is a composite signal derived from the mixture of actor types. We define the demand proxy for product $i$ at epoch $t$ as a weighted aggregation of events:
|
|
\begin{equation}
|
|
\label{eq:qhat}
|
|
\hat{q}_{t,i} = \sum_{s \in \mathcal{S}_t} \sum_{k=1}^{L_s} \omega(a_{s,k}) \cdot \mathbb{1}[i_{s,k} = i]
|
|
\end{equation}
|
|
where $\omega: \mathcal{A} \to \mathbb{R}_+$ assigns weights to actions based on their signal strength regarding willingness to pay.
|
|
|
|
\subsubsection{Actor Types and Demand Curves}
|
|
We formalize the heterogeneity of actors by introducing a type space $\Theta$. An actor of class $Y_s$ is further parameterized by a type $\theta \sim \mathcal{D}_{Y}$. This type determines the actor's demand response function $d(p; \theta)$, sampled from a distribution of possible demand curves. The total observed demand is a stochastic process governed by the naively defined mixture:
|
|
\begin{equation}
|
|
\label{eq:mixture_demand}
|
|
Q(p) = (1-\alpha) \cdot \mathbb{E}_{\theta \sim \mathcal{D}_H}[d(p; \theta)] + \alpha \cdot \mathbb{E}_{\theta \sim \mathcal{D}_A}[d(p; \theta)] + \epsilon_t
|
|
\end{equation}
|
|
where $\alpha \in [0, 1]$ represents the contamination parameter (proportion of agents) and $\epsilon_t$ is non-stationary market noise.
|
|
|
|
|
|
|
|
\subsection{Cost of Information (COI) Framework}
|
|
|
|
The \textit{Cost of Information} (COI) represents the markup a pricing policy $\pi$ attempts to extract from the market by leveraging demand signals. We define COI as the expected premium over the minimum viable price $\underline{p}$ (or marginal cost). This also speaks to the financial urgency as a consequence of information asymmetry between the platform and the actors.
|
|
|
|
\begin{definition}[Cost of Information]
|
|
Let $\pi(\tau)$ be a pricing policy mapping interaction histories to prices. The COI is defined as:
|
|
\begin{align}
|
|
\text{COI} &= \mathbb{E}[P] - \underline{p} \\
|
|
&= \int_{\underline{p}}^{\bar{p}} (1 - F_\pi(p)) \, dp
|
|
\end{align}
|
|
where $F_\pi(p)$ is the cumulative distribution function of prices generated by $\pi$ under standard operating conditions.
|
|
\end{definition}
|
|
|
|
\begin{figure}[ht]
|
|
\centering
|
|
\begin{tikzpicture}[scale=1.2]
|
|
% Define the Gaussian function: centered at 2
|
|
\def\bellcurve(#1){1.5 * exp(-0.5*((#1-2)/0.6)^2)}
|
|
|
|
% Draw the main axis
|
|
\draw[->, thick] (0, 0) -- (4.5, 0) node[right] {$p$};
|
|
\draw[->, thick] (0, 0) -- (0, 2) node[above] {Density};
|
|
|
|
\draw[thick, smooth, samples=100] plot[domain=0:4] (\x, {\bellcurve(\x)});
|
|
\node at (3.2, 1.2) {$f_\pi(p)$};
|
|
|
|
% Define p_min and E[p]
|
|
\def\pmin{0.8}
|
|
\def\mean{2}
|
|
|
|
% Vertical lines
|
|
\draw[dashed] (\pmin, 0) -- (\pmin, 2.0);
|
|
\draw[dashed] (\mean, 0) -- (\mean, 2.0);
|
|
|
|
% Labels on axis
|
|
\node[below] at (\pmin, 0) {$\underline{p}$};
|
|
\node[below] at (\mean, 0) {$\mathbb{E}[p]$};
|
|
|
|
\draw[<->, thick, red] (\pmin, 2.0) -- (\mean, 2.0) node[midway, above] {COI};
|
|
|
|
\end{tikzpicture}
|
|
\caption{Illustration of the Cost of Information (COI). The COI is defined as the difference between the expected price $\mathbb{E}[p]$ realized by the policy and the minimum viable price $\underline{p}$.}
|
|
\label{fig:coi_illustration}
|
|
\end{figure}
|
|
|
|
We now formally demonstrate that standard dynamic pricing mechanisms are not incentive-compatible with high-frequency agentic traffic. As the number of independent competitive agents $N$ querying the system grows, the platform's ability to sustain a COI vanishes.
|
|
|
|
\begin{theorem}[COI Erosion in the Limit]
|
|
Let $N$ be the number of independent, utility-maximizing agents querying the platform. Let $p_{(1)}$ be the first order statistic (minimum) of the prices offered to these agents. As $N \to \infty$, the Cost of Information converges to 0.
|
|
\end{theorem}
|
|
|
|
\begin{proof}
|
|
Let $p_1, \ldots, p_N$ be independent and identically distributed (i.i.d.) price samples drawn from the policy's distribution $F(p)$ with support $[\underline{p}, \bar{p}]$. The realizable price for an optimal searching agent is the first order statistic $p_{(1)} = \min(p_1, \ldots, p_N)$.
|
|
|
|
The survival function (or reliability function) of the minimum price is given by:
|
|
\begin{equation}
|
|
S_{p_{(1)}}(t) = P(p_{(1)} > t) = [1 - F(t)]^N
|
|
\end{equation}
|
|
|
|
To determine the expected value $\mathbb{E}[p_{(1)}]$, we recall the property that for any continuous random variable $X$ with support $[A, B]$, the expectation can be expressed as the lower bound plus the integral of the survival function:
|
|
\begin{equation}
|
|
\mathbb{E}[X] = A + \int_{A}^{B} P(X > t) \, dt
|
|
\end{equation}
|
|
|
|
Applying this to our pricing statistic where the lower bound is $\underline{p}$:
|
|
\begin{align}
|
|
\mathbb{E}[p_{(1)}] &= \underline{p} + \int_{\underline{p}}^{\bar{p}} P(p_{(1)} > t) \, dt \\
|
|
&= \underline{p} + \int_{\underline{p}}^{\bar{p}} [1 - F(t)]^N \, dt
|
|
\end{align}
|
|
|
|
Since $F(t)$ is a valid CDF, for any $t > \underline{p}$, we have strict inequality $F(t) > 0$, implying $0 \le 1 - F(t) < 1$. By the properties of limits, as $N \to \infty$, the term $[1 - F(t)]^N$ converges to 0 pointwise for all $t > \underline{p}$.
|
|
|
|
Applying the Lebesgue Dominated Convergence Theorem (noting that the integrand is bounded by 1 on the finite interval $[\underline{p}, \bar{p}]$):
|
|
\begin{equation}
|
|
\lim_{N \to \infty} \int_{\underline{p}}^{\bar{p}} [1 - F(t)]^N \, dt = \int_{\underline{p}}^{\bar{p}} 0 \, dt = 0
|
|
\end{equation}
|
|
|
|
Substituting this back into the expression for COI:
|
|
\begin{align}
|
|
\lim_{N \to \infty} \text{COI} &= \lim_{N \to \infty} (\mathbb{E}[p_{(1)}] - \underline{p}) \\
|
|
&= \lim_{N \to \infty} \left( (\underline{p} + 0) - \underline{p} \right) \\
|
|
&= 0
|
|
\end{align}
|
|
\end{proof}
|
|
|
|
|
|
This result proves that standard pricing policies $\pi$ fail to extract surplus in the presence of large-scale agentic search, necessitating a robust counter-mechanism.
|
|
|
|
% The DRO objective creates a lower bound on COI extraction, effectively guaranteeing a minimum margin even in the presence of adversarial agents. we need to prove this and demonstrate that in a theorem.
|
|
|
|
|
|
%Mathematical demonstration and validation of the COI and citation backed evidence, and framework overview + show harm to user via other cost distortions. Maybe split into 3.2.1 (COI Theory) and 3.2.2 (Framework Design)
|
|
|
|
\subsection{System Architecture: Hybrid Kappa-Lambda Architecture}
|
|
|
|
In order for our research to have grounding in interactions we built a robust e-commerce web-platform. We initially conducted a survey of the leading platforms of airlines and hotel booking sites to identify the specific interface patterns that effectively manage complex travel data. Our analysis revealed a clear industry standard: while both sectors rely on tabbed service selection and left-sidebar filtering to streamline navigation, they diverge in result presentation: airlines utilize visual date-price bars and multi-step wizards to optimize for logistical transparency, whereas hotel platforms leverage image-led cards and scarcity triggers to drive emotional engagement and urgency. Our web framework defines a highly agnostic boilerplate which can be seeded with any data-modality with an easy-to-tailor pattern, which we leverage to define a \texttt{hotel} and \texttt{airline} mode. Both modes are then individually deployed via an environment level argument which adjusts the proxy routing with a custom middleware inside next.js to render only the desired mode. The purpose of this was to create a baseline adaptable to any use-case or desired commercial application.
|
|
|
|
|
|
The architecture of this platform begins with the deployed web-apps posting interaction data to our backend which processes them and stores each ingested interaction into a kafka cluster. This serves as our data reservoir tracking and associating each interaction with its session and importantly with which experiment it belongs to. Not only do we track the behavioral interactions, but our pricing provider micro-service, once called by the frontend reports the observed/queried price-product into kafka. This kafka cluster is subscribed to by our pipeline which is configured on a schedule in Airflow, with the possibility of manual trigger. The final stage of the pricing pipeline, submits computed dynamic pricing results into a redis database for quick updates which is then read by the pricing provider and displayed on the webapp. This is a very generic end-to-end mechanism which is applicable to a variety of different e-commerce tasks. We intentionally put emphasis on the development of this infrastructure to establish a reproducible framework for interaction and to minimize any noise.
|
|
|
|
|
|
\subsubsection{DevOps Principles}
|
|
|
|
\subsubsection{Online Dynamic Pricing}
|
|
|
|
The dynamic pricing done is handled by a pipeline which computes a demand estimate on a per-product basis of a specific window of the data, defined by the period $T$ which by default is 5 minutes. This dynamic pricing pipeline computes a demand estimate vector $\hat{q} \in \mathbb{R}^N$ by a weighted sum of interactions for each product, it additionally computes a price elasticity vector $\hat{\epsilon}$ in the same dimensions as our demand. The final features matrix is of the size $N \times 2$ which we translate to a new price vector $\hat{p} \in \mathbb{R}^N$. The transformation that governs this dynamic pricing is a very simple surge-based pricing (a special case of our later defined policy $\pi$):
|
|
|
|
\begin{equation}
|
|
\hat{p}_i = \begin{cases}
|
|
p_{0,i} \cdot \lambda_{\text{surge}} & \text{if } \hat{q}_i \geq \theta_{\text{high}} \\
|
|
p_{0,i} \cdot \lambda_{\text{disc}} & \text{if } \hat{q}_i \leq \theta_{\text{low}} \\
|
|
p_{0,i} & \text{otherwise}
|
|
\end{cases}
|
|
\quad \forall i \in \{1, \ldots, N\}
|
|
\end{equation}
|
|
|
|
where $p_0 \in \mathbb{R}^N$ is the base price vector (which is seeded into our database distinctly for each mode of the commerce platform), $\theta_{\text{high}}, \theta_{\text{low}} \in \mathbb{R}$ are demand thresholds defining surge and discount regions, and $\lambda_{\text{surge}}, \lambda_{\text{disc}} \in \mathbb{R}^+$ are multiplicative factors with typical values $\lambda_{\text{surge}} = 1.2$ and $\lambda_{\text{disc}} = 0.9$. This piecewise function enables rapid price adjustment in response to observed demand without requiring complex elasticity estimation or historical calibration, allowing us to expose actors within our experiments to a system with a dynamic component of pricing.
|
|
|
|
We will for our offilne experimental intents generalize a master function for encompasing distinct demand estimation and pricing strategies.
|
|
|
|
\begin{align}
|
|
V(\cdot) = \max_{p_t} \min_{Q \in \mathcal{U}(\hat{d})}{\mathbb{E}_{d\sim Q} [p_t \times d(p_t, x_t ; \theta) + \psi V_{t+1}(\cdot)]}
|
|
\end{align}
|
|
|
|
We follow differnet substitutouns which will server as hyperparameters later on.
|
|
|
|
\subsection{Experimental Design}
|
|
|
|
The experimentation begins with the design of goals, with careful consideration to assure a uniform spanning across different variables within each product-architecture of either the hotel or airline platforms. Our crafted collection of goals (jobs to be done) is then tracked in a postgress database with one table to track goals and another table to track different experiment runs, and their associated goals in a experiment-goal one-to-one relationship.
|
|
|
|
The purpose of this effort to gather data on interactions, is the first half of our research. With this collected data on behavioral characteristics, enhanced by our feature augmentation, we can create distribution separation into two bins $y \in \{A,H\}$ with a certain probability $p$ dependent on the session-specific features. To address the second loop of our system, we use this gained capability of discrimination to enhance the learner design involved in our surrogate dynamic pricing task which simulates an independent dynamic pricing scenario under which we can train a more controlled policy with the ability to account for true demand signals under conditions of contamination from non-human actors.
|
|
|
|
Our approach can be well summarized by a three-stage division, first we intend to observe and \textit{vectorize} the behavioral interaction data from our experiments, we then develop the separability which helps us deepen the semantic understanding of the behavioral patterns. Finally we use our newly gained learner to leverage a defensive mechanism within the simulation stage of a controlled dynamic pricing loop.
|
|
|
|
\begin{figure}[ht]
|
|
\resizebox{\columnwidth}{!}{%
|
|
\input{chapters/loop_figure.tex}
|
|
}
|
|
\caption{Overview of the Dynamic Pricing Tasks.}
|
|
\end{figure}
|
|
|
|
Our web platform (developed in similar patterns as the RecSim by \textcite{ie_recsim_2019}) allows us to setup a controled environment in which we assign tasks to human and agentic actors which are then carried out. Each actor gets a browser assigned experiment identification which is persistent across possibly multiple session identifiers. We then group by experiments and extract all the session interactions (trajectories) which follow the schema formalized below.
|
|
|
|
\subsubsection{Interaction Schema}
|
|
|
|
We extend the basic event tuple $e_{s,k}$ to capture the full observational signal available to the platform. An interaction event is defined as the extended tuple:
|
|
\begin{equation}
|
|
e_{s,k} = \left( a_{s,k}, \, i_{s,k}, \, t_{s,k}, \, \mu_{s,k}, \, \delta_{s,k} \right)
|
|
\end{equation}
|
|
where $\mu_{s,k} \in \mathcal{M}$ is a metadata record containing action-specific context (e.g., price observed, filter parameters, element text), and $\delta_{s,k} \in \mathbb{R}_+$ is the dwell time in milliseconds for attention-based actions.
|
|
|
|
A session $s$ is itself a structured record:
|
|
\begin{equation}
|
|
s = \left( \text{sid}, \, \text{eid}, \, t_0, \, \phi, \, \mathcal{U}, \, \tau_s \right)
|
|
\end{equation}
|
|
where $\text{sid}$ is a unique session identifier (UUID), $\text{eid}$ optionally links to an experiment, $t_0$ is the session start timestamp, $\phi \in \{\texttt{hotel}, \texttt{airline}\}$ denotes the platform mode, $\mathcal{U}$ is the user-agent string, and $\tau_s$ is the trajectory of events.
|
|
|
|
The action space $\mathcal{A}$ is partitioned into four semantic categories based on the behavioral signal each action conveys:
|
|
|
|
\begin{table}[ht]
|
|
\centering
|
|
\caption{Action space partition $\mathcal{A} = \mathcal{A}_{\text{nav}} \cup \mathcal{A}_{\text{cart}} \cup \mathcal{A}_{\text{filter}} \cup \mathcal{A}_{\text{dwell}}$ with signal interpretation.}
|
|
\label{tab:action_space}
|
|
\begin{tabular}{@{}llll@{}}
|
|
\toprule
|
|
\textbf{Category} & \textbf{Actions} & \textbf{Signal} & $\boldsymbol{\omega}$ \\
|
|
\midrule
|
|
$\mathcal{A}_{\text{cart}}$ & \texttt{add\_item}, \texttt{remove}, \texttt{checkout}, \texttt{purchase} & Purchase intent & High \\
|
|
$\mathcal{A}_{\text{dwell}}$ & \texttt{hover\_title}, \texttt{hover\_paragraph}, \texttt{hover\_link} & Sustained attention & Medium \\
|
|
$\mathcal{A}_{\text{nav}}$ & \texttt{page\_view}, \texttt{view\_item}, \texttt{learn\_more} & Discovery & Low \\
|
|
$\mathcal{A}_{\text{filter}}$ & \texttt{search}, \texttt{filter\_date}, \texttt{filter\_price}, \texttt{sort} & Preference refinement & Lowest \\
|
|
\bottomrule
|
|
\end{tabular}
|
|
\end{table}
|
|
|
|
This partition enables the weight function $\omega$ from Eq.~\ref{eq:qhat} to assign category-specific signal strengths, with $\omega(\mathcal{A}_{\text{cart}}) > \omega(\mathcal{A}_{\text{dwell}}) > \omega(\mathcal{A}_{\text{nav}}) > \omega(\mathcal{A}_{\text{filter}})$ reflecting decreasing commitment.
|
|
|
|
The metadata record $\mu$ varies by action type. For product views, $\mu$ contains the observed price $p_{\text{obs}}$ and product attributes. For dwell events, $\mu$ includes the element text and accumulated hover duration. This heterogeneous structure is captured via a schema-on-read approach in our Kafka ingestion pipeline, where events are validated against type-specific schemas before storage.
|
|
|
|
In addition to behavioral events, the platform logs price observations to a separate Kafka topic. Each price query generates a record $(i, p, \text{sid}, \phi, t)$ associating the product, displayed price, requesting session, platform mode, and timestamp. This dual-stream architecture enables joint analysis of price exposure and behavioral response.
|
|
|
|
|
|
\subsection{Generative Contamination and Separability}
|
|
|
|
To develop a robust pricing learner, we require a simulation environment capable of generating realistic, contaminated interaction data. We achieve this by learning from our Phantom platform data using a two-stage approach.
|
|
|
|
|
|
|
|
\subsubsection{GOFAI-Based Separability}
|
|
We employ Good Old-Fashioned AI (GOFAI) heuristics to generate initial weak labels for separability. We define a set of rule-based predicates $\phi_j: \tau \to \{0, 1\}$ to partition the dataset $\mathcal{D}$ into high-confidence sets $\mathcal{D}_H$ and $\mathcal{D}_A$. We construct distinct MDPs per each behavioral profile of humans and agents and from those we establish $D_{KL}$. From initial findings we compute a KL divergence of $\approx 2.0236$ across transition probabilities between states which can be seen in \ref{fig:human_mdp_viz} and \ref{fig:agent_mdp_viz}.
|
|
|
|
\begin{definition}[Kullback-Leibler Divergence for Transition Distributions]
|
|
Let $P_e$ and $Q_e$ be categorical distributions over destination states following event $e$, derived from human and agent trajectories respectively. The KL divergence between these distributions is:
|
|
\begin{equation}
|
|
D_{\mathrm{KL}}(P_e \parallel Q_e) = \sum_{k \in \mathcal{S}_e} P_e(k) \log \frac{P_e(k)}{Q_e(k)}
|
|
\end{equation}
|
|
where $\mathcal{S}_e$ denotes the set of destination events that follow $e$ in the human trajectories.
|
|
\end{definition}
|
|
|
|
To obtain this statistic we aggregate state transitions by their triggering event $e$ and treat the normalized outgoing probabilities as the categorical distributions $P_e$ (human) and $Q_e$ (agent). The computation intersects the event labels observed in both datasets, then iterates over each label and accumulates the log-ratio score. In practice this is implemented exactly as in models: for each destination $k$ we multiply the human probability by the log of the probability ratio and add the result to the running sum. Large contributions (including the case where $Q_e(k)$ is near zero) point to intents, such as rapid checkout or repeated navigation, that the agent policy fails to reproduce and therefore drive the contamination analysis.
|
|
|
|
With this divergence we train a contrastive learning method to estimate a weak probability of a given trajectory being an agent $f(\cdot) \to [0,1]$ which we can use as a leverage for a weighted sum. This is a first attempt at a more informed separability.
|
|
|
|
|
|
\subsubsection{Transition Probability Estimation}
|
|
\label{sec:tpe}
|
|
|
|
|
|
For both subsets, we model the session dynamics as a Markov Decision Process (MDP) and estimate the transition kernel $\mathcal{T}$. for each respective actor type we define $\hat{\mathcal{T}}_A$ and $\hat{\mathcal{T}}_H$ which are the general transition kernels subject to clustering into $\hat{\mathcal{T}}_y^i$ where $\forall i \in \text{behavioral clusters of } \hat{\mathcal{T}}_y$. This is done to avoid a lumping of all actor behavior and allows for more intral-class penalization. The probability of transitioning to state $s'$ given state $s$ is estimated via maximum likelihood:
|
|
\begin{equation}
|
|
\hat{P}(s' \mid s) = \frac{N(s, s')}{\sum_{k \in \mathcal{S}} N(s, k)}
|
|
\end{equation}
|
|
where $N(s, s')$ is the count of observed transitions. This allows us to construct a \textit{Contamination Generator} $\mathcal{G}(\alpha)$. In addition, given a clean trajectory dataset, $\mathcal{G}$ injects synthetic agent trajectories sampled from the learned transition matrix $\hat{P}_A$ until the effective mixing ratio reaches $\alpha$. From these transition probabilities we can observe an important feature which contributes to a differentiating assumption, which is that the mouse-behavior of an agent is almost non existent and therefore not utilized as a distinguishing factor both in the prior separability nor in any feature engineering.
|
|
|
|
\begin{figure}[ht]
|
|
\centering
|
|
\includegraphics[width=0.8\textwidth]{chapters/mdp_human.pdf}
|
|
\caption{Markov Decision Process visualization illustrating the behavioral transition dynamics for human actions.}
|
|
\label{fig:human_mdp_viz}
|
|
\end{figure}
|
|
|
|
\begin{figure}[ht]
|
|
\centering
|
|
\includegraphics[width=0.8\textwidth]{chapters/mdp_agent.pdf}
|
|
\caption{Markov Decision Process visualization illustrating the behavioral transition dynamics for \textbf{agent} behavior profiles. The state space and transition probabilities are learned from observed session trajectories to enable generative contamination.}
|
|
\label{fig:agent_mdp_viz}
|
|
\end{figure}
|
|
|
|
|
|
\subsection{Stronger Classification}
|
|
We re-map the current event schema semantically to the event schema of another dataset. Our contaminated dataset is then used in another classifier where we can now also apply better feature engineering on other features while assigning correct lables to the entire dataset so the new dataset can be contaminated with $\mathcal{G}$ under some different contamination ratio $\alpha$.
|
|
|
|
This new classified can then be used in the reinforcement learning reward structure.
|
|
|
|
|
|
\subsection{Distributionally Robust Reinforcement Learning (DR-RL)}
|
|
|
|
We formulate the pricing problem as a Stackelberg Game where the Platform (Leader) sets prices $p_t$ and the Aggregate Demand (Follower) responds. However, the exact mixing parameter $\alpha$ and the demand distribution shift are non-stationary and unknown in online settings. Relying on a simple error term $\epsilon$ is insufficient. Instead, we adopt a Distributionally Robust Optimization (DRO) objective. To formulate the entire dependency chain from the trajctory $\tau^\prime$ which is a newly observed trajectory observed by the platform and generated by an unknown actor type (sampled over a behavioral profile defined in section \ref{sec:tpe}). As part of the dynamic pricing we need a mapping of demand parameterized by a trajectory and a price $\hat{Q}(p, \tau^\prime)$. For an observed trajectory we compute a new $\hat{\mathcal{T}}^\prime$ and using a baseline controlled observations of both $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$ we can compute during inference time the following:
|
|
|
|
\begin{align}
|
|
\label{eq:delta_H}
|
|
\Delta_H &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_H) \\
|
|
\label{eq:delta_A}
|
|
\Delta_A &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_A)
|
|
\end{align}
|
|
|
|
This creates two centroid-like heuristics which can on a per-session granularity basis guide our mixing paramtere $\alpha$.
|
|
|
|
\subsubsection{Ambiguity Set Construction}
|
|
We define an ambiguity set $\mathcal{U}_p(\hat{P}_N)$ centered around our empirical reference distribution $\hat{P}_N$ (derived from the generator $\mathcal{G}$). We utilize the Wasserstein distance metric to define the set of plausible demand distributions the agent might face:
|
|
\begin{equation}
|
|
\mathcal{U}_\epsilon(\hat{P}_N) = \left\{ Q \in \mathcal{P}(\Xi) : W_p(Q, \hat{P}_N) \le \epsilon \right\}
|
|
\end{equation}
|
|
This set captures all distributions that are statistically close to our observed training data but allows for adversarial shifts.
|
|
|
|
\subsubsection{The Min-Max Objective}
|
|
The robust policy $\pi^*$ is obtained by solving the maximin problem:
|
|
\begin{equation}
|
|
\label{eq:robust_policy}
|
|
\pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}(p) \right]
|
|
\end{equation}
|
|
where $R(p, d)$ is the revenue function and $\lambda$ weighs the penalty for information leakage (COI). We previously defined $\text{COI}$, however to properly connect this concept into the reward structure we need to define a parametrized version which informs us of the leakage of said structure with $\text{COI}(p)$.
|
|
|
|
Another proposed formulation of the optimal policy would be to adjust the ambiguity set dyanmically over the live computed divergence where $\epsilon(\Delta_H)$ to adjust the ball around or estimator according to each behavioral signal emited through a given trajctory. We state this as a possibility but do not peruse it due to literature suggesting that wesserstine methods do not require absolute continuity and are better with ``black swans'' \parencite{kuhn_wasserstein_2024}.
|
|
|
|
\subsubsection{Actor Implementation}
|
|
In our simulation, the "Follower" is implemented as a set of Actors. Each Actor is initialized with a type $\theta$ which samples a specific demand curve $d(p; \theta)$ from the latent distribution. This formalization ensures that our DR-RL agent does not overfit to a single deterministic demand function but learns a policy robust to the distributional uncertainty defined by $\mathcal{U}_\epsilon$.
|
|
|
|
|
|
As part of our reward engineering we think about the UX factor ($UX \in [0,1]$) whic his our proxy for user experience degradation, this is computed as a mixture of contribution from the separability model metric of $\frac{1}{\text{Specificity}}$.
|
|
|
|
\begin{figure}[ht]
|
|
\centering
|
|
\resizebox{0.5\columnwidth}{!}{%
|
|
\input{chapters/balance_figure.tex}
|
|
}
|
|
\caption{Introducing the UX index allows us to better distinguish the kind of impact different methods have and allows us to compare them on this Pareto-like scale.}
|
|
\end{figure}
|
|
|
|
We also need to think about a policy like taxation to the agents Strategy-Proof Mechanism Design, specifically the Vickrey-Clarke-Groves (VCG) payment rule. We link and prove that this would create an incentive for the dominant strategy to become truth-telling.
|
|
|
|
\subsubsection{Pricing Mechanism Summary}
|
|
|
|
We now present the complete pricing mechanism that integrates the behavioral separability, contamination estimation, and robust optimization components developed in the preceding sections. Algorithm~\ref{alg:phantom_pricing_loop} formalizes the defensive pricing loop as a Stackelberg game where the platform (leader) sets prices and the aggregate demand (follower) responds through observed session trajectories.
|
|
|
|
\begin{algorithm}[t]
|
|
\caption{PHANTOM defensive pricing loop (bachelor-thesis level)}
|
|
\label{alg:phantom_loop_clean}
|
|
\DontPrintSemicolon
|
|
\SetKwInOut{Input}{Input}\SetKwInOut{Output}{Output}
|
|
|
|
\Input{catalog size \(N\); costs \(c\); reference prices \(p^{ref}\); behavior models \(\bar T_H,\bar T_A\);
|
|
action weights \(\omega\); penalty \(\lambda\); horizon \(T\); sessions per step \(M\)}
|
|
\Output{price/demand trajectory \(\{(p_t,\hat Q_t,\hat\alpha_t)\}_{t=0}^{T-1}\)}
|
|
|
|
Initialize contamination estimate \(\hat\alpha \leftarrow 0.2\)\;
|
|
|
|
\For{\(t \leftarrow 0\) \KwTo \(T-1\)}{
|
|
|
|
set \(p_t \leftarrow \pi(\cdot) \) %c + (1 - \kappa \hat\alpha)\,(p^{ref}-c)\)\;
|
|
and clip \(p_t\) to a feasible range (e.g., near cost up to a max margin)\;
|
|
|
|
|
|
\(\hat Q_t \leftarrow 0\), \(\mathcal S_t \leftarrow \emptyset\); \tcp{Observe sessions and compute demand proxy (Eq.~2)}
|
|
\For{\(m \leftarrow 1\) \KwTo \(M\)}{
|
|
sample a session trajectory \(\tau_m\) using \(\bar T_H\) or \(\bar T_A\)\;
|
|
\(\hat Q_t \leftarrow \hat Q_t + \sum_{k}\omega(a_{m,k})\)\;
|
|
\(\mathcal S_t \leftarrow \mathcal S_t \cup \{\tau_m\}\)\;
|
|
}
|
|
|
|
\tcp{Estimate contamination from behavioral separability}
|
|
compute \(\hat\alpha \leftarrow \frac{1}{M}\sum_{\tau\in\mathcal S_t} \Big[\sigma\big(\beta(\Delta_H(\tau)-\Delta_A(\tau))\big)\Big]\)\;
|
|
|
|
compute \(J_t \leftarrow \text{Revenue}(p_t,\hat Q_t) - \lambda\cdot \text{COILeak}(\hat\alpha)\)\;
|
|
}
|
|
\end{algorithm}
|
|
|
|
|
|
The algorithm operates in discrete epochs indexed by $t$. At each epoch, the platform publishes prices (leader move), observes the resulting session trajectories (follower response), and updates its contamination estimate based on behavioral divergence from the learned human and agent transition kernels $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$. The history buffer $\mathcal{L}$ (termed ``Limbo'' in our implementation) enforces the alternating Stackelberg structure by maintaining the temporal sequence of price publications and demand observations.
|
|
|
|
%The defensive price update in Line 24 implements a contamination-aware margin shrinkage: as the estimated agent contamination $\hat{\alpha}_t$ increases, the margin $(p^{\mathrm{ref}} - c)$ is proportionally reduced by factor $\kappa \in [0,1]$, with projection $\Pi_{\mathcal{P}}$ ensuring prices remain within the feasible set $\mathcal{P}$. In subsequent experiments, this heuristic update is replaced by the DR-RL policy $\pi^*$ from Eq.~\ref{eq:robust_policy}, which optimizes against the Wasserstein ambiguity set $\mathcal{U}_\epsilon$ rather than relying on a fixed margin adjustment rule.
|
|
|
|
\section{Heuristics as part of neuro-inspired steering systems}
|
|
|
|
Steve Burns, superior culliculus (face heuristics) we create this sort of part of the 'brain' + amortized inference.
|
|
|
|
We could say that a DQN for example is the learnin subsystem and then within our reward mechanism or some other computational method we introduce a steering subsystem which acts as the proposed ``pricing heuristic'' against the given non human transaction data.
|
|
|
|
\section{Market construction}
|