extra math formulations and reference to DRO

This commit is contained in:
2025-12-16 11:49:04 +01:00
parent 0a1149b460
commit 1faabf627c
3 changed files with 57 additions and 16 deletions

View File

@@ -96,3 +96,9 @@
title = {A Reinforcement Learning Approach to Dynamic Pricing},
volume = {16}
}
@techReport{Kuhn2025,
abstract = {Distributionally robust optimization (DRO) studies decision problems under uncertainty where the probability distribution governing the uncertain problem parameters is itself uncertain. A key component of any DRO model is its ambiguity set, that is, a family of probability distributions consistent with any available structural or statistical information. DRO seeks decisions that perform best under the worst distribution in the ambiguity set. This worst case criterion is supported by findings in psychology and neuroscience, which indicate that many decision-makers have a low tolerance for distributional ambiguity. DRO is rooted in statistics, operations research and control theory, and recent research has uncovered its deep connections to regularization techniques and adversarial training in machine learning. This survey presents the key findings of the field in a unified and self-contained manner.},
author = {Daniel Kuhn and Soroosh Shafiee and Wolfram Wiesemann},
title = {Distributionally Robust Optimization},
year = {2025}
}

View File

@@ -21,7 +21,7 @@ The key stakeholders affected by the threat of increasing agent-driven traffic i
The industry has already seen legal action in cases like Amazon against Perplexity \cite{AmazonvsPerplexity}, stemming from the difficulty of identifying traffic from hybrid systems like the Comet browser. This paper explores such systems to better understand what the interaction data looks like and what it means for dynamic pricing and recommendation systems downstream. This observed impact indicates a need for prevention of secondary negative effects on the ``legacy'' systems which power modern revenue sources for many companies. Dynamic pricing algorithms rely on directly translating demand features $q$ to new price assignments $\hat{p}$ across a catalogue of products of size $N$.
\subsection{Solution Space Overview}
Dynamic pricing systems, as presented in \cite{Mueller2019}, often deal with sparse low-rank data of demand signals which, combined with contamination from agents, creates complex interactions that impact pricing. To further complicate the problem, certain commercial settings such as the one presented in \cite{Amjad2017} must address the true demand of products under censored observations. This provides a formulation for handling demand in our case with multiple kinds of commercial mediators: $\hat{q} \gets q_A + q_H$ where $q_A$ represents the distribution of demand generated by agentic mediators and $q_H$ represents that of true human demand.
Dynamic pricing systems, as presented in \cite{Mueller2019}, often deal with sparse low-rank data of demand signals which, combined with contamination from agents, creates complex interactions that impact pricing. To further complicate the problem, certain commercial settings such as the one presented in \cite{Amjad2017} must address the true demand of products under censored observations. This provides a formulation for handling demand in our case with multiple kinds of commercial mediators: $\hat{q} \gets q_A + q_H$ where $q_A$ represents the distribution of demand generated by agentic mediators and $q_H$ represents that of true human demand; these are two distinct populations with divergent objective functions.
We formally define interaction data as coming from some actor which can either be an agent ($A$) or human ($H$). For purposes of this research, an agent is an algorithmic loop with the ability to access a web platform and perform actions such as clicks, scrolls, and input field fills. The loop terminates when the internal large language model judges the provided task definition as complete. A detailed breakdown can be found in \cref{algagent-loop}.
@@ -54,4 +54,8 @@ Extract final result $r$ from terminal state\;
\end{algorithm}
The previously described goal of separability allows us to formulate a task which entails taking raw interaction data for either actor and creating a composite demand estimate $\hat{q}$.
The previously described goal of separability allows us to formulate a task which entails taking raw interaction data for either actor and creating a composite demand estimate $\hat{q}$. We propose a robust optimization objective defined in our methodology, transforming the pricing problem into a form of Distributionally Robust Optimization \cite{Kuhn2025} where the learner must guard against adversarial contamination in the observed demand distributions.
% A Distributionally Robust Optimization (DRO) problem is fundamentally about making decisions that perform well not just for a single estimated probability distribution, but for any distribution within a plausible set (called the "Ambiguity Set").
% In standard optimization, you assume you know the distribution of your data (e.g., "Demand is Gaussian with mean μ") and you optimize for the average case. In DRO, you admit you don't know the exact distribution—perhaps the mean shifts, or the tail is heavier. You optimize for the worst-case distribution within your uncertainty set.
% The observed demand q^t is a mixture of two distributions. The parameter αt (the percentage of traffic that is non-human) is unknown and non-stationary. It defines the distribution of the data you observe.
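As a sketch of the general shape such an objective could take (the symbols $\ell$, $\mathcal{P}$, and $\xi$ are illustrative assumptions, not definitions from our methodology), a generic DRO problem chooses a decision, here a price vector $p$, that performs best under the worst distribution in an ambiguity set $\mathcal{P}$:
\begin{equation}
\min_{p} \; \sup_{\mathbb{Q} \in \mathcal{P}} \; \mathbb{E}_{\xi \sim \mathbb{Q}} \left[ \ell(p, \xi) \right]
\end{equation}
where $\xi$ denotes the uncertain demand and $\ell$ is a loss such as negative revenue. One plausible choice of $\mathcal{P}$ in our setting would be the family of demand mixtures induced by an unknown share of agent traffic.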

View File

@@ -13,17 +13,25 @@ What the platform observes is the interaction logs $\tau_s$, price query logs an
Each interaction $i$ gives us some information about the willingness to pay ($v$) of a given customer, which we can try to estimate and measure against the true baseline.
$$
\begin{equation}
I(\tau) = \mathbb{E}[v \vert \tau] - \mathbb{E}[v]
$$
\end{equation}
This lets us formalize the quality of our proxy $\hat{v}$ for the true $v$ when observing $\tau$ from any session $s$.
\subsubsection{Proxy Definition for Demand Estimation}
Our proxy estimator is a critical component with a direct impact on all downstream tasks. We start with a mapping of weights $\omega: \mathcal{A} \to \mathbb{R}_+$, where for an epoch $t$ and product $i$ the observed demand proxy of a session $s$ is:
$$
\begin{equation}
\hat{q}_{t,i} = \sum_{e_{s,k}\in t} \omega(a_{s,k}) \cdot \mathbf{1} [i_{s,k}=i]
$$
\end{equation}
\subsubsection{Game Theoretic Approach: A Stackelberg Game}
What we define in this game is the interaction between the pricing system and non-human actors in a Leader-Follower dynamic with partial observability. This lets us capture the nature of the problem in a hierarchical manner, with the platform as the Leader and the actor as the Follower, where both humans and agents observe the prices set by the platform's policy and react strategically.
\subsection{Cost of Information Framework}
@@ -31,9 +39,9 @@ $$
The Cost of Information proposed in our research serves as a proxy to understand and represent the complex interaction patterns between humans and agents. It is the expected markup a platform applies to a product based on derived demand signals.
$$
\begin{equation}
COI(\tau) = \mathbb{E}[p(\tau)] - p_0
$$
\end{equation}
where the vector $p_0$ is both the initial state of the system and the base price for each product. We also define the price vector at any time $t$ as $p_t \in \mathbb{R}_+^N$, satisfying a cap $\{p \in \mathbb{R}_+^N \vert \; \underline{p} \leq p \leq \overline{p}\}$ which acts as our business constraint, limiting prices to the range $[\underline{p}, \overline{p}]$. We treat $p_t$ as the price vector shown to an actor both experimentally and in simulation.
@@ -87,9 +95,12 @@ Study methodology and approach. Data acquisition strategy. Defined objectives an
With data collected from our platform we have a series of observed interactions, with each interaction having a mapping to a specific \texttt{sessionId} and \texttt{experimentId} which allows us to join all components of the experiment design into an information rich feature vector for each session in our observed data. To develop more explicitly the demand estimation, we propose a decomposition of the proxy $\hat{q}_t$ into two latent components:
$$
\hat{q}_t = \hat{q}_t^H + \hat{q}_t^A
$$
\begin{equation}
\hat{q}_t = \hat{q}_t^H + \hat{q}_t^A + \epsilon_t
\end{equation}
Additionally, we take into account some degree of random market noise $\epsilon_t$. We formally define $\hat{q}_t^H$ to be the true signal with conversion intent, while the agent component $\hat{q}_t^A$ is adversarial noise.
\subsubsection{Feature Development}
@@ -102,6 +113,16 @@ On the other hand, a more lax system without detection (myopic) defines the lowe
\subsection{Dynamic Pricing Algorithm Analysis}
From the perspective of agent contamination, which we define by $\alpha \in [0,1]$, representing the proportion of traffic generated by agents, the observed signal can be parameterized as:
\begin{equation}
\hat{q}_t = (1-\alpha) \cdot \hat{q}_t^H + \alpha \cdot \hat{q}_t^A + \epsilon_t
\end{equation}
By default, a dynamic pricing algorithm assumes $\alpha = 0$ and estimates demand as $\hat{D}(p) \approx \mathbb{E}[\hat{q} \vert p]$, whereas in the presence of agents $\alpha$ is non-zero. In this case the estimator becomes biased, leading to the emergence of our defined COI.
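To make the bias explicit, a short worked step may help. Assuming $\alpha$ is fixed within the epoch and the noise is mean-zero given the price, $\mathbb{E}[\epsilon_t \vert p] = 0$ (both are assumptions not stated above), taking conditional expectations of the parameterized signal gives:
\begin{equation}
\mathbb{E}[\hat{q}_t \vert p] = (1-\alpha)\,\mathbb{E}[\hat{q}_t^H \vert p] + \alpha\,\mathbb{E}[\hat{q}_t^A \vert p]
\end{equation}
so that, relative to human demand alone, the estimator is off by
\begin{equation}
\mathbb{E}[\hat{q}_t \vert p] - \mathbb{E}[\hat{q}_t^H \vert p] = \alpha \left( \mathbb{E}[\hat{q}_t^A \vert p] - \mathbb{E}[\hat{q}_t^H \vert p] \right)
\end{equation}
Under these assumptions the gap vanishes when $\alpha = 0$ or when agents and humans have identical expected demand, and it grows linearly in $\alpha$ otherwise.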
Deep dive into how the algorithm works, different kinds and justification for chosen approaches + agent impact modeling and quantification.
\subsection{Reinforcement Learning Formulation}
@@ -109,18 +130,28 @@ Deep dive into how the algorithm works, different kinds and justification for ch
We define our surrogate commercial environment, within which we can accurately control for all variables such as the true demand, providing full transparency of the entire system. We start with a product catalogue of size $N$ with random supply initialization per product. At every step the commercial simulation receives a price vector $p$, according to which we simulate a set of interactions $I$ with a certain proportion $l_a$ of agents contributing interactions. The interactions serve as a proxy for estimating the true demand $q(p)$, which is composed of two separate demand generators $q_A(p)$ and $q_H(p)$.
On top of this, our gym environment has a built-in demand estimator callback which is defined individually by each pricing engine. The engine is constructed to interact with the gym environment, which at each step runs a cycle via the commercial environment, creating an observation of all the interactions $I$ and a baseline vector which tells us the ground truth of demand, sales statistics, and revenue. The engine is then responsible for learning the pricing policy, providing a pricing vector $p_{t+1}$ motivated by a per-episode summary reward composed as follows:
$$
\begin{equation}
R = \text{revenue} - \text{COI} - \text{UX friction index}
$$
\end{equation}
As part of our reward engineering we want to take into account the cost of information in our reward with a weight.
As part of our reward engineering we want to take into account the cost of information in our reward with a weight. As seen in most other dynamic pricing systems, regret is most often used to guide the policy development, which in our case serves very well in comparing the ground truth and estimated demand. For us, the regret is the revenue loss compared to the oracle, which has perfect information access.
\begin{equation}
\text{Regret}(\pi) = TR(\pi_{\text{oracle}}) - TR(\pi)
\end{equation}
% TR= total revenue
% Regret is the revenue loss compared to oracle with perfect information:
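One way the total revenue $TR$ could be made concrete, as a sketch only: assuming a finite horizon $T$, sales equal to the true demand $q(p_t)$ with supply limits ignored, and revenue linear in price (none of which is fixed by the text above), we might write
\begin{equation}
TR(\pi) = \sum_{t=1}^{T} p_t^{\top} q(p_t)
\end{equation}
where $p_t$ is the price vector produced by the policy $\pi$ at step $t$. The horizon $T$ and the inner-product form are illustrative assumptions.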
We also need a regret bound
Our pricing engine can be modeled by the mapping:
$$
\begin{equation}
\pi : \mathbb{R}^N_+ \times \mathcal{H}_t \to \mathbb{R}_+^N
$$
\end{equation}
where $\mathcal{H}_t$ is the history and state we keep track of, allowing us to define a progression of prices as $p_{t+1} \gets \pi(\hat{q}_t,\mathcal{H}_t)$. With this we can establish that $\tau$ influences $p_{t+1}$ through $\hat{q}_t$.