extra math formulations and reference to DRO

2026-06-01 00:53:36 +00:00 · 2025-12-16 11:49:04 +01:00
parent 0a1149b460
commit 1faabf627c
3 changed files with 57 additions and 16 deletions
--- a/paper/src/chapters/03-methodology.tex
+++ b/paper/src/chapters/03-methodology.tex
@@ -13,17 +13,25 @@ What the platform observes is the interaction logs $\tau_s$, price query logs an

 Each interaction $i$ gives us some information about the willingness to pay ($v$) of a given customer, which we can try to estimate and measure against the true baseline.

-$$
+\begin{equation}
 I(\tau) = \mathbb{E}[v \vert \tau] - \mathbb{E}[v]
-$$
+\end{equation}

 This lets us formalize the quality of our proxy $\hat{v}$ about the true $v$ from observing $\tau$ from any session $s$

 \subsubsection{Proxy Definition for Demand Estimation}
 Our proxy estimator is a critical component with direct impact on all downstream tasks. We start with a mapping of weights $\omega: \mathcal{A} \to \mathbb{R}_+$, where for an epoch $t$ and product $i$ the observed demand proxy of a session $s$ is:
-$$
+
+\begin{equation}
 \hat{q}_{t,i} = \sum_{e_{s,k}\in t} \omega(a_{s,k}) \cdot \mathbf{1} [i_{s,k}=i]
-$$
+\end{equation}
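
The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the action names, the weight values in `OMEGA`, and the `(action, product_id)` event layout are assumptions and do not come from the platform's actual schema.

```python
from collections import defaultdict

# Hypothetical action weights omega(a); the values are illustrative only.
OMEGA = {"view": 0.1, "price_query": 0.3, "add_to_cart": 0.6, "purchase": 1.0}


def demand_proxy(events, omega=OMEGA):
    """Return {product_id: q_hat} for the events of one epoch.

    Each event is an (action, product_id) pair. The indicator 1[i_{s,k} = i]
    is realized by accumulating each weight under the event's own product.
    Actions without a weight contribute zero.
    """
    q_hat = defaultdict(float)
    for action, product_id in events:
        q_hat[product_id] += omega.get(action, 0.0)
    return dict(q_hat)


# Toy epoch with two products.
epoch_events = [
    ("view", "A"),
    ("add_to_cart", "A"),
    ("purchase", "B"),
    ("view", "B"),
]
proxy = demand_proxy(epoch_events)
```

Keeping $\omega$ as a plain lookup table makes it easy to swap in a different weighting without touching the aggregation.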
+
+
+
+\subsubsection{Game Theoretic Approach: A Stackelberg Game}
+
+This game models the interaction between the pricing system and non-human actors as a Leader-Follower dynamic with partial observability. The structure captures the hierarchical nature of the problem: the platform is the Leader and the actor is the Follower, and both humans and agents observe the prices set by the platform's policy and react strategically.
+


 \subsection{Cost of Information Framework}
@@ -31,9 +39,9 @@ $$

 The Cost of Information proposed in our research serves as proxy to understand and represent the complex interaction patterns between humans and agents. It is the expected markup a platform applies to a product from derived demand signals.

-$$
+\begin{equation}
 COI(\tau) = \mathbb{E}[p(\tau)] - p_0
-$$
+\end{equation}

 Where the $p_0$ vector is both the initial state of the system and the base price for each product. We also define the price vector at any time $t$ as $p_t \in \mathbb{R}_+^N$, constrained to the set $\{p \in \mathbb{R}_+^N \vert \quad \underline{p} \leq p \leq \overline{p}\}$, which acts as our business constraint and limits prices to the range $[\underline{p}, \overline{p}]$. We treat $p_t$ as the price vector shown to an actor both experimentally and in-simulation.
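
As a rough illustration of the COI definition and the price bounds, the sketch below approximates $\mathbb{E}[p(\tau)]$ by the empirical mean of the price vectors shown during an episode. That approximation, the function names, and the numbers are assumptions made for the example.

```python
def clip_prices(prices, lower, upper):
    """Project a price vector onto the box [lower, upper], element-wise."""
    return [min(max(p, lo), hi) for p, lo, hi in zip(prices, lower, upper)]


def cost_of_information(shown_prices, base_prices):
    """Per-product mean shown price minus the base price p_0.

    shown_prices is a list of price vectors, one per step; the mean over
    steps stands in for the expectation E[p(tau)].
    """
    n_steps = len(shown_prices)
    return [
        sum(step[i] for step in shown_prices) / n_steps - base
        for i, base in enumerate(base_prices)
    ]


p0 = [10.0, 20.0]
shown = [[10.0, 20.0], [12.0, 20.0], [14.0, 23.0]]
coi = cost_of_information(shown, p0)

# A proposed price vector outside the bounds is pulled back into range.
clipped = clip_prices([8.0, 30.0], lower=[9.0, 18.0], upper=[15.0, 25.0])
```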

@@ -87,9 +95,12 @@ Study methodology and approach. Data acquisition strategy. Defined objectives an

 With data collected from our platform we have a series of observed interactions, with each interaction having a mapping to a specific \texttt{sessionId} and \texttt{experimentId} which allows us to join all components of the experiment design into an information rich feature vector for each session in our observed data. To develop more explicitly the demand estimation, we propose a decomposition of the proxy $\hat{q}_t$ into two latent components:

-$$
-\hat{q}_t = \hat{q}_t^H + \hat{q}_t^A
-$$
+\begin{equation}
+\hat{q}_t = \hat{q}_t^H + \hat{q}_t^A + \epsilon_t
+\end{equation}
+
+Additionally, we account for random market noise $\epsilon_t$. We define $\hat{q}_t^H$ as the true signal carrying conversion intent and treat the agent component $\hat{q}_t^A$ as adversarial noise.
+


 \subsubsection{Feature Development}
@@ -102,6 +113,16 @@ On the other hand, a more lax system without detection (myopic) defines the lowe


 \subsection{Dynamic Pricing Algorithm Analysis}
+
+We quantify agent contamination by $\alpha \in [0,1]$, the proportion of traffic generated by agents. The observed signal can then be parameterized as:
+
+
+\begin{equation}
+\hat{q}_t = (1-\alpha) \cdot \hat{q}_t^H + \alpha \cdot \hat{q}_t^A + \epsilon_t
+\end{equation}
+
+A standard dynamic pricing algorithm implicitly assumes $\alpha = 0$ and estimates demand as $\hat{D}(p) \approx \mathbb{E}[\hat{q} \vert p]$. In the presence of agents $\alpha$ is non-zero, so the estimator becomes biased, which gives rise to the COI defined above.
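
To make the source of the bias explicit, the contaminated signal can be expanded in expectation. The sketch below rests on two assumptions that the text does not state: the noise is zero-mean given the price, $\mathbb{E}[\epsilon_t \vert p] = 0$, and the human component $\hat{q}_t^H$ is the quantity the estimator is meant to recover. It also assumes the \texttt{amsmath} package for the \texttt{align} environment.

```latex
\begin{align}
\mathbb{E}[\hat{q}_t \vert p]
  &= (1-\alpha)\,\mathbb{E}[\hat{q}_t^H \vert p]
   + \alpha\,\mathbb{E}[\hat{q}_t^A \vert p]
   + \mathbb{E}[\epsilon_t \vert p] \\
\mathbb{E}[\hat{q}_t \vert p] - \mathbb{E}[\hat{q}_t^H \vert p]
  &= \alpha \left( \mathbb{E}[\hat{q}_t^A \vert p] - \mathbb{E}[\hat{q}_t^H \vert p] \right)
\end{align}
```

Under these assumptions the bias is linear in $\alpha$. It vanishes when $\alpha = 0$ or when agents and humans produce the same expected signal at price $p$.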
+
 Deep dive into how the algorithm works, different kinds and justification for chosen approaches + agent impact modeling and quantification.

 \subsection{Reinforcement Learning Formulation}
@@ -109,18 +130,28 @@ Deep dive into how the algorithm works, different kinds and justification for ch
 We define our surrogate commercial environment within which we can accurately control for all the variables such as the true demand, providing a clear transparency of the entire system. We start with a product catalogue of size $N$ with random supply initialization per-product. At every step the commercial simulation receives a price vector $p$ according to which we simulate a set of interactions $I$ with a certain proportion $l_a$ of agents contributing interactions. The interactions serve as a proxy to estimating the true demand $q(p)$ which is composed of two separate demand generators $q_A(p)$ and $q_H(p)$.
 On top of this our gym environment has a built demand estimator callback which is defined individually by each pricing engine. This engine is constructed to interact with the gym environment with the gym environment at each step running a cycle via the commercial environment, creating an observation of all the interactions $I$ and a baseline vector which tells us the ground truth of demand, sales statistic and revenue. The engine is then responsible for learning the pricing policy providing a pricing vector $p_{t+1}$ motivated by a per-episode summary reward composed by.

-$$
+
+\begin{equation}
 R = \text{revenue} - \text{COI} - \text{UX friction index}
-$$
+\end{equation}


-As part of our reward engineering we want to take into account the cost of information in our reward with a weight.
+As part of our reward engineering we include the cost of information in the reward with a weight. As in most other dynamic pricing systems, regret is commonly used to guide policy development, and in our case it is well suited to comparing ground-truth and estimated demand. We define regret as the revenue lost relative to an oracle with access to perfect information.
+
+\begin{equation}
+ \text{Regret}(\pi) = TR(\pi_\text{oracle}) - TR(\pi)
+\end{equation}
+% TR= total revenue
+% Regret is the revenue loss compared to oracle with perfect information:
+
+We also need a regret bound.
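
A minimal sketch of the regret computation for a single product follows, with $TR$ taken as the sum of price times realized demand over an episode. The linear demand curve `human_demand` and both price sequences are hypothetical, chosen only so that the oracle price is easy to verify by hand.

```python
def total_revenue(prices, demand_fn):
    """TR: sum of price * realized demand over the steps of an episode."""
    return sum(p * demand_fn(p) for p in prices)


def regret(policy_prices, oracle_prices, demand_fn):
    """Revenue lost by the policy relative to the oracle on the same demand."""
    return total_revenue(oracle_prices, demand_fn) - total_revenue(
        policy_prices, demand_fn
    )


def human_demand(p):
    """Hypothetical true demand q_H(p) = max(0, 10 - p), for illustration."""
    return max(0.0, 10.0 - p)


# With this demand curve, revenue p * (10 - p) is maximized at p = 5.
oracle_prices = [5.0, 5.0, 5.0]
policy_prices = [4.0, 6.0, 7.0]
episode_regret = regret(policy_prices, oracle_prices, human_demand)
```

In the simulated environment the oracle has access to the ground-truth demand, so the same comparison can be run per episode against any pricing engine.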


 Our pricing engine can be modeled by the mapping:
-$$
+
+\begin{equation}
 \pi : \mathbb{R}^N_+ \times \mathcal{H}_t \to \mathbb{R}_+^N
-$$
+\end{equation}

 where $\mathcal{H}_t$ is the history and state we keep track of, allowing us to define a progression of prices as $p_{t+1} \gets \pi(\hat{q}_t,\mathcal{H}_t)$. With this we can establish that $\tau$ influences $p_{t+1}$ through $\hat{q}_t$