catchup: rogue scripts

This commit is contained in:
2026-02-27 12:45:46 +01:00
parent e8a9716f69
commit 5444a4ea13
27 changed files with 6908 additions and 2 deletions

View File

@@ -0,0 +1,69 @@
\section{Problem Formulation: A Stackelberg Game Approach}
\label{sec:math_formulation}
We formalize the interaction between the dynamic pricing system and non-human actors as a \textit{Stackelberg Game} (Leader-Follower) with incomplete information. This framework captures the hierarchical nature of the problem: the Platform (Leader) sets a pricing policy, and the Actors (Followers)---both Humans and Agents---observe these prices and react strategically.
\subsection{The Players and Objectives}
Let $t \in \{1, \dots, T\}$ denote discrete time steps. At each step, the system interactions are defined by the following entities:
\paragraph{1. The Leader (The Platform)}
The e-commerce platform acts as the leader, choosing a pricing policy $\pi$ to maximize total expected revenue. At time $t$, given a state $s_t \in \mathcal{S}$ (representing inventory, time of day, and historical interactions), the platform sets a price $p_t \in [p_{\min}, p_{\max}]$.
The platform's goal is to maximize the cumulative revenue from genuine human transactions while mitigating the distortion caused by agent interactions.
\paragraph{2. The Followers (The Demand Mixture)}
The observed demand is not a monolithic signal but a mixture of two distinct populations with divergent objective functions. Let $u$ denote an incoming actor. The type of the actor $\theta \in \{H, A\}$ is a latent variable, where $H$ denotes a Human and $A$ denotes an Agent.
\begin{itemize}
\item \textbf{The Human ($H$):} Acts as a \textit{myopic utility maximizer}. A human $i$ has a private valuation $v_i$ for the product. They execute a purchase decision $d_i \in \{0, 1\}$ based on the consumer surplus:
\begin{equation}
d_i(p_t) = \mathbb{I}(v_i - p_t \geq 0)
\end{equation}
where $\mathbb{I}(\cdot)$ is the indicator function. The aggregate human demand $q_H(p_t)$ follows a standard downward-sloping demand curve $D(p_t)$.
\item \textbf{The Agent ($A$):} Acts as an \textit{information maximizer} (reconnaissance). The agent does not intend to purchase at the displayed price $p_t$ unless an arbitrage condition is met. Instead, the agent generates interaction events (queries) to estimate the platform's pricing function $f(p)$. The agent's reward function $R_A$ is defined by Information Gain:
\begin{equation}
R_A(p_t) = H(\mathcal{P}) - H(\mathcal{P} \mid p_t) - c_{query}
\end{equation}
where $H(\mathcal{P})$ is the entropy of the agent's belief regarding the price distribution, and $c_{query}$ is the marginal cost of interaction (assumed $\approx 0$ for LLMs).
\end{itemize}
\subsection{The Demand Contamination Model}
% MAYBE alpha has to be \lambda which we also need to formally define still
The core difficulty in this setting is that the platform observes only the aggregate interaction volume $\hat{q}_t$, which is a contaminated signal. Let $\alpha_t \in [0, 1]$ represent the proportion of traffic generated by agents at time $t$. The observed signal is:
\begin{equation}
\hat{q}_t(p_t) = (1 - \alpha_t) \cdot q_H(p_t) + \alpha_t \cdot q_A(p_t) + \epsilon_t
\end{equation}
where:
\begin{itemize}
\item $q_H(p_t)$ is the \textit{true signal} (conversion intent).
\item $q_A(p_t)$ is the \textit{adversarial noise} (reconnaissance queries).
\item $\epsilon_t$ is random market noise.
\end{itemize}
Crucially, $q_A(p_t)$ is often inversely correlated with $q_H(p_t)$ in terms of utility; agents may flood the system with queries during high-volatility periods to map price boundaries, artificially inflating $\hat{q}_t$ without converting.
\subsection{The Optimization Objective: Robust Revenue}
Standard dynamic pricing algorithms (e.g., Thompson Sampling or UCB) assume $\alpha_t = 0$, estimating demand $\hat{D}(p) \approx \mathbb{E}[\hat{q} | p]$. In the presence of agents ($\alpha_t > 0$), this estimator becomes biased, leading to the \textit{Cost of Information} (COI) defined in Section 3.2.
We propose a robust optimization objective. The platform seeks a pricing policy $\pi^*$ that maximizes worst-case revenue over a statistically plausible set of contamination rates $\alpha$:
\begin{equation}
\pi^* = \argmax_{\pi} \sum_{t=1}^T \mathbb{E}_{s_t} \left[ \min_{\alpha} \left( p_t \cdot \hat{q}_t(p_t | \theta=H) \right) - \lambda \cdot \mathcal{L}_{detect}(\hat{q}_t) \right]
\end{equation}
Here:
\begin{itemize}
\item The first term, $p_t \cdot \hat{q}_t(p_t | \theta=H)$, represents the revenue generated strictly from the estimated human segment.
\item $\mathcal{L}_{detect}$ is a penalty term for failing to separate distributions (the cost of confusion).
\item $\lambda$ is a hyperparameter balancing revenue exploitation vs. robust detection.
\end{itemize}
This formulation effectively transforms the pricing problem into a \textit{Distributionally Robust Optimization (DRO)} problem, where the learner must guard against adversarial perturbations (Agent traffic) in the observed demand distribution.

BIN
paper/src/graphics/gcp.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 19 KiB

BIN
paper/src/graphics/gcp.webp Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 8.3 KiB