\section{Problem Formulation: A Stackelberg Game Approach}
\label{sec:math_formulation}

We formalize the interaction between the dynamic pricing system and non-human actors as a \textit{Stackelberg Game} (Leader-Follower) with incomplete information. This framework captures the hierarchical nature of the problem: the Platform (Leader) sets a pricing policy, and the Actors (Followers), both Humans and Agents, observe these prices and react strategically.

\subsection{The Players and Objectives}

Let $t \in \{1, \dots, T\}$ denote discrete time steps. At each step, the interaction involves the following players:

\paragraph{1. The Leader (The Platform)}
The e-commerce platform acts as the leader, choosing a pricing policy $\pi$ to maximize total expected revenue. At time $t$, given a state $s_t \in \mathcal{S}$ (representing inventory, time of day, and historical interactions), the platform sets a price $p_t \in [p_{\min}, p_{\max}]$. The platform's goal is to maximize the cumulative revenue from genuine human transactions while mitigating the distortion caused by agent interactions.

\paragraph{2. The Followers (The Demand Mixture)}
The observed demand is not a monolithic signal but a mixture of two distinct populations with divergent objective functions. Let $u$ denote an incoming actor. The type of the actor, $\theta \in \{H, A\}$, is a latent variable, where $H$ denotes a Human and $A$ denotes an Agent.
\begin{itemize}
    \item \textbf{The Human ($H$):} Acts as a \textit{myopic utility maximizer}. A human $i$ has a private valuation $v_i$ for the product and makes a purchase decision $d_i \in \{0, 1\}$ based on consumer surplus:
    \begin{equation}
        d_i(p_t) = \mathbb{I}(v_i - p_t \geq 0)
    \end{equation}
    where $\mathbb{I}(\cdot)$ is the indicator function. The aggregate human demand $q_H(p_t)$ follows a standard downward-sloping demand curve $D(p_t)$.
    \item \textbf{The Agent ($A$):} Acts as an \textit{information maximizer} (reconnaissance).
    The agent does not intend to purchase at the displayed price $p_t$ unless an arbitrage condition is met. Instead, the agent generates interaction events (queries) to estimate the platform's pricing policy $\pi$. The agent's reward $R_A$ is the information gain from observing $p_t$, net of the query cost:
    \begin{equation}
        R_A(p_t) = H(\mathcal{P}) - H(\mathcal{P} \mid p_t) - c_{\mathrm{query}}
    \end{equation}
    where $H(\mathcal{P})$ is the entropy of the agent's belief regarding the price distribution $\mathcal{P}$, $H(\mathcal{P} \mid p_t)$ is the entropy of that belief after observing $p_t$ (here $H(\cdot)$ denotes entropy, not the type label $H$), and $c_{\mathrm{query}}$ is the marginal cost of an interaction (assumed $\approx 0$ for LLM-based agents).
\end{itemize}

\subsection{The Demand Contamination Model}
% MAYBE alpha has to be \lambda which we also need to formally define still

The core difficulty in this setting is that the platform observes only the aggregate interaction volume $\hat{q}_t$, which is a contaminated signal. Let $\alpha_t \in [0, 1]$ represent the proportion of traffic generated by agents at time $t$. The observed signal is:
\begin{equation}
    \hat{q}_t(p_t) = (1 - \alpha_t) \cdot q_H(p_t) + \alpha_t \cdot q_A(p_t) + \epsilon_t
\end{equation}
where:
\begin{itemize}
    \item $q_H(p_t)$ is the \textit{true signal} (conversion intent).
    \item $q_A(p_t)$ is the \textit{adversarial noise} (reconnaissance queries).
    \item $\epsilon_t$ is random market noise.
\end{itemize}
Crucially, $q_A(p_t)$ carries no purchase intent and can move against $q_H(p_t)$: agents may flood the system with queries during high-volatility periods to map price boundaries, artificially inflating $\hat{q}_t$ without converting.

\subsection{The Optimization Objective: Robust Revenue}

Standard dynamic pricing algorithms (e.g., Thompson Sampling or UCB) implicitly assume $\alpha_t = 0$ and estimate demand as $\hat{D}(p) \approx \mathbb{E}[\hat{q} \mid p]$. In the presence of agents ($\alpha_t > 0$), this is a biased estimate of human demand $q_H(p)$, leading to the \textit{Cost of Information} (COI) defined in Section 3.2. We instead propose a robust optimization objective.
The platform seeks a pricing policy $\pi^*$ that maximizes worst-case revenue over a statistically plausible set of contamination rates, denoted here by $\mathcal{U} \subseteq [0, 1]$:
\begin{equation}
    \pi^* = \argmax_{\pi} \sum_{t=1}^T \mathbb{E}_{s_t} \left[ \min_{\alpha_t \in \mathcal{U}} \left( p_t \cdot \hat{q}_t(p_t \mid \theta=H) \right) - \lambda \cdot \mathcal{L}_{\mathrm{detect}}(\hat{q}_t) \right]
\end{equation}
Here:
\begin{itemize}
    \item The first term, $p_t \cdot \hat{q}_t(p_t \mid \theta=H)$, represents the revenue generated strictly from the estimated human segment.
    \item $\mathcal{L}_{\mathrm{detect}}$ is a penalty term for failing to separate the human and agent distributions (the cost of confusion).
    \item $\lambda \geq 0$ is a hyperparameter balancing revenue exploitation against robust detection.
\end{itemize}
This formulation casts the pricing problem as a \textit{Distributionally Robust Optimization (DRO)} problem, in which the learner must guard against adversarial perturbations (agent traffic) in the observed demand distribution.
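
As a worked illustration of why the naive estimator is biased under the contamination model, the following sketch relies on two assumptions that are not stated in the text: the market noise is conditionally zero-mean, $\mathbb{E}[\epsilon_t \mid p_t] = 0$, and $\alpha_t$, $q_H(p_t)$, and $q_A(p_t)$ are treated as fixed given $p_t$. Taking the expectation of the observed signal then gives
\begin{equation}
    \mathbb{E}[\hat{q}_t(p_t) \mid p_t] = (1 - \alpha_t) \, q_H(p_t) + \alpha_t \, q_A(p_t),
\end{equation}
so the bias relative to human demand is
\begin{equation}
    \mathbb{E}[\hat{q}_t(p_t) \mid p_t] - q_H(p_t) = \alpha_t \left( q_A(p_t) - q_H(p_t) \right).
\end{equation}
Under these assumptions the bias vanishes only if $\alpha_t = 0$ or $q_A(p_t) = q_H(p_t)$, and it is positive whenever agent query volume exceeds human demand at the posted price. For example, with the purely hypothetical values $\alpha_t = 0.2$, $q_H(p_t) = 100$, and $q_A(p_t) = 400$, the expected observed signal is $0.8 \cdot 100 + 0.2 \cdot 400 = 160$, an overestimate of human demand by $0.2 \cdot (400 - 100) = 60$. If $\alpha_t < 1$ and both $\alpha_t$ and $q_A(p_t)$ were known, inverting the first expression would recover $q_H(p_t) = \left( \mathbb{E}[\hat{q}_t(p_t) \mid p_t] - \alpha_t \, q_A(p_t) \right) / (1 - \alpha_t)$. In practice neither quantity is observed, which is what motivates guarding against a range of contamination rates rather than a point estimate.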