diff --git a/paper/src/chapters/03-methodology.tex b/paper/src/chapters/03-methodology.tex
index 182e9cf..604edfa 100644
--- a/paper/src/chapters/03-methodology.tex
+++ b/paper/src/chapters/03-methodology.tex
@@ -1,97 +1,94 @@
 \section{Methodology}
 
+This section details the theoretical and practical framework developed to address dynamic pricing under the influence of non-human actors. We begin by formalizing the problem environment and the nature of the actors. We then derive the \textit{Cost of Information} (COI) theorem, proving the erosion of pricing power in the limit of agent saturation. Following this, we outline our generative contamination strategy using GOFAI-driven separability and transition probability learning. Finally, we formulate the robust control problem as a Stackelberg game solved via Distributionally Robust Reinforcement Learning (DR-RL) with constructed ambiguity sets.
+
 \subsection{Problem Formalization}
 
-In a commercial setting we can collect behavioral data on any actors interactions within a platform we have control over. This collection is done through sessions such each session belongs to an actor class $Y_s \in \{H,A\}$ with randomized assignment. This lets us build a trajectory $\tau_s$ of observable interaction events $\tau_s=(e_{s,1},\ldots,e_{s,L_s})$ where each event is defined as $e_{s,k} = (a_{s,k},i_{s,k},t_{s,k})$. We additionally define the rest of the components in each event accordingly:
+We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $Y_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.
+
+Each session produces a trajectory of observable events $\tau_s = (e_{s,1}, \ldots, e_{s,L_s})$. An event $e_{s,k}$ is a tuple defined as:
+\begin{equation}
+    e_{s,k} = (a_{s,k}, i_{s,k}, t_{s,k})
+\end{equation}
+where:
 \begin{itemize}
-\item $a_{s,k} \in \mathcal{A}$ where $\mathcal{A} = \{\text{page\_view}, \text{view\_item\_page}, \text{add\_item}\}$. % TODO: translate all from /home/velocitatem/Documents/Projects/PHANTOM/web/src/lib/events.ts into this latex
-\item $i_{s,k} \in \{1, \ldots, N\}$ which is the product association per-event (if applicable).
-\item $t_{s,k}$ which is the timestamp mapped to the session.
+    \item $a_{s,k} \in \mathcal{A}$ is the action taken (e.g., \texttt{view\_item}, \texttt{add\_to\_cart}).
+    \item $i_{s,k} \in \{1, \ldots, N\}$ is the target item index.
+    \item $t_{s,k} \in \mathbb{R}_+$ is the continuous timestamp.
 \end{itemize}
 
-What the platform observes is the interaction logs $\tau_s$, price query logs and purchase signals. It is important to note that our pricing pipeline works not directly with observed true human demand, but rather a behavioral proxy which is a composite of $q_H+q_A$.
-
-Each interaction $i$ gives us some information about the willingness to pay ($v$) of a given customer, which we can try to estimate and measure against the true baseline.
-
+The platform does not directly observe the true underlying demand function $d(p)$. Instead, it observes a behavioral proxy $\hat{q}_t$, which is a composite signal derived from the mixture of actor types. We define the demand proxy for product $i$ at epoch $t$ as a weighted aggregation of events:
 \begin{equation}
-I(\tau) = \mathbb{E}[v \vert \tau] - \mathbb{E}[v]
+    \hat{q}_{t,i} = \sum_{s \in \mathcal{S}_t} \sum_{k=1}^{L_s} \omega(a_{s,k}) \cdot \mathbb{1}[i_{s,k} = i]
 \end{equation}
+where $\mathcal{S}_t \subseteq \mathcal{S}$ is the set of sessions observed during epoch $t$ and $\omega: \mathcal{A} \to \mathbb{R}_+$ assigns each action a weight reflecting how strongly it signals willingness to pay.
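+
+As a small illustration with hypothetical weights (these values are assumptions chosen for the example, not calibrated quantities), let $\omega(\texttt{view\_item}) = 1$ and $\omega(\texttt{add\_to\_cart}) = 3$. Suppose epoch $t$ contains two sessions: the first views item $i$ twice and adds it to the cart once, and the second views item $i$ once and a different item once. Then
+\begin{equation}
+    \hat{q}_{t,i} = \underbrace{(1 + 1 + 3)}_{\text{session 1}} + \underbrace{1}_{\text{session 2}} = 6,
+\end{equation}
+since the second session's event on the other item is excluded by the indicator.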
 
-This lets us formalize the quality of our proxy $\hat{v}$ about the true $v$ from observing $\tau$ from any session $s$
-
-\subsubsection{Proxy Definition for Demand Estimation}
-Our proxy estimator is a critical component which has direct impact all downstream tasks, we start with a mapping of weights $\omega: \mathcal{A} \to \mathbb{R}_+$ where for an epoch $t$ and product $i$ the observed demand proxy of a session $s$ looks like:
-
+\subsubsection{Actor Types and Demand Curves}
+We formalize the heterogeneity of actors by introducing a type space $\Theta$. An actor of class $Y_s$ is further parameterized by a type $\theta \in \Theta$ drawn from a class-specific distribution, $\theta \sim \mathcal{D}_{Y_s}$. The type determines the actor's demand response function $d(p; \theta)$, so sampling $\theta$ amounts to sampling a demand curve. The total observed demand is a stochastic process governed by the mixture:
 \begin{equation}
-\hat{q}_{t,i} = \sum_{e_{s,k}\in t} \omega(a_{s,k}) \cdot \mathbf{1} [i_{s,k}=i]
-\end{equation}
-
-
-
-\subsubsection{Game Theoretic Approach: A Stackelberg Game}
-
-What we define in this game is the interaction between the pricing system and non-human actors in a Leader-Follower dynamic with partial observability. This lets us capture the nature of the problem in a hierarchical manner wit the platform being the Leader and the Actor is the follower, where both the Humans and Agents observe the prices set by the platforms policy and react strategically.
-
-
-
-Putting it all together for formalization, we have a complete mapping of our pipeline
-
-\begin{equation}
-  \tau \to x_s \to \hat{\pi} \to \tilde{q_t} \to p_{t+1} \\
-  p_{t+i}(\tau) = \hat{\pi}(x_s) \\
-  % explixitly fully develop an expansion of showing the mappin from p to tau and how that carries all information and from that we can identify where to intercept with our treatments.
+    Q_t(p) = (1-\alpha) \cdot \mathbb{E}_{\theta \sim \mathcal{D}_H}[d(p; \theta)] + \alpha \cdot \mathbb{E}_{\theta \sim \mathcal{D}_A}[d(p; \theta)] + \epsilon_t
 \end{equation}
+where $\alpha \in [0, 1]$ is the contamination parameter (the proportion of traffic generated by agents) and $\epsilon_t$ is non-stationary market noise at epoch $t$.
 
 
 
 
-\subsection{Cost of Information Framework}
+\subsection{Cost of Information (COI) Framework}
 
+The \textit{Cost of Information} (COI) represents the markup a pricing policy $\pi$ attempts to extract from the market by leveraging demand signals. We define COI as the expected premium over the minimum viable price $\underline{p}$ (or marginal cost).
 
-The Cost of Information proposed in our research serves as proxy to understand and represent the complex interaction patterns between humans and agents. It is the expected markup a platform applies to a product from derived demand signals.
-
+\begin{definition}[Cost of Information]
+Let $\pi(\tau)$ be a pricing policy mapping interaction histories to prices. The COI is defined as:
 \begin{align}
-COI &= \rho - p_\text{min} \\
-&= \mathbb{E}[P(\tau)] - p_\text{min} \\
-&= \mathbb{E}_{p\sim\pi(\tau)}[p] - \min_{\tau^\prime\in\boldsymbol{\tau}}{\mathbb{E}_{p\sim\pi(\tau^\prime)}[p]}
+    \text{COI} &= \mathbb{E}[P] - \underline{p} \\
+               &= \int_{\underline{p}}^{\bar{p}} (1 - F_\pi(p)) \, dp
 \end{align}
+where $F_\pi(p)$ is the cumulative distribution function of the prices generated by $\pi$ under standard operating conditions, with support $[\underline{p}, \bar{p}]$, and $\bar{p}$ is the maximum admissible price.
+\end{definition}
 
-Where the $p_0$ vector is both the initial state of the system and the base price for each product. We also define a pricing method at any time $t$ as $p_t \in \mathbb{R}_+^N$, satisfying a discrete cap $\{p \in \mathbb{R}_+^N \mid \underline{p} \leq p \leq \overline{p}\}$ which act as our business constraints, limiting prices to the range of $(\underline{p}, \overline{p})$. We treat $p_t$ as the price vector shown to an actor both experimentally and in-simulation.
+We now formally demonstrate that standard dynamic pricing mechanisms are not incentive-compatible with high-frequency agentic traffic. As the number $N$ of independent competitive agents querying the system grows (in this subsection $N$ counts agents, not catalogue items), the platform's ability to sustain a COI vanishes.
 
-Per product we follow a cumulative distrubtion $F(p)$ which we can leverage to prove the existence of COI under certain conditions of agent contamination. We state that:
-% Unify notation of underline p and p_min which now means same things
-\begin{align}
-\int_{\underline{p}}^{\rho} (\rho - p) \, dF(p) &= c \\
-\int_{\underline{p}}^{\underline{p} + \text{COI}} F(p) \, dp &= c \\
-c &> 0 \\
-\therefore p^* = \rho \wedge \rho &> p_\text{min}
-\end{align}
-
-% here we can also look at mvt to prove that the if we fix the c for the agent's cost we show that you cannot have a COI which has its rho equal to the p min which makes the integral vanish.
-c is the search cost per query incurred by the buyer. and $\rho$ is the users reservation price.
-We then prove that:
-
-\begin{theorem}
-\begin{align}
-\lim_{N \to \infty} \text{COI} &= 0 \\
-p_{(1)} &= \min (p_1, p_2, \ldots, p_n) \\
-P(p_{(1)} > p) &= [1-F(p)]^n \\
-\underline{F}(p) &= P(p_{(1)} \leq p) \\
-&= 1 - P(p_{(1)} > p) \\
-&= 1 - [1 - F(p)] \\
-\text{survival functions...} \\
-\mathbb{E}[\underline{F}(p)] &= \underline{p} + \int_{\underline{p}}^{\overline{p}} [1 - F(p)]^n \, dp \\
-\text{COI}: \mathbb{E}[\underline{F}(p)] - \underline{p} \\
-\cdots \\ 
-\int_{\underline{p}}^{\overline{p}} 0 \, dp &= 0 \\
-\end{align}
-% Since F(p) is a CDF, for any p>pmin​, F(p)>0, implying 0≤1−F(p)<1. By the properties of limits, as n→∞, [1−F(p)]n→0 for all p>pmin​.
-%Applying the Lebesgue Dominated Convergence Theorem (since the integrand is bounded by 1 on a finite interval):
-
-From this we can understand that as the number of independent agentic interactions grows to infinity, the cost of information convergest to 0, circumventing the platforms policy $\pi$ and effectively paying only $\underline{p}$. Thus, standard dynamic pricing is not incentive-compatible with agentic traffic.
-The platform's profit margin from "information rent" is entirely eroded.
+\begin{theorem}[COI Erosion in the Limit]
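+Let $N$ be the number of independent, utility-maximizing agents querying the platform, and let $p_{(1)}$ be the first order statistic (minimum) of the prices offered to these agents. Assume the prices are drawn i.i.d. from the policy's distribution $F_\pi$ (written $F$ below) and that $F(t) > 0$ for every $t > \underline{p}$. Then the effective Cost of Information, $\mathbb{E}[p_{(1)}] - \underline{p}$, converges to $0$ as $N \to \infty$.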
 \end{theorem}
 
+\begin{proof}
+Let $p_1, \ldots, p_N$ be independent and identically distributed (i.i.d.) price samples drawn from the policy's distribution $F(p)$ with support $[\underline{p}, \bar{p}]$. The realizable price for an optimal searching agent is the first order statistic $p_{(1)} = \min(p_1, \ldots, p_N)$.
+
+The survival function (or reliability function) of the minimum price is given by:
+\begin{equation}
+    S_{p_{(1)}}(t) = P(p_{(1)} > t) = [1 - F(t)]^N
+\end{equation}
+
+To determine the expected value $\mathbb{E}[p_{(1)}]$, we recall that for any random variable $X$ supported on a bounded interval $[A, B]$, the expectation can be expressed as the lower bound plus the integral of the survival function:
+\begin{equation}
+    \mathbb{E}[X] = A + \int_{A}^{B} P(X > t) \, dt
+\end{equation}
+
+Applying this to our pricing statistic where the lower bound is $\underline{p}$:
+\begin{align}
+    \mathbb{E}[p_{(1)}] &= \underline{p} + \int_{\underline{p}}^{\bar{p}} P(p_{(1)} > t) \, dt \\
+    &= \underline{p} + \int_{\underline{p}}^{\bar{p}} [1 - F(t)]^N \, dt
+\end{align}
+
+Because $\underline{p}$ is the lower endpoint of the support of $F$, we have $F(t) > 0$ for every $t > \underline{p}$, and hence $0 \le 1 - F(t) < 1$. Consequently, $[1 - F(t)]^N \to 0$ pointwise as $N \to \infty$ for all $t \in (\underline{p}, \bar{p}]$.
+
+Applying the Lebesgue Dominated Convergence Theorem (noting that the integrand is bounded by 1 on the finite interval $[\underline{p}, \bar{p}]$):
+\begin{equation}
+    \lim_{N \to \infty} \int_{\underline{p}}^{\bar{p}} [1 - F(t)]^N \, dt = \int_{\underline{p}}^{\bar{p}} 0 \, dt = 0
+\end{equation}
+
+Substituting this back into the expression for COI:
+\begin{align}
+    \lim_{N \to \infty} \text{COI} &= \lim_{N \to \infty} \left( \mathbb{E}[p_{(1)}] - \underline{p} \right) \\
+    &= \lim_{N \to \infty} \int_{\underline{p}}^{\bar{p}} [1 - F(t)]^N \, dt \\
+    &= 0
+\end{align}
+\end{proof}
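+
+As a sanity check on the rate of convergence, consider the special case (an illustrative assumption, not a property of any particular policy) in which prices are uniform on $[\underline{p}, \bar{p}]$, so that $1 - F(t) = (\bar{p} - t)/(\bar{p} - \underline{p})$. Then
+\begin{equation}
+    \mathbb{E}[p_{(1)}] - \underline{p} = \int_{\underline{p}}^{\bar{p}} \left( \frac{\bar{p} - t}{\bar{p} - \underline{p}} \right)^{N} dt = \frac{\bar{p} - \underline{p}}{N + 1},
+\end{equation}
+so in this case the markup decays at rate $1/N$: ten independent queries already reduce it to one eleventh of the price range.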
+
+
+This result shows that, when agents can draw many independent price samples, a standard pricing policy $\pi$ cannot sustain a markup above $\underline{p}$ in the limit of large-scale agentic search, which motivates a robust counter-mechanism.
+
 % The DRO objective creates a lower bound on COI extraction, effectively guaranteeing a minimum margin even in the presence of adversarial agents. we need to prove this and demonstrate that in a theorem.
 
 
@@ -150,78 +147,37 @@ Our approach can be well summarized by a three-stage division, first we intend t
 Study methodology and approach. Data acquisition strategy. Defined objectives and success criteria. Observable metrics and KPIs.
 
 
-\subsection{Discriminative Model Design}
+\subsection{Generative Contamination and Separability}
 
-With data collected from our platform we have a series of observed interactions, with each interaction having a mapping to a specific \texttt{sessionId} and \texttt{experimentId} which allows us to join all components of the experiment design into an information rich feature vector for each session in our observed data. To develop more explicitly the demand estimation, we propose a decomposition of the proxy $\hat{q}_t$ into two latent components:
+To develop a robust pricing agent, we require a simulation environment capable of generating realistic, contaminated interaction data. We achieve this by learning from our Phantom platform data using a two-stage approach.
 
+\subsubsection{GOFAI-Based Separability}
+We employ Good Old-Fashioned AI (GOFAI) heuristics to generate initial weak labels for separability. We define a set of rule-based predicates $\phi_j$, each mapping a trajectory $\tau$ to $\{0, 1\}$ (e.g., inter-arrival time consistency, DOM-traversal linearity), and use them to partition the dataset $\mathcal{D}$ into high-confidence sets $\mathcal{D}_H$ and $\mathcal{D}_A$. Here $\mathcal{D}_H$ and $\mathcal{D}_A$ denote sets of trajectories, not the type distributions introduced earlier.
+
+\subsubsection{Transition Probability Estimation}
+For both subsets, we model the session dynamics as a first-order Markov chain over interaction states and estimate the transition kernel $\mathcal{T}$. Writing $x$ and $x'$ for states in a finite state space $\mathcal{X}$ (to avoid a clash with the session index $s$), the probability of transitioning to state $x'$ given state $x$ is estimated via maximum likelihood:
 \begin{equation}
+    \hat{P}(x' \mid x) = \frac{N(x, x')}{\sum_{k \in \mathcal{X}} N(x, k)}
 \end{equation}
+where $N(x, x')$ is the count of observed transitions from $x$ to $x'$. This allows us to construct a \textit{Contamination Generator} $\mathcal{G}(\alpha)$. Given a clean trajectory dataset, $\mathcal{G}$ injects synthetic agent trajectories sampled from the transition matrix $\hat{P}_A$ learned on $\mathcal{D}_A$ until the effective mixing ratio reaches $\alpha$.
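+
+To make the injection target concrete, assume the mixing ratio is measured at the trajectory level (an assumption; an event-level ratio would weight by trajectory length). If the clean dataset holds $n_H$ human trajectories and $\mathcal{G}(\alpha)$ adds $n_A$ synthetic agent trajectories, then for $\alpha < 1$
+\begin{equation}
+    \alpha = \frac{n_A}{n_H + n_A} \quad \Longrightarrow \quad n_A = \frac{\alpha}{1 - \alpha} \, n_H .
+\end{equation}
+For example, $n_H = 900$ and $\alpha = 0.1$ give $n_A = 100$.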
 
-Additionally we take into account some degree of random market noise $\epsilon_t$. We can formally define $\hat{q}_t^H$ to be the true signal with conversion intent and the agent component is adversarial noise.
-
-
-
-\subsubsection{Feature Development}
-The schema of our features is developed in \cref{tab:features} which shows the different types of features we produce in order to train our model to understand the origin of the traffic and to which distribution it belongs to. The features can be computed on a rolling basis of each session, for online deployment, however for our purposes it is currently computed uniquely for each \texttt{sessionId} in our historical data.
-
-\input{chapters/feature_table.tex}
-
-The problem we have is constrained by two frontiers, one is extreme (paranoid) detection which includes methods such as CAPTCHA or more mechanical solutions to traffic blocking and detection. % TODO: talk about more methodologies here
-On the other hand, a more lax system without detection (myopic) defines the lower bound of performance for our solution. Our goal is to achieve a Pareto optimal detection system which creates a balance across the dimension of performance as well as a more subjective but none the less important user experience index. To measure our approach to this optimal solution we define a strong evaluation platform to compare our solutions to this learning task. Following the no free lunch theorem we must be prolific in our approach to finding the correct method.
-
-
-\subsection{Dynamic Pricing Algorithm Analysis}
-
-From the perspective of agent contamination, which we define by $\alpha \ in [0,1]$, representing the proportion of traffic generated by agents, the observed signal can be parameterized by this:
+\subsection{Distributionally Robust Reinforcement Learning (DR-RL)}
 
+We formulate the pricing problem as a Stackelberg game in which the Platform (Leader) sets prices $p_t$ and the Aggregate Demand (Follower) responds. Because the mixing parameter $\alpha$ and the shift in the demand distribution are non-stationary and unknown in online settings, an additive error term $\epsilon_t$ alone cannot capture this uncertainty. Instead, we adopt a Distributionally Robust Optimization (DRO) objective.
 
+\subsubsection{Ambiguity Set Construction}
+We define an ambiguity set $\mathcal{U}_\epsilon(\hat{P}_N)$ centered on our empirical reference distribution $\hat{P}_N$ (derived from the generator $\mathcal{G}$), with a radius $\epsilon \ge 0$ that is distinct from the noise term $\epsilon_t$. We use the Wasserstein distance $W_p$ to define the set of plausible demand distributions the pricing policy might face:
 \begin{equation}
-\hat{q}_t = (1-\alpha) \cdot \hat{q}_t^H + \alpha \cdot \hat{q}_t^A + \epsilon_t
+    \mathcal{U}_\epsilon(\hat{P}_N) = \left\{ Q \in \mathcal{P}(\Xi) : W_p(Q, \hat{P}_N) \le \epsilon \right\}
 \end{equation}
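+Here $\mathcal{P}(\Xi)$ is the set of probability distributions over the demand outcome space $\Xi$. The set contains every distribution that is close to our reference distribution in Wasserstein distance, so it admits adversarial shifts away from the training data (e.g., sudden bot spikes).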
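+
+For reference, the standard order-$p$ Wasserstein distance can be written as follows, assuming $\Xi$ is equipped with a norm $\lVert \cdot \rVert$ (the choice of norm is an assumption here) and writing $\Gamma(Q, \hat{P}_N)$ for the set of couplings with marginals $Q$ and $\hat{P}_N$:
+\begin{equation}
+    W_p(Q, \hat{P}_N) = \left( \inf_{\gamma \in \Gamma(Q, \hat{P}_N)} \int_{\Xi \times \Xi} \lVert \xi - \xi' \rVert^p \, d\gamma(\xi, \xi') \right)^{1/p}
+\end{equation}
+Note that the order $p \ge 1$ of the distance is unrelated to the price $p$.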
 
-The default assumption of a dynamic pricing algorithm assumes $\alpha = 0$ estimating demand $\hat{D}(p) \approx \mathbb{E}[\hat{q} \vert p]$, whereas in the presence of agents our alpha is a non-zero component. In this case the estimator becomes biased, leading to the emergence of our defined COI.
-
-Deep dive into how the algorithm works, different kinds and justification for chosen approaches + agent impact modeling and quantification.
-
-\subsection{Reinforcement Learning Formulation}
-
-We define our surrogate commercial environment within which we can accurately control for all the variables such as the true demand, providing a clear transparency of the entire system. We start with a product catalogue of size $N$ with random supply initialization per-product. At every step the commercial simulation receives a price vector $p$ according to which we simulate a set of interactions $\tau^\prime$ with a certain proportion $\alpha$ of agents contributing interactions. The interactions serve as a proxy to estimating the true demand $q(p)$ which is composed of two separate demand generators $q_A(p)$ and $q_H(p)$.
-On top of this our gym environment has a built demand estimator callback which is defined individually by each pricing engine. This engine is constructed to interact with the gym environment with the gym environment at each step running a cycle via the commercial environment, creating an observation of all the interactions $\tau^\prime$ and a baseline vector which tells us the ground truth of demand, sales statistic and revenue. The engine is then responsible for learning the pricing policy providing a pricing vector $p_{t+1}$ motivated by a per-episode summary reward composed by.
-
-
-To bridge the experimentally collected data into our simulation we start with turning our interaction data into transition generators which learn the transition probabilities between states (actions performed on the platform) as a markovian decision process, which we can then sample to generate our interaction data underlying the simulation. To account for prices we scale these transition probabilities by a willingness to pay vector which will give us the purchase probability per-product.
-
-We start by defining a willingness to pay $v_{i,j}$ for some product $i$ and theoretical actor $j$ which is the maximum price $p_i$ that the customer would be willing to pay, since we do not have customer specific granularity we sample from a distribution $F_v(x)$ which gives us the proportion of customers willing to pay at most the price $x$, defined by $F_v(x) = P(v \le x)$ which we can use to model the probability of a sale $1 - F_v(x)$ in the base case of 1 product, we can scale this to a full vector which encodes the probabilities of sale which we should use as our baseline for demand which affects the generated interaction data $\tau^\prime$. \cite{Roughgarden2013}
-
-We could then use this to compare the prior and posterior demand to have the delta between the ground truth and estimate where the $p \cdot (1 - F_v(p))$ is equal to the expected revenue and if we observer per-product revenue we can base our revenue loss component of the regret.
-
-
-
+\subsubsection{The Max-Min Objective}
+The robust policy $\pi^*$ is obtained by solving the maximin problem:
 \begin{equation}
-R = \text{revenue} - \text{COI} - \text{UX friction index}
+    \pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}(p) \right]
 \end{equation}
+where $p$ denotes the price chosen by $\pi$, $R(p, d)$ is the revenue function, and $\lambda \ge 0$ weights the penalty for information leakage (COI).
 
-
-As part of our reward engineering we want to take into account the cost of information in our reward with a weight. As seen in most other dynamic pricing systems, regret is most often use to guide the policy development, which in our case serves very well in comparing the ground truth and estimated demand. For us the regret is the revenue loss compared to the oracle which has perfect information access.
-
-\begin{equation}
- \text{Regret}(\pi) = TR(\pi_\text{oracle}) - TR(\pi)
-\end{equation}
-% TR= total revenue
-% Regret is the revenue loss compared to oracle with perfect information:
-
-We also need a regert bound
-
-
-Our pricing engine can be modeled by the mapping:
-
-\begin{equation}
-\pi : \mathbb{R}^N_+ \times \mathcal{H}_t \to \mathbb{R}_+^N
-\end{equation}
-
-where $\mathcal{H}_t$ is the history and state we keep track of, allowing us to define a progression of prices as $p_{t+1} \gets \pi(\hat{q}_t,\mathcal{H}_t)$. With this we can establish that $\tau$ influences $p_{t+1}$ through $\hat{q}_t$
-
-
-How do we define the state space, action space and reward function breakdown and algorithm benchmarking.
-POSSIBLY: Expand into full subsections: 3.6.1 (State-Action Space), 3.6.2 (Reward Design), 3.6.3 (Benchmarking)
+\subsubsection{Actor Implementation}
+In our simulation, the ``Follower'' is implemented as a set of Actors. Each Actor is initialized with a type $\theta$ drawn from the latent type distribution, which fixes its demand curve $d(p; \theta)$. This formalization is intended to keep our DR-RL agent from overfitting to a single deterministic demand function, so that it instead learns a policy robust to the distributional uncertainty defined by $\mathcal{U}_\epsilon$.