Paper first fillout (#39)

* initial environemnt definitions * high level defintion * formlating the reward simply * improved implementation * tailored docker compose image for secondary tenaordboard * preliminary desriptions and babble * details on formulation and defintion of agent and its loop * typos one * more grammar issues * fluidity improvements and refactors * more decluttering and dnoising * finalizing introduction review * some methodology * somehow this disappeared * bit more of this and that * methodology of how we do architectuer and online DP * fix: compilation * expanding on the taxonomy and economic references * authoer notes * acks + google GCP * making space w new format nada lit review * stronger lit review and more sources * forgot about tables and graphs * dedupe citations * adding cloudflare * fixing env vars * updating docs with url * upating embed * fixing the url * paper badge * formaliztaion of rewards and adding definitions * noisy formulations * connecting some more dots here * adding significant weight in prices * fixing error * fixing typos and consistency * extra math formulations and refferenceot DRO * fixing diagram of loops * github mindmap * fixing erro and thiknig about big picture * enhancing the website * goals methodology and gitignore * some more references and theory links * talking about some wtp * feature: added wordcounter * forcing latex builds and fixining the bib # * refactor: update Cost of Information equations and notation for clarity * some more math and refactors * refactor: unify notation and improve clarity in COI equations * refactor: generalize master function for demand estimation and pricing strategies * we dont like math but we have to do it :( * refactor: enhance Cost of Information framework with additional context and illustration * refactor: enhance literature review and methodology sections with economic theory insights and system architecture details * alining format to fit the rubric * refactoring bibliography * fix: align * 
mdp additionally * trying different title * adding balance figure * agentic givergence, finally * fix: figure fonts adjusted to match
2026-07-16 01:53:37 +00:00 · 2026-01-13 17:07:29 +01:00
parent 221e71a503
commit a9d73ccce5
24 changed files with 1656 additions and 107 deletions
--- a/paper/src/chapters/01-intro.tex
+++ b/paper/src/chapters/01-intro.tex
@@ -8,9 +8,50 @@

 \section{Introduction}

-Research Objectives and Contribution: What are we making, why and who should care?
+In this paper we explore, and propose defenses against, the presence of new commercial entities on digitally powered platforms, with the aim of preserving market equilibrium in the age of AI. This research makes the following contributions: (i) a definition and formalization of non-human transactors in e-commerce platforms; (ii) a testing ground for capturing the behavioral essence of these transactors across a large variety of digital systems; (iii) a discriminative model, built to prove separability, that serves as a strong learner for downstream mitigation of contamination by non-human entities; (iv) the translation of this learned separability into existing dynamic pricing machine learning loops; and (v) a high-level framework for estimating causal effects on KPIs and cost savings for the future of internet commerce in the presence of such non-human learners.
+
+This research effort spans several domains: behavioral economics, to understand the rationality of behavior as theorized by the concept of \textit{homo economicus}; agent-based modeling, to translate our learned separability into disjoint dynamic pricing systems; reinforcement learning, which serves as the state of the art for price learners; and dynamic pricing and market equilibrium theory, to understand the risk of supra-competitive pricing when adversarial pricing systems drive the market out of equilibrium.

 \subsection{Motivation and Market Context}
-Current market dynamics and trends of dynamic pricing and AI agents. Future projections of AI agents. Key stakeholders that are discussing this and reporting on it (Thales). Who is most affected
+
+The current innovation boom in generative artificial intelligence and its applications to knowledge-based work tasks has brought many competing technologies for browser-use automation, with benchmarks and evaluations \cite{xia_evaluation-driven_2025} motivating the development of capabilities focused on commercial research, understanding, and transaction execution \cite{xie_osworld_nodate}. The ``AI Agent'' market is forecast to grow from around USD 5--8 billion in 2025 to USD 42--52 billion by 2030. This surge reflects adoption in e-commerce, customer service, and enterprise automation, where agents handle interactions previously handled by humans. It raises the question of how these systems should be designed for future robustness, as well as how to maintain a competitive edge in the analytical components of e-commerce platforms \cite{markntel_advisors_global_2025}.
+
+The key stakeholders affected by the threat of increasing agent-driven traffic include online businesses and platform operators (especially in bot-heavy sectors like retail, travel, and financial services), their security, fraud, and engineering teams, end users whose accounts and data are exposed and whose experience degrades, regulators and legal stakeholders responding to breaches and fraud, and the attackers or bot operators driving the automation \cite{imperva_rapid_2025}.
+
+The industry has already seen legal action in cases like Amazon against Perplexity \cite{ghaffary_amazon_nodate}, stemming from the difficulty of identifying traffic from hybrid systems like the Comet browser. This paper explores such systems to better understand what the interaction data looks like and what it means for dynamic pricing and recommendation systems downstream. The observed impact indicates a need to prevent secondary negative effects on the ``legacy'' systems that power modern revenue sources for many companies. Dynamic pricing algorithms rely on directly translating demand features $q$ into new price assignments $\hat{p}$ across a catalogue of $N$ products. This opens opportunities to design a \textit{tabula rasa} of digital market mechanisms that will shape the future of commerce in the age of artificial intelligence.
+
 \subsection{Solution Space Overview}
-Different approaches and perspectives, here also add a preview of what will be developed and explored in the lit review.
+Dynamic pricing systems, as presented in \cite{mueller_low-rank_2019}, often deal with sparse low-rank data of demand signals which, combined with contamination from agents, creates complex interactions that impact pricing. To further complicate the problem, certain commercial settings such as the one presented in \cite{amjad_censored_2017} must address the true demand of products under censored observations. This provides a formulation for handling demand in our case with multiple kinds of commercial mediators: $\hat{q} \gets q_A + q_H$, where $q_A$ represents the distribution of demand generated by agentic mediators and $q_H$ represents that of true human demand. These are two distinct populations with divergent objective functions.
+
+We formally define interaction data as coming from some actor which can either be an agent ($A$) or human ($H$). For purposes of this research, an agent is an algorithmic loop with the ability to access a web platform and perform actions such as clicks, scrolls, and input field fills. The loop terminates when the internal large language model judges the provided task definition as complete. A detailed breakdown can be found in \cref{algagent-loop}.
+
+
+\begin{algorithm}[t]
+\DontPrintSemicolon
+
+\SetKwInOut{Input}{Input}
+\SetKwInOut{Output}{Output}
+
+\Input{Goal $G$, Platform URL $u$, LLM $\mathcal{M}$}
+\Output{Task completion result $r$}
+
+Initialize browser instance $\mathcal{B}$ with connection to $u$\;
+Construct prompt $\pi \gets \textsc{BuildPrompt}(G, u)$\;
+$\text{done} \gets \text{False}$\;
+
+\While{$\neg \text{done}$}{
+    Observe current page state $s_t$ from $\mathcal{B}$\;
+    Query $\mathcal{M}$ with $(\pi, s_t)$ to determine next action $a_t \in \{\text{click}, \text{scroll}, \text{fill}, \text{navigate}\}$\;
+    Execute $a_t$ on $\mathcal{B}$ to transition to state $s_{t+1}$\;
+    $\text{done} \gets \mathcal{M}.\textsc{JudgeCompletion}(G, s_{t+1})$\;
+}
+
+Extract final result $r$ from terminal state\;
+\Return{$r$}\;
+
+\caption{AI Agent's Interaction Loop}
+\label{algagent-loop}
+\end{algorithm}
+
+
+The previously described goal of separability allows us to formulate a task that takes raw interaction data from either actor and creates a composite demand estimate $\hat{q}$. We propose a robust optimization objective, defined in our methodology, that transforms the pricing problem into a form of Distributionally Robust Optimization \cite{kuhn_distributionally_2025} in which the learner must guard against adversarial contamination in the observed demand distributions. In this setting we must learn to make decisions that perform well not under a single estimated probability distribution but under an ambiguity set of distributions about which we have limited information. In our case, as stated above, the demand is a mixture of distributions whose mixing parameter is unknown and non-stationary.
--- a/paper/src/chapters/02-literature-review.tex
+++ b/paper/src/chapters/02-literature-review.tex
@@ -1,15 +1,44 @@
 \section{Literature Review}

-\subsection{Foundational Concepts}
+To cover all facets of this work, we start by exploring the nature of agents, agentic computer use, and web automation, and complement that with economic reasoning and strategic interaction. The final area to cover is data-driven dynamic pricing under uncertainty. The key technical risk is not ``agents buying things'' per se, but agents shaping the behavioral and demand signals that downstream pricing systems consume and depend on. The introduction of these mediating actors into economic systems further creates a threat of false-name bidding \cite{yokoo_effect_2004}, which prior research has explored in a trading context. Other research on pseudonyms in dynamic systems describes whitewashing, by which AI agents can evade defensive mechanisms through re-entry under different identities \cite{feldman_free-riding_2004}. Dynamic pricing assumes demand proxies are behaviorally meaningful, while bot detection aims at security and access control. The missing bridge is a principled framework for separating non-human reconnaissance from genuine human demand expression and integrating that separation into pricing heuristics without degrading legitimate user experience (tracked in our research by the user-experience index). This gap is what our contribution aims to address, particularly for the aforementioned stakeholder groups.
+
+\subsection{Agent Taxonomy and Definitions}
+
+An agent in the context of artificial intelligence is generally defined as anything that can reason and act upon observations of its environment (collected through some sensory inputs) and carry out actions through effectors. Moreover, a rational agent is an entity that is capable of perceiving the world around it and taking actions to advance specified goals. This definition by \cite{russell_artificial_nodate} is further developed in an economic context by \cite{parkes_economic_2015}, suggesting AI research attempts to construct a synthetic \textit{homo economicus}, which may also be termed \textit{machina economicus}.
+A specific class or taxon of this \textit{machina economicus}, the Large Language Model (LLM) agent, is defined as an autonomous system capable of achieving goals and adapting post-training, often without needing explicit code or fundamental model changes \cite{xia_evaluation-driven_2025}.
+
+We must, however, acknowledge that the current state of the art, as measured in the OSWorld simulations of \cite{xie_osworld_nodate}, remains limited: on multi-modal tasks across desktop and web interaction modes, the top-performing model achieves only 12.24\% success, whereas humans achieve a 72\% success rate. This weakness matters for this research because it clarifies the near-term threat model: practical exploitation does not require a fully competent ``computer assistant'', only enough automation to perform high-volume reconnaissance actions (search/filter/open product pages, probe availability/price boundaries) that can contaminate behavioral signals. As these capabilities grow, the threat to revenue management systems only becomes more serious.
+
+We model an agent session as producing events with lower in-session conversion relative to humans, an assumption we state as $P(\text{purchase} \mid A) \ll P(\text{purchase} \mid H)$, but with potentially higher volatility in $\hat{q}$, which we observe through the look-to-book metrics in our simulation.
+
+\subsection{Economic Agents: From Homo Economicus to Machina Economicus}
+
+Existing behavioral economic models tend to be criticized for the assumption of rational behavior, as embodied in the term \textit{homo economicus}. The definition of a \textit{machina economicus} by \cite{parkes_economic_2015} is appropriate for our case, particularly because these assumptions of rationality have been argued to be an adequate reference for AI research by \cite{varian_economic_1995}. For modeling this behavior, the trajectories of these agents can be formally defined as partially observable Markov decision processes \cite{xie_osworld_nodate}. Agents are, however, not to be confused with web bots, previously known as automated software applications or scrapers, which are set up to carry out specific tasks on the internet without a higher level of internal judgement \cite{imperva_rapid_2025}. In our research, we refer to this actor simply as an Agent belonging to the distribution $A$.
+
+This economic framing also helps separate two related but distinct phenomena: agents as buyers (changing market demand composition) and agents as information gatherers (changing the observed interactions used by pricing/recommendation systems). The thesis focuses on the second, where information acquisition strategically precedes purchase execution. We do not, however, dismiss the expectation stated by \cite{parkes_economic_2015} that existing economic systems serving humans will be populated by AIs across multiple channels and with various, possibly misaligned, goals.

-What is the taxonomy and definition of an agent and an actor in this case, a bit more about interaction models in sessions and about dynamic pricing algorithms.

 \subsection{Problem Evidence and Market Impact}
-Documented instances of agent-driven market disruptions - Quantitative evidence of pricing manipulation - Case studies from affected industries

-\subsection{Theoretical Foundations: Economic Prallels}
+The statistical issue of contamination in dynamic pricing systems that observe demand features as a means to update prices has been documented in various previous contexts. The airline industry (which has accounted for 24\% of observed disruptions) has seen malicious activity with a measurable impact on key performance indicators, visible as skew in the look-to-book metrics. Excessive reconnaissance traffic inflates search volume without corresponding completed bookings, thereby skewing demand forecasts and disrupting dynamic pricing models. Distorted demand proxies have also been observed to pose a significant threat to inventory management by creating artificial scarcity that distorts the demand-supply relationships in the enterprise model. Censored demand as shown in \cite{amjad_censored_2017} can also be observed in low-bias demand under-estimation caused by a distortion effect coming from non-human traffic data \cite{imperva_rapid_2025}.
+
+When dynamic pricing algorithms operate on highly contaminated or noisy data, the risk of inaccurate price inferences grows significantly. The emergent mitigation driven by uninformed reward and regret signals might lead to price suppression for sales continuity, which harms margins and results in revenue loss. Systems that poorly fit undesired behavior might instead engage in price gouging, which calls for strong guardrails while preserving the targeted business strategy \cite{mullapudi_reinforcement_nodate}.
+
+
+%Documented instances of agent-driven market disruptions - Quantitative evidence of pricing manipulation - Case studies from affected industries
+
+\subsection{Theoretical Foundations: Economic Parallels}
+
+
+
+Early hints of sequential price exploration appear in the standard English auction discussed in \cite{varian_economic_1995}, which leads to a marginally different cost to the bidder than the reservation price of the seller. This is a setting in which the buyer incurs no cost for their actions or for exploring prices in the market. The author proposes that any agent responsible for the pricing of a good must be immune to dynamic strategies which might extract private information from a market. A key takeaway, which relates to the Vickrey auction mechanism (a \textit{direct mechanism}), is that not only are defenses against such exploitation necessary, but so is the construction of a mechanism in which revealing the true willingness to pay is the dominant strategy for commerce.
+
+As in classical revenue-maximizing auctions \cite{roughgarden_cs364a_2013}, we assume that the human actor in our system has a private valuation $v$, which we draw from distributions defined later. Importantly, the agent proxy has no mechanism to convey this private information into the demand data that directly impacts the pricing systems.
+
+% Economic foundations: relating the problem to options pricing theory. Cost of Information (COI) concept and its relevance
+
+% Link Coasean Singularity and other economic market theory and highlight specific information of supra competitive pricing.

-Economic foundations: relating the problem to options pricing theory. Cost of Information (COI) concept and its relevance

 \subsection{Landscape of Existing Work}

--- a/paper/src/chapters/03-methodology.tex
+++ b/paper/src/chapters/03-methodology.tex
@@ -1,68 +1,251 @@
 \section{Methodology}

+This section details the theoretical and practical framework developed to address dynamic pricing under the influence of non-human actors. We begin by formalizing the problem environment and the nature of the actors. We then derive the \textit{Cost of Information} (COI) theorem, proving the erosion of pricing power in the limit of agent saturation. Following this, we outline our generative contamination strategy using GOFAI-driven separability and transition probability learning. Finally, we formulate the robust control problem as a Stackelberg game solved via Distributionally Robust Reinforcement Learning (DR-RL) with constructed ambiguity sets.

 \subsection{Problem Formalization}

-Mathematical formalization of agent-induced pricing distortions. Formal definition of potential loss mechanisms $\alpha D$
+We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $Y_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.

-We consider a business across time during which we have an evolving vector $p_t \in \Re^N$ where $N$ is the number of products in our catalogue. our price vector is directly dependent on a demand function $q_t$ which we define as a linear method of a price elasticity matrix $B_t$. This is the same setup that Microsoft created in their research.
+Each session produces a trajectory of observable events $\tau_s = (e_{s,1}, \ldots, e_{s,L_s})$. An event $e_{s,k}$ is a tuple defined as:
+\begin{equation}
+e_{s,k} = (a_{s,k}, i_{s,k}, t_{s,k})
+\end{equation}
+where:
+\begin{itemize}
+    \item $a_{s,k} \in \mathcal{A}$ is the action taken (e.g., \texttt{view\_item}, \texttt{add\_to\_cart}).
+    \item $i_{s,k} \in \{1, \ldots, N\}$ is the target item index.
+    \item $t_{s,k} \in \mathbb{R}_+$ is the continuous timestamp.
+\end{itemize}

-We gether interaction data from users interacting with a sample platform simulating a hotel/airline which generates interaction distributions $I_t = \{(p_t, q_t^\text{obs}, \pi_t)\}_{t=1}^T$
+The platform does not directly observe the true underlying demand function $d(p)$. Instead, it observes a behavioral proxy $\hat{q}_t$, which is a composite signal derived from the mixture of actor types. We define the demand proxy for product $i$ at epoch $t$ as a weighted aggregation of events:
+\begin{equation}
+\hat{q}_{t,i} = \sum_{s \in \mathcal{S}_t} \sum_{k=1}^{L_s} \omega(a_{s,k}) \cdot \mathbb{1}[i_{s,k} = i]
+\end{equation}
+where $\omega: \mathcal{A} \to \mathbb{R}_+$ assigns weights to actions based on their signal strength regarding willingness to pay.
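+
+As a purely illustrative example (the weights here are hypothetical and are not the values used in our pipeline), suppose $\omega(\texttt{view\_item}) = 1$ and $\omega(\texttt{add\_to\_cart}) = 3$. If, during epoch $t$, item $i$ receives two \texttt{view\_item} events and one \texttt{add\_to\_cart} event across all sessions in $\mathcal{S}_t$, then
+\[
+\hat{q}_{t,i} = 2 \cdot 1 + 1 \cdot 3 = 5 .
+\]
+Note that the proxy is indifferent to whether these events come from a human or an agent session, which is precisely how agent traffic can contaminate it.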
+
+\subsubsection{Actor Types and Demand Curves}
+We formalize the heterogeneity of actors by introducing a type space $\Theta$. An actor of class $Y_s$ is further parameterized by a type $\theta \sim \mathcal{D}_{Y}$. This type determines the actor's demand response function $d(p; \theta)$, sampled from a distribution of possible demand curves. The total observed demand is a stochastic process governed by the mixture:
+\begin{equation}
+Q(p) = (1-\alpha) \cdot \mathbb{E}_{\theta \sim \mathcal{D}_H}[d(p; \theta)] + \alpha \cdot \mathbb{E}_{\theta \sim \mathcal{D}_A}[d(p; \theta)] + \epsilon_t
+\end{equation}
+where $\alpha \in [0, 1]$ represents the contamination parameter (proportion of agents) and $\epsilon_t$ is non-stationary market noise.


-\subsection{Cost of Information Framework}

-Mathematical demonstration and validation of the COI and citation backed evidence, and framework overview + show harm to user via other cost distortions. Maybe split into 3.2.1 (COI Theory) and 3.2.2 (Framework Design)
+\subsection{Cost of Information (COI) Framework}
+
+The \textit{Cost of Information} (COI) represents the markup a pricing policy $\pi$ attempts to extract from the market by leveraging demand signals. We define COI as the expected premium over the minimum viable price $\underline{p}$ (or marginal cost). It also quantifies the financial stakes of the information asymmetry between the platform and the actors.
+
+\begin{definition}[Cost of Information]
+Let $\pi(\tau)$ be a pricing policy mapping interaction histories to prices. The COI is defined as:
+\begin{align}
+\text{COI} &= \mathbb{E}[P] - \underline{p} \\
+            &= \int_{\underline{p}}^{\bar{p}} (1 - F_\pi(p)) \, dp
+\end{align}
+where $F_\pi(p)$ is the cumulative distribution function of prices generated by $\pi$ under standard operating conditions.
+\end{definition}

-\subsection{System Architecture}
 \begin{figure}[ht]
-\centering
-\begin{tikzpicture}[
-  node distance=1.5cm and 2.5cm,
-  box/.style={rectangle, draw, thick, minimum height=1cm, minimum width=3cm, align=center, fill=blue!10},
-  kafka/.style={rectangle, draw=orange, thick, minimum height=1cm, minimum width=3cm, align=center, fill=orange!15},
-  arrow/.style={thick,->,>=Stealth}
-]
+    \centering
+    \begin{tikzpicture}[scale=1.2]
+        % Define the Gaussian function: centered at 2
+        \def\bellcurve(#1){1.5 * exp(-0.5*((#1-2)/0.6)^2)}

-% Nodes
-\node[box] (webapp) {Web Application \\ (Producer \& Consumer)};
-\node[kafka, below=of webapp] (kafka) {Apache Kafka \\ Cluster};
-\node[box, below=of kafka] (backend) {Backend Services / Microservices \\ (Producers and Consumers)};
+        % Draw the main axis
+        \draw[->, thick] (0, 0) -- (4.5, 0) node[right] {$p$};
+        \draw[->, thick] (0, 0) -- (0, 2) node[above] {Density};

-% Connections
-\draw[arrow] (webapp) to[out=210,in=150] node[above]{Publish} (kafka);
-\draw[arrow] (kafka) to[out=50,in=330] node[below]{Consume} (webapp);
-\draw[arrow] (backend) -- node[above]{Publish/Consume} (kafka);
+        \draw[thick, smooth, samples=100] plot[domain=0:4] (\x, {\bellcurve(\x)});
+        \node at (3.2, 1.2) {$f_\pi(p)$};

-% Optional: Kafka internal components
-%\node[below=0.7cm of kafka, align=center] (topics) {Topics \\ Partitions};
+        % Define p_min and E[p]
+        \def\pmin{0.8}
+        \def\mean{2}

-% Optional background
-\begin{scope}[on background layer]
-  \node[draw, rounded corners, fill=orange!5, fit=(kafka), inner sep=0.3cm] {};
-\end{scope}
-\end{tikzpicture}
-\caption{Technical Diagram}
+        % Vertical lines
+        \draw[dashed] (\pmin, 0) -- (\pmin, 2.0);
+        \draw[dashed] (\mean, 0) -- (\mean, 2.0);
+
+        % Labels on axis
+        \node[below] at (\pmin, 0) {$\underline{p}$};
+        \node[below] at (\mean, 0) {$\mathbb{E}[p]$};
+
+        \draw[<->, thick, red] (\pmin, 2.0) -- (\mean, 2.0) node[midway, above] {COI};
+
+    \end{tikzpicture}
+    \caption{Illustration of the Cost of Information (COI). The COI is defined as the difference between the expected price $\mathbb{E}[p]$ realized by the policy and the minimum viable price $\underline{p}$.}
+    \label{fig:coi_illustration}
 \end{figure}

-High level overview of how it works
+We now formally demonstrate that standard dynamic pricing mechanisms are not incentive-compatible with high-frequency agentic traffic. As the number of independent competitive agents $N$ querying the system grows, the platform's ability to sustain a COI vanishes.
+
+\begin{theorem}[COI Erosion in the Limit]
+Let $N$ be the number of independent, utility-maximizing agents querying the platform. Let $p_{(1)}$ be the first order statistic (minimum) of the prices offered to these agents. As $N \to \infty$, the Cost of Information converges to 0.
+\end{theorem}
+
+\begin{proof}
+Let $p_1, \ldots, p_N$ be independent and identically distributed (i.i.d.) price samples drawn from the policy's distribution $F(p)$ with support $[\underline{p}, \bar{p}]$. The realizable price for an optimal searching agent is the first order statistic $p_{(1)} = \min(p_1, \ldots, p_N)$.
+
+The survival function (or reliability function) of the minimum price is given by:
+\begin{equation}
+S_{p_{(1)}}(t) = P(p_{(1)} > t) = [1 - F(t)]^N
+\end{equation}
+
+To determine the expected value $\mathbb{E}[p_{(1)}]$, we recall the property that for any continuous random variable $X$ with support $[A, B]$, the expectation can be expressed as the lower bound plus the integral of the survival function:
+\begin{equation}
+\mathbb{E}[X] = A + \int_{A}^{B} P(X > t) \, dt
+\end{equation}
+
+Applying this to our pricing statistic where the lower bound is $\underline{p}$:
+\begin{align}
+\mathbb{E}[p_{(1)}] &= \underline{p} + \int_{\underline{p}}^{\bar{p}} P(p_{(1)} > t) \, dt \\
+&= \underline{p} + \int_{\underline{p}}^{\bar{p}} [1 - F(t)]^N \, dt
+\end{align}
+
+Since $\underline{p}$ is the lower endpoint of the support of $F$, for any $t > \underline{p}$ we have the strict inequality $F(t) > 0$, implying $0 \le 1 - F(t) < 1$. Consequently, as $N \to \infty$, the term $[1 - F(t)]^N$ converges to 0 pointwise for all $t > \underline{p}$.
+
+Applying the Lebesgue Dominated Convergence Theorem (noting that the integrand is bounded by 1 on the finite interval $[\underline{p}, \bar{p}]$):
+\begin{equation}
+\lim_{N \to \infty} \int_{\underline{p}}^{\bar{p}} [1 - F(t)]^N \, dt = \int_{\underline{p}}^{\bar{p}} 0 \, dt = 0
+\end{equation}
+
+Substituting this back into the expression for COI:
+\begin{align}
+\lim_{N \to \infty} \text{COI} &= \lim_{N \to \infty} (\mathbb{E}[p_{(1)}] - \underline{p}) \\
+&= \lim_{N \to \infty} \left( (\underline{p} + 0) - \underline{p} \right) \\
+&= 0
+\end{align}
+\end{proof}
+
+
+This result proves that standard pricing policies $\pi$ fail to extract surplus in the presence of large-scale agentic search, necessitating a robust counter-mechanism.
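+
+As an illustrative special case (an assumption made only for this example, not a property of our policy), suppose prices are uniformly distributed, $F(t) = (t - \underline{p}) / (\bar{p} - \underline{p})$ on $[\underline{p}, \bar{p}]$. Substituting $u = (t - \underline{p}) / (\bar{p} - \underline{p})$ gives
+\begin{align*}
+\mathbb{E}[p_{(1)}] - \underline{p} &= \int_{\underline{p}}^{\bar{p}} \left[ 1 - \frac{t - \underline{p}}{\bar{p} - \underline{p}} \right]^N dt \\
+&= (\bar{p} - \underline{p}) \int_0^1 (1 - u)^N \, du = \frac{\bar{p} - \underline{p}}{N + 1} ,
+\end{align*}
+so the extractable premium decays as $1/(N+1)$. With the hypothetical values $\underline{p} = 80$ and $\bar{p} = 120$, it is $20$ for $N = 1$, $40/11 \approx 3.64$ for $N = 10$, and $40/101 \approx 0.40$ for $N = 100$.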
+
+% The DRO objective creates a lower bound on COI extraction, effectively guaranteeing a minimum margin even in the presence of adversarial agents. we need to prove this and demonstrate that in a theorem.
+
+
+%Mathematical demonstration and validation of the COI and citation backed evidence, and framework overview + show harm to user via other cost distortions. Maybe split into 3.2.1 (COI Theory) and 3.2.2 (Framework Design)
+
+\subsection{System Architecture: Hybrid Kappa-Lambda Architecture}
+
+To ground our research in real interactions, we built a robust e-commerce web platform. We first surveyed the leading airline and hotel booking platforms to identify the interface patterns that effectively manage complex travel data. Our analysis revealed a clear industry standard: both sectors rely on tabbed service selection and left-sidebar filtering to streamline navigation, but they diverge in result presentation. Airlines use visual date-price bars and multi-step wizards to optimize for logistical transparency, whereas hotel platforms leverage image-led cards and scarcity triggers to drive emotional engagement and urgency. Our web framework defines a data-agnostic boilerplate that can be seeded with any data modality through an easy-to-tailor pattern, which we use to define a \texttt{hotel} and an \texttt{airline} mode. Each mode is deployed individually via an environment-level argument that adjusts the proxy routing, through custom middleware in Next.js, to render only the desired mode. The purpose was to create a baseline adaptable to any use case or desired commercial application.
+
+
+The architecture of this platform begins with the deployed web apps posting interaction data to our backend, which processes each ingested interaction and stores it in a Kafka cluster. This cluster serves as our data reservoir, associating each interaction with its session and, importantly, with the experiment it belongs to. Beyond behavioral interactions, our pricing provider microservice, once called by the frontend, reports each observed (queried) product price into Kafka. Our pipeline subscribes to this Kafka cluster and runs on a schedule in Airflow, with the option of a manual trigger. The final stage of the pricing pipeline writes the computed dynamic prices into a Redis database for quick updates; the pricing provider then reads them and the web app displays them. This generic end-to-end mechanism is applicable to a variety of e-commerce tasks. We intentionally emphasized the development of this infrastructure to establish a reproducible framework for interaction and to minimize noise.
+
+
+\subsubsection{DevOps Principles}
+
+\subsubsection{Online Dynamic Pricing}
+
+Dynamic pricing is handled by a pipeline that computes a per-product demand estimate over a specific window of the data, defined by the period $T$, which defaults to 5 minutes. The pipeline computes a demand estimate vector $\hat{q} \in \mathbb{R}^N$ as a weighted sum of interactions for each product, and additionally computes a price elasticity vector $\hat{\epsilon}$ of the same dimension. The resulting feature matrix has size $N \times 2$, which we map to a new price vector $\hat{p} \in \mathbb{R}^N$. The transformation that governs this dynamic pricing is a simple surge-based rule (a special case of our later-defined policy $\pi$):
+
+\begin{equation}
+\hat{p}_i = \begin{cases}
+p_{0,i} \cdot \lambda_{\text{surge}} & \text{if } \hat{q}_i \geq \theta_{\text{high}} \\
+p_{0,i} \cdot \lambda_{\text{disc}} & \text{if } \hat{q}_i \leq \theta_{\text{low}} \\
+p_{0,i} & \text{otherwise}
+\end{cases}
+\quad \forall i \in \{1, \ldots, N\}
+\end{equation}
+
+where $p_0 \in \mathbb{R}^N$ is the base price vector (which is seeded into our database distinctly for each mode of the commerce platform), $\theta_{\text{high}}, \theta_{\text{low}} \in \mathbb{R}$ are demand thresholds defining surge and discount regions, and $\lambda_{\text{surge}}, \lambda_{\text{disc}} \in \mathbb{R}^+$ are multiplicative factors with typical values $\lambda_{\text{surge}} = 1.2$ and $\lambda_{\text{disc}} = 0.9$. This piecewise function enables rapid price adjustment in response to observed demand without requiring complex elasticity estimation or historical calibration, allowing us to expose actors within our experiments to a system with a dynamic component of pricing.
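+
+As a purely illustrative instance of this rule (the base price, thresholds, and demand values below are hypothetical numbers chosen for the example, not calibrated settings), take $p_{0,i} = 100$, $\theta_{\text{low}} = 5$, $\theta_{\text{high}} = 20$, and the typical multipliers $\lambda_{\text{surge}} = 1.2$ and $\lambda_{\text{disc}} = 0.9$. The three regimes then evaluate to:
+
+\begin{equation*}
+\hat{p}_i = \begin{cases}
+100 \cdot 1.2 = 120 & \text{if } \hat{q}_i = 25 \geq \theta_{\text{high}} \\
+100 \cdot 0.9 = 90 & \text{if } \hat{q}_i = 3 \leq \theta_{\text{low}} \\
+100 & \text{if } \hat{q}_i = 12
+\end{cases}
+\end{equation*}
+
+The cases are mutually exclusive only when $\theta_{\text{low}} < \theta_{\text{high}}$, which this example assumes.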
+
+For our offline experiments, we generalize this into a master function that encompasses distinct demand estimation and pricing strategies:
+
+\begin{align}
+V_t(\cdot) = \max_{p_t} \min_{Q \in \mathcal{U}(\hat{d})} \mathbb{E}_{d \sim Q} \left[ p_t \cdot d(p_t, x_t; \theta) + \psi V_{t+1}(\cdot) \right]
+\end{align}
+
+We consider different substitutions into this function, which will later serve as hyperparameters.
+
 \subsection{Experimental Design}
-Study methodology and approach. Data acquisition strategy. Defined objectives and success criteria. Observable metrics and KPIs

-\subsection{Dynamic Pricing Algorithm Analysis}
-Deep dive into how the algorithm works, different kinds and justification for chosen appraoches + agent impact modeling and quantification.
-\subsection{Reinforcement Learning Formulation}
-How do we define the state space, action space and reward function breakdown and algorithm benchmarking.
-POSSIBLY: Expand into full subsections: 3.6.1 (State-Action Space), 3.6.2 (Reward Design), 3.6.3 (Benchmarking)
+The experimentation begins with the design of goals, with careful consideration to ensure uniform coverage of the different variables within each product architecture of the hotel and airline platforms. Our crafted collection of goals (jobs to be done) is then tracked in a PostgreSQL database, with one table for goals and another for experiment runs and their associated goals, in a one-to-one experiment-goal relationship.
+
+This effort to gather interaction data constitutes the first half of our research. With the collected behavioral data, enhanced by our feature augmentation, we can separate sessions into two classes $y \in \{A, H\}$ (agent or human), with a probability $p$ that depends on the session-specific features. To address the second loop of our system, we use this discriminative capability to enhance the learner in our surrogate dynamic pricing task, which simulates an independent dynamic pricing scenario. In this scenario we can train a more controlled policy that accounts for true demand signals under contamination from non-human actors.


-\begin{algorithm}[t]
-\DontPrintSemicolon
-\KwIn{stepsize $\eta$, smoothing $\delta$, rank $d$}
-\For{$t=1$ \KwTo $T$}{
-  Sample $u_t$ on unit sphere; set $x_t^\prime=x_t+\delta u_t$\;
-  Set $p_t \gets U x_t^\prime$ and observe $q_t, R_t(p_t)$\;
-  $x_{t+1} \gets \Pi\_{\mathcal{X}}(x_t-\eta R_t(p_t) u_t)$\;
-}
-\caption{Online Pricing Optimization (template)}
-\end{algorithm}
+Our approach can be summarized in three stages. First, we observe and \textit{vectorize} the behavioral interaction data from our experiments. Second, we develop the separability, which deepens our semantic understanding of the behavioral patterns. Finally, we use the resulting learner as a defensive mechanism within the simulation stage of a controlled dynamic pricing loop.
+
+\begin{figure}[ht]
+  \resizebox{\columnwidth}{!}{%
+    \input{chapters/loop_figure.tex}
+  }
+  \caption{Overview of the Dynamic Pricing Tasks.}
+\end{figure}
+
+
+Study methodology and approach. Data acquisition strategy. Defined objectives and success criteria. Observable metrics and KPIs.
+
+
+\subsection{Generative Contamination and Separability}
+
+To develop a robust pricing agent, we require a simulation environment capable of generating realistic, contaminated interaction data. We achieve this by learning from our Phantom platform data using a two-stage approach.
+
+
+
+\subsubsection{GOFAI-Based Separability}
+We employ Good Old-Fashioned AI (GOFAI) heuristics to generate initial weak labels for separability. We define a set of rule-based predicates $\phi_j: \tau \to \{0, 1\}$ to partition the dataset $\mathcal{D}$ into high-confidence sets $\mathcal{D}_H$ and $\mathcal{D}_A$. We construct a distinct MDP for each behavioral profile (human and agent) and compute the divergence $D_{KL}$ between them. From initial findings, we obtain a KL divergence of approximately $2.0236$ across the state-transition probabilities, which are visualized in Figures~\ref{fig:human_mdp_viz} and~\ref{fig:agent_mdp_viz}.
+
+\begin{figure}[ht]
+    \centering
+    \includegraphics[width=0.8\textwidth]{chapters/mdp_human.pdf}
+    \caption{Markov Decision Process visualization illustrating the behavioral transition dynamics for human actions.}
+    \label{fig:human_mdp_viz}
+\end{figure}
+
+\begin{figure}[ht]
+    \centering
+    \includegraphics[width=0.8\textwidth]{chapters/mdp_agent.pdf}
+    \caption{Markov Decision Process visualization illustrating the behavioral transition dynamics for \textbf{agent} behavior profiles. The state space and transition probabilities are learned from observed session trajectories to enable generative contamination.}
+    \label{fig:agent_mdp_viz}
+  \end{figure}
+
+\subsubsection{Transition Probability Estimation}
+For both subsets, we model the session dynamics as a Markov Decision Process (MDP) and estimate the transition kernel $\mathcal{T}$. The probability of transitioning to state $s'$ given state $s$ is estimated via maximum likelihood:
+\begin{equation}
+    \hat{P}(s' \mid s) = \frac{N(s, s')}{\sum_{k \in \mathcal{S}} N(s, k)}
+\end{equation}
+where $N(s, s')$ is the count of observed transitions. This allows us to construct a \textit{Contamination Generator} $\mathcal{G}(\alpha)$. Given a clean trajectory dataset, $\mathcal{G}$ injects synthetic agent trajectories sampled from the learned transition matrix $\hat{P}_A$ until the effective mixing ratio reaches $\alpha$.
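+
+As a small worked example of this estimator (the state names and counts are hypothetical and serve only as illustration), suppose that from a state $s = \text{view}$ we observe $N(s, \text{cart}) = 6$, $N(s, \text{view}) = 3$, and $N(s, \text{exit}) = 1$. Then
+\begin{equation*}
+    \hat{P}(\text{cart} \mid s) = \frac{6}{6 + 3 + 1} = 0.6, \quad
+    \hat{P}(\text{view} \mid s) = 0.3, \quad
+    \hat{P}(\text{exit} \mid s) = 0.1,
+\end{equation*}
+and the estimated row sums to one, as required of a transition distribution. If we further assume that $\alpha$ denotes the fraction of agent trajectories in the mixed dataset, then contaminating $n_H$ clean trajectories requires $n_A$ synthetic ones with
+\begin{equation*}
+    \frac{n_A}{n_H + n_A} = \alpha \quad \Longrightarrow \quad n_A = \frac{\alpha}{1 - \alpha} \, n_H ,
+\end{equation*}
+so that, for instance, $n_H = 900$ and $\alpha = 0.1$ give $n_A = 100$.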
+
+\subsection{Distributionally Robust Reinforcement Learning (DR-RL)}
+
+We formulate the pricing problem as a Stackelberg Game where the Platform (Leader) sets prices $p_t$ and the Aggregate Demand (Follower) responds. However, the exact mixing parameter $\alpha$ and the demand distribution shift are non-stationary and unknown in online settings. Relying on a simple error term $\epsilon$ is insufficient. Instead, we adopt a Distributionally Robust Optimization (DRO) objective.
+
+\subsubsection{Ambiguity Set Construction}
+We define an ambiguity set $\mathcal{U}_\epsilon(\hat{P}_N)$ centered on our empirical reference distribution $\hat{P}_N$ (derived from the generator $\mathcal{G}$). We use the Wasserstein distance $W_p$ to define the set of plausible demand distributions the agent might face:
+\begin{equation}
+\mathcal{U}_\epsilon(\hat{P}_N) = \left\{ Q \in \mathcal{P}(\Xi) : W_p(Q, \hat{P}_N) \le \epsilon \right\}
+\end{equation}
+This set contains all distributions that are statistically close to our observed training data while still allowing for adversarial shifts (e.g., sudden bot spikes).
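+
+For reference, the $p$-Wasserstein distance $W_p$ has the standard optimal-transport form, stated here under the assumption that $\Xi$ carries a norm $\|\cdot\|$ (the choice of ground metric is an assumption of this sketch and is not fixed by the definition of the ambiguity set):
+\begin{equation*}
+W_p(Q, \hat{P}_N) = \left( \inf_{\gamma \in \Gamma(Q, \hat{P}_N)} \mathbb{E}_{(\xi, \xi') \sim \gamma} \left[ \|\xi - \xi'\|^p \right] \right)^{1/p}
+\end{equation*}
+where $\Gamma(Q, \hat{P}_N)$ is the set of joint distributions with marginals $Q$ and $\hat{P}_N$. Setting $\epsilon = 0$ collapses the ambiguity set to $\{\hat{P}_N\}$, while larger $\epsilon$ admits larger distributional shifts.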
+
+\subsubsection{The Min-Max Objective}
+The robust policy $\pi^*$ is obtained by solving the maximin problem:
+\begin{equation}
+\pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}(p) \right]
+\end{equation}
+where $R(p, d)$ is the revenue function and $\lambda$ weighs the penalty for information leakage (COI).
+
+\subsubsection{Actor Implementation}
+In our simulation, the ``Follower'' is implemented as a set of Actors. Each Actor is initialized with a type $\theta$, which samples a specific demand curve $d(p; \theta)$ from the latent distribution. This formalization ensures that our DR-RL agent does not overfit to a single deterministic demand function but instead learns a policy that is robust to the distributional uncertainty defined by $\mathcal{U}_\epsilon$.
+
+
+As part of our reward engineering, we consider a UX factor ($UX \in [0,1]$), which is our proxy for user-experience degradation. It is computed as a mixture that includes a contribution from the separability model's metric $\frac{1}{\text{Specificity}}$.
+
+\begin{figure}[ht]
+  \centering
+  \resizebox{0.5\columnwidth}{!}{%
+    \input{chapters/balance_figure.tex}
+  }
+  \caption{Introducing the UX index allows us to better distinguish the kind of impact different methods have and allows us to compare them on this Pareto-like scale.}
+\end{figure}
+
+We also need to consider a policy akin to taxation of the agents, drawing on strategy-proof mechanism design and specifically the Vickrey-Clarke-Groves (VCG) payment rule. We link and prove that this would create an incentive under which truth-telling becomes the dominant strategy.
+
+\section{Heuristics as part of neuro-inspired steering systems}
+
+Steve Burns, superior colliculus (face heuristics): we create this sort of part of the `brain' + amortized inference.
+
+We could say that a DQN, for example, is the learning subsystem; then, within our reward mechanism or some other computational method, we introduce a steering subsystem that acts as the proposed ``pricing heuristic'' against the given non-human transaction data.
+
+\section{Market construction}
--- a/paper/src/chapters/05-discussion.tex
+++ b/paper/src/chapters/05-discussion.tex
@@ -1,5 +1,15 @@
 \section{Discussion}

+\subsection{Transition to Agentic Market Microstructure}
+
+Our analysis of the interaction dynamics between the platform and non-human actors suggests that current static pricing models are insufficient for an agent-mediated economy. If we assume a transition toward a direct revelation mechanism, where actors must reveal their true valuation of a good through bidding dynamics, we inevitably introduce significant stochasticity into the pricing system. Unlike traditional e-commerce, where prices are relatively sticky, such a mechanism implies high volatility characteristic of financial equity markets (albeit without the fungibility).
+
+However, e-commerce commodities differ fundamentally from financial securities: they possess a hard floor defined by unit economics and reservation prices. Although the market might react enthusiastically to an iPhone priced at \$1, such a transaction is not permissible. The platform must establish an initial valuation anchor ($P_{0}$), defined by the marginal cost plus a target margin, around which the market price is permitted to fluctuate. We propose the introduction of GenAI Agents as Institutional Market Makers.
+
+This also assumes that AI agents will be given the expected transactional capabilities.
+
+
+
 \subsection{Risk Assessment and Limitations}

 Acknowledge risks and constraints and data sizes.
--- a/paper/src/chapters/06-conclusion.tex
+++ b/paper/src/chapters/06-conclusion.tex
@@ -1,6 +1,6 @@
 \section{Conclusion}

-\subsection{Summary of contributions }
+\subsection{Summary of contributions}
 Restate the thesis and key findings with validation of research objectives.

 \subsection{Future Works and Next Steps}
--- a/paper/src/chapters/balance_figure.tex
+++ b/paper/src/chapters/balance_figure.tex
@@ -0,0 +1,38 @@
+
+\begin{tikzpicture}[
+    % Styles for consistency
+    axis/.style={->, >=Stealth, line width=1.2pt, color=black!85},
+    curve/.style={color=black, line width=2.5pt},
+    point/.style={circle, fill=black, inner sep=0pt, minimum size=6pt},
+    label_text/.style={font=\large, align=center, color=black},
+    annotation_line/.style={thick, -, color=black!60}
+]
+
+    % Define Radius
+    \def\R{5}
+
+    % Draw Axes
+    % Extended slightly beyond radius (\R + 1)
+    \draw[axis] (0,0) -- (\R+1.5,0) node[midway, below=10pt, font=\bfseries\large] {UX Index};
+    \draw[axis] (0,0) -- (0,\R+1.5) node[midway, left=15pt, rotate=90, font=\bfseries\large] {Performance};
+
+    % Draw Perfect 1/4 Circle
+    % Syntax: arc (start_angle : end_angle : radius)
+    \draw[curve] (0,\R) arc (90:0:\R);
+
+    % 1. Paranoid (High Performance side) -> Angle 75 degrees
+    \node[point] (p1) at (75:\R) {};
+    \node[label_text, above right=0.1cm and 0.1cm of p1] (l1) {Paranoid};
+    \draw[annotation_line] (l1) -- (p1);
+
+    % 2. Perfect Detection (Exact Middle) -> Angle 45 degrees
+    \node[point] (p2) at (45:\R) {};
+    \node[label_text, above right=0.2cm and 0.2cm of p2] (l2) {Perfect Detection};
+    \draw[annotation_line] (l2) -- (p2);
+
+    % 3. No Detection (High UX side) -> Angle 15 degrees
+    \node[point] (p3) at (15:\R) {};
+    \node[label_text, right=0.5cm of p3] (l3) {No Detection};
+    \draw[annotation_line] (l3) -- (p3);
+
+\end{tikzpicture}
--- a/paper/src/chapters/feature_table.tex
+++ b/paper/src/chapters/feature_table.tex
@@ -0,0 +1,65 @@
+\begin{table}[ht]
+\centering
+\small
+\resizebox{\columnwidth}{!}{%
+\begin{tabular}{p{4.5cm}p{1.5cm}p{6cm}}
+\hline
+\textbf{Feature} & \textbf{Type} & \textbf{Description} \\
+\hline
+\multicolumn{3}{l}{\textit{Session Identifiers}} \\
+sessionId & object & Unique identifier for user session \\
+experimentId & object & Experiment run identifier \\
+\hline
+\multicolumn{3}{l}{\textit{Temporal Features}} \\
+session\_duration\_sec & float & Total session duration in seconds \\
+avg\_time\_between\_events & float & Mean inter-event time \\
+std\_time\_between\_events & float & Standard deviation of inter-event times \\
+min\_time\_between\_events & float & Minimum time between consecutive events \\
+session\_start\_hour & int & Hour of day when session started \\
+\hline
+\multicolumn{3}{l}{\textit{Interaction Metrics}} \\
+total\_interactions & int & Count of all user interactions \\
+total\_events & int & Total number of tracked events \\
+interaction\_velocity & float & Rate of interactions per time unit \\
+max\_velocity\_5min & int & Peak interaction count in any 5-minute window \\
+\hline
+\multicolumn{3}{l}{\textit{Navigation Behavior}} \\
+unique\_pages & int & Number of distinct pages visited \\
+page\_views & int & Total page view events \\
+\hline
+\multicolumn{3}{l}{\textit{Product Engagement}} \\
+item\_views & int & Number of product detail views \\
+unique\_products\_viewed & int & Count of distinct products examined \\
+product\_view\_depth & int & Repeat views of same products \\
+\hline
+\multicolumn{3}{l}{\textit{Conversion Funnel}} \\
+cart\_adds & int & Number of items added to cart \\
+purchases & int & Completed transactions \\
+cart\_to\_view\_ratio & float & Ratio of cart additions to item views \\
+conversion\_rate & float & Purchase to view conversion \\
+\hline
+\multicolumn{3}{l}{\textit{Interaction Quality}} \\
+hover\_events & int & Mouse hover event count \\
+hover\_intensity & float & Hover events per interaction \\
+\hline
+\multicolumn{3}{l}{\textit{Price Behavior}} \\
+avg\_price\_seen & float & Mean price across viewed products \\
+min\_price\_seen & float & Lowest price encountered \\
+max\_price\_seen & float & Highest price encountered \\
+price\_range & float & Difference between max and min prices seen \\
+\hline
+\multicolumn{3}{l}{\textit{Technical Fingerprinting}} \\
+is\_headless & bool & Headless browser detection flag \\
+is\_automation & bool & Automation framework detection flag \\
+browser\_family & object & Browser type classification \\
+\hline
+\multicolumn{3}{l}{\textit{Experimental Labels}} \\
+is\_agent & bool & Ground truth agent classification \\
+xp\_human\_only & bool & Human-only experiment indicator \\
+xp\_market\_mode & object & Market context (hotel/airline) \\
+\hline
+\end{tabular}%
+}
+\caption{Feature matrix schema for session-level behavioral classification (32 features total).}
+\label{tab:features}
+\end{table}
--- a/paper/src/chapters/loop_figure.tex
+++ b/paper/src/chapters/loop_figure.tex
@@ -0,0 +1,110 @@
+\definecolor{mygreenfill}{RGB}{169, 234, 186}
+\definecolor{mygreenborder}{RGB}{29, 145, 61}
+\definecolor{mybluefill}{RGB}{204, 222, 255}
+\definecolor{myblueborder}{RGB}{66, 106, 189}
+\definecolor{mygray}{RGB}{150, 150, 150}
+
+
+
+\begin{tikzpicture}[
+    node distance=2cm,
+    % Style for Green Nodes
+    greenbox/.style={
+        rectangle,
+        draw=mygreenborder,
+        fill=mygreenfill,
+        line width=1.2pt,
+        align=center,
+        minimum height=1cm
+    },
+    % Style for Blue Nodes
+    bluebox/.style={
+        rectangle,
+        draw=myblueborder,
+        fill=mybluefill,
+        line width=1.2pt,
+        align=center,
+        minimum height=1cm
+    },
+    % Style for Arrows
+    myarrow/.style={
+        ->,
+        >={Stealth[length=3mm, width=2mm]},
+        draw=black!80,
+        line width=1.2pt,
+        rounded corners=5pt
+    },
+    % Style for Background Dashed Circles
+    dashedloop/.style={
+        dashed,
+        draw=mygray,
+        line width=1pt
+    }
+]
+
+    % --- Coordinate Layout ---
+    % Defining a grid relative to the center
+
+    % Left Loop (Green) Nodes
+    \node[greenbox, minimum width=3.5cm] (commerce) at (-3.5, 2) {Commerce Experiment};
+    \node[greenbox, minimum width=1.5cm] (raw) at (-6.5, 0) {Raw\\Logs};
+    \node[greenbox, minimum width=1.5cm] (features) at (-4, -2.5) {Features};
+    \node[greenbox, minimum width=2.5cm] (classification) at (-1, -0.5) {Classification\\Training A/H};
+
+    % Right Loop (Blue) Nodes
+    \node[bluebox, minimum width=2.5cm] (trainedpricing) at (3.2, 2) {Trained Pricing};
+    \node[bluebox, minimum width=2.5cm] (policy) at (6.5, 0) {Trained Pricing\\Policy};
+    \node[bluebox, minimum width=2.5cm] (rlgym) at (3.2, -2.2) {RL Gym\\Training};
+
+    % --- Background Dashed Loops ---
+    \begin{scope}[on background layer]
+        % Left Loop Circle
+        \draw[dashedloop] (-3.5, 0) ellipse (3.5cm and 2.8cm);
+        % Right Loop Circle
+        \draw[dashedloop] (3.5, 0) ellipse (3.5cm and 2.8cm);
+    \end{scope}
+
+    % --- Arrows: Loop One (Green) ---
+    % Commerce -> Raw Logs
+    \draw[myarrow] (commerce.west) to[out=180, in=90] (raw.north);
+
+    % Raw Logs -> Features
+    \draw[myarrow] (raw.south) to[out=270, in=180] (features.west);
+
+    % Features -> Classification
+    \draw[myarrow] (features.east) to[out=0, in=250] (classification.south);
+
+    % Classification -> Commerce (Closing the loop)
+    \draw[myarrow] (classification.north) to[out=110, in=0] (commerce.east);
+
+    % --- Arrows: Loop Two (Blue) ---
+    % Classification (Green) -> RL Gym (Blue) - Crossing over
+    \draw[myarrow] (classification.east) to[out=0, in=180] (rlgym.west);
+
+    % RL Gym -> Policy
+    \draw[myarrow] (rlgym.east) to[out=0, in=270] (policy.south);
+
+    % Policy -> Trained Pricing
+    \draw[myarrow] (policy.north) to[out=90, in=0] (trainedpricing.east);
+
+    % Trained Pricing -> Commerce (Crossing back)
+    \draw[myarrow] (trainedpricing.west) -- node[above, font=\small, yshift=2pt] {New Pricing} (commerce.east);
+
+    % --- Text Labels ---
+
+    % Loop One Label
+    \node[align=center] at (-3.8, 0) {Loop One:\\Data \textit{(Online)}};
+
+    % Loop Two Label
+    \node[align=center] at (3.5, 0) {Loop Two:\\Defense Gym \textit{(Offline)}};
+
+    % Bottom Legend
+    \node[font=\small] (taskA) at (-4, -4) {Dynamic Pricing Task A};
+    \node[font=\small] (taskB) at (4, -4) {Dynamic Pricing Task B};
+    \node[font=\small] (indep) at (0, -4) {Independent};
+
+    % Arrows for bottom legend
+    \draw[->, >=Stealth, thick, darkgray] (indep.west) -- (taskA.east);
+    \draw[->, >=Stealth, thick, darkgray] (indep.east) -- (taskB.west);
+
+\end{tikzpicture}
--- a/paper/src/chapters/mdp_agent.pdf
+++ b/paper/src/chapters/mdp_agent.pdf
--- a/paper/src/chapters/mdp_human.pdf
+++ b/paper/src/chapters/mdp_human.pdf