From 4b89b64674d356f6a294f05dcbe03ac6b2d7ed13 Mon Sep 17 00:00:00 2001
From: Daniel Rosel <daniel@alves.world>
Date: Sun, 8 Mar 2026 13:27:17 +0100
Subject: [PATCH] monestary updates

---
 paper/src/chapters/01-intro.tex       |  2 +-
 paper/src/chapters/03-methodology.tex | 61 ++++++++++++++++++++++-----
 paper/src/chapters/04-results.tex     |  8 +++-
 paper/src/chapters/05-discussion.tex  | 10 +++--
 paper/src/chapters/06-conclusion.tex  |  7 ++-
 paper/src/main.tex                    |  7 +++
 6 files changed, 76 insertions(+), 19 deletions(-)

diff --git a/paper/src/chapters/01-intro.tex b/paper/src/chapters/01-intro.tex
index bd70de4..79e5f73 100644
--- a/paper/src/chapters/01-intro.tex
+++ b/paper/src/chapters/01-intro.tex
@@ -10,7 +10,7 @@
 
 In this paper we present an exploration and defense against the presence of new commercial entities in digitally powered platforms, preserving market equilibrium in the age of AI. This research establishes the following contributions: definition and formalization of non-human transactors in e-commerce platforms, development of a testing-ground for capturing the behavioral essence of these transactors across a large variety of digital systems, construction of a discriminative model (to prove separability) as a strong learner for downstream mitigation of contamination by non-human entities, translation of such learned separability into existing dynamic pricing machine learning loops, and finally establishment of a high-level KPI-affecting causal effect and cost-saving framework for the future of internet commerce in the presence of such non-human learners.
 
-This research effort touches a large variety of domains, spanning behavioral economics for understanding the rationality of behavior as theorized by the concept of homo economicus, agent-based modeling to translate our learned separability into disjoint dynamic pricing systems, reinforcement learning which serves as the SOTA for price-learners, and dynamic pricing and market equilibrium theory to understand the risks of possible supra-competitive pricing phenomena in cases of adversarial pricing systems driving the market out of equilibrium. \footnote{Given the rapid evolution of the field we acknowledge all developments with a cutoff set at the date of March 31st 2026.}
+This research effort touches a large variety of domains, spanning behavioral economics for understanding the rationality of behavior as theorized by the concept of homo economicus, agent-based modeling to translate our learned separability into disjoint dynamic pricing systems, reinforcement learning which serves as the SOTA for price-learners, and dynamic pricing and market equilibrium theory to understand the risks of possible supra-competitive pricing phenomena in cases of adversarial pricing systems driving the market out of equilibrium. \footnote{Given the rapid evolution of the field we acknowledge all developments with a cutoff set at the date of March 1st 2026.}
 
 \subsection{Motivation and Market Context}
 
diff --git a/paper/src/chapters/03-methodology.tex b/paper/src/chapters/03-methodology.tex
index fde3364..f667e5f 100644
--- a/paper/src/chapters/03-methodology.tex
+++ b/paper/src/chapters/03-methodology.tex
@@ -7,7 +7,7 @@ This section details the theoretical and practical framework developed to addres
 
 \subsection{Problem Formalization}
 
-We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $Y_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.
+We define a commercial environment where the platform interacts with a stream of sessions. Let $\mathcal{S}$ denote the set of all sessions. Each session $s \in \mathcal{S}$ is generated by an actor belonging to a latent class $\theta_s \in \{H, A\}$, where $H$ denotes Human and $A$ denotes Agent.
 
 Each session produces a trajectory of observable events $\tau_s = (e_{s,1}, \ldots, e_{s,L_s})$. An event $e_{s,k}$ is a tuple defined as:
 \begin{equation}
@@ -18,7 +18,7 @@ where:
     \item $a_{s,k} \in \mathcal{A}$ is the action taken (e.g., \texttt{view\_item}, \texttt{add\_to\_cart}).
     \item $i_{s,k} \in \{1, \ldots, N\}$ is the target item index.
     \item $t_{s,k} \in \mathbb{R}_+$ is the continuous timestamp.
-\end{itemize}
+\end{itemize}
 
 The platform does not directly observe the true underlying demand function $d(p)$. Instead, it observes a behavioral proxy $\hat{q}_t$, which is a composite signal derived from the mixture of actor types. We define the demand proxy for product $i$ at epoch $t$ as a weighted aggregation of events:
 \begin{equation}
@@ -148,7 +148,10 @@ Reproducible results are key to quality research platforms, this is taken into m
 \subsubsection{Online Dynamic Pricing}
 
 In order to collect data from actors under correct conditions we replicate a naive and simple dynamic pricing algorithm which runs in the background during the experiments.
-The dynamic pricing done is handled by a pipeline which computes a demand estimate on a per-product basis of a specific window of the data, defined by the period $T$ which by default is 5 minutes. This dynamic pricing pipeline computes a demand estimate vector $\hat{q} \in \mathbb{R}^N$ by a weighted sum of interactions for each product, it additionally computes a price elasticity vector $\hat{\epsilon}$ in the same dimensions as our demand. The final features matrix is of the size $N \times 2$ which we translate to a new price vector $\hat{p} \in \mathbb{R}^N$. The transformation that governs this dynamic pricing is a very simple surge-based pricing (a special case of our later defined policy $\pi$):
+Dynamic pricing is handled by a pipeline that computes a per-product demand estimate over a window of the data, defined by the period $T$, which defaults to 5 minutes. The pipeline computes a demand estimate vector $\hat{q} \in \mathbb{R}^N$ as a weighted sum of interactions for each product, and additionally computes a price elasticity vector $\hat{\epsilon}$ of the same dimension. The resulting feature matrix has size $N \times 2$, which we map to a new price vector $\hat{p} \in \mathbb{R}^N$.
+
+
+The transformation that governs this dynamic pricing is a very simple surge-based pricing (a special case of our later defined policy $\pi$):
 
 \begin{equation}
 \hat{p}_i = \begin{cases}
@@ -183,7 +186,7 @@ The human data collection involved 18 participants, all of whom provided explici
 
 To evaluate quality and realism of the setup, we store both structured event logs and full interaction transcripts. This lets us combine quantitative analysis with transcript-level qualitative findings. The result is an isolated system where we can control the interaction process while preserving realistic behavior.
 
-Operationally, goals and experiment runs are tracked in PostgreSQL (goal table, run table, and assignment mapping). This data-acquisition phase is the first half of the methodology and is intentionally a disconnected component that feeds the later contributions. The second half uses collected behavioral traces to separate classes $y \in \{A,H\}$ with session-conditioned probability estimates, then injects those estimates into the pricing learner.
+Operationally, goals and experiment runs are tracked in PostgreSQL (goal table, run table, and assignment mapping). This data-acquisition phase is the first half of the methodology and is intentionally a disconnected component that feeds the later contributions. The second half uses collected behavioral traces to separate classes $\theta \in \{A,H\}$ with session-conditioned probability estimates, then injects those estimates into the pricing learner.
 
 Our process follows three stages: (1) observe and \textit{vectorize} behavioral interactions, (2) learn separability to characterize human versus agent patterns, and (3) use the learned signal to train a defensive policy in a controlled dynamic-pricing simulator.
 
@@ -207,6 +210,7 @@ The simulator has multiple configurable factors. We design a multi-factor study
 % Power analysis plan: apply a two-sample Mann-Whitney U (or permutation test) on per-session (delta_H - delta_A) divergence scores comparing the human and agent groups. Compute minimum detectable effect size at alpha=0.05, power=0.8, given n=18 per group. Bootstrap confidence intervals on mean KL are a cleaner complement given the non-normality of divergence distributions.
 While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.
 
+% TODO: cite in the appendix the math to get to 160 petaflops of compute
 Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160 PFLOPS of aggregate compute, which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.
 
 \begin{table}[ht]
@@ -281,7 +285,7 @@ $\mathcal{A}_{\text{filter}}$ & \texttt{search}, \texttt{filter\_date}, \texttt{
 \end{table}
 
 This partition enables the weight function $\omega$ from Eq.~\ref{eq:qhat} to assign category-specific signal strengths, with $\omega(\mathcal{A}_{\text{cart}}) > \omega(\mathcal{A}_{\text{dwell}}) > \omega(\mathcal{A}_{\text{nav}}) > \omega(\mathcal{A}_{\text{filter}})$ reflecting decreasing commitment.
-
+It is important to acknowledge that this ordering encodes a strong assumption in the weighting; we motivate the scale of each weight by the observed per-category divergence between the behavioral profiles.
 In the simulator baseline this order is encoded with a compact fixed scale: cart $=4.0$, dwell $=2.0$, nav $=1.0$, filter $=0.5$. Unknown actions are mapped by prefix heuristics to the nearest category.
 
 The metadata record $\mu$ varies by action type. For product views, $\mu$ contains the observed price $p_{\text{obs}}$ and product attributes. For dwell events, $\mu$ includes the element text and accumulated hover duration. This heterogeneous structure is captured via a schema-on-read approach in our Kafka ingestion pipeline, where events are validated against type-specific schemas before storage.
@@ -289,13 +293,15 @@ The metadata record $\mu$ varies by action type. For product views, $\mu$ contai
 In addition to behavioral events, the platform logs price observations to a separate Kafka topic. Each price query generates a record $(i, p, \text{sid}, \phi, t)$ associating the product, displayed price, requesting session, platform mode, and timestamp. This dual-stream architecture enables joint analysis of price exposure and behavioral response.
 
 
+
+
 \subsection{Generative Contamination and Separability}
 
 To train a robust pricing learner, we need a simulator that can generate realistic interaction data under controlled contamination. We build this from Phantom data using a two-stage approach.
 
 
 \subsubsection{Ground-Truth Separability}
-Because sessions are collected under controlled experimental conditions where each actor is assigned a known type at the start of the trial, labels $y_s \in \{H, A\}$ are available as ground truth rather than as the output of a heuristic classifier. We therefore estimate separate transition kernels directly from each labeled partition $\mathcal{D}_H$ and $\mathcal{D}_A$, treating the resulting $\hat{\mathcal{T}}_H$ and $\hat{\mathcal{T}}_A$ as the ground-truth behavioral profiles for each class. We then ask a direct methodological question: are the kernels separable enough to justify downstream pricing control that depends on that separability?
+Because sessions are collected under controlled experimental conditions where each actor is assigned a known type at the start of the trial, labels $\theta_s \in \{H, A\}$ are available as ground truth rather than as the output of a heuristic classifier. We therefore estimate separate transition kernels directly from each labeled partition $\mathcal{D}_H$ and $\mathcal{D}_A$, treating the resulting $\hat{\mathcal{T}}_H$ and $\hat{\mathcal{T}}_A$ as the ground-truth behavioral profiles for each class. We then ask a direct methodological question: are the kernels separable enough to justify downstream pricing control that depends on that separability?
 
 To answer this, we compute average KL divergence between transition probability matrices. This statistic gives global separability and event-level diagnostics at the same time. To test whether the observed between-class value exceeds finite-sample estimation noise, we compute an intra-class bootstrap baseline by repeatedly splitting $\mathcal{D}_H$ and $\mathcal{D}_A$ into two random halves, fitting a transition kernel on each half, and re-computing the same average KL statistic for each split.
 
@@ -303,7 +309,7 @@ Formally, for $B$ bootstrap splits per class we obtain reference samples $\{d_{H
 \begin{equation}
 \hat p = \frac{1 + \sum_{j=1}^{2B}\mathbf{1}\{d_j^{\text{intra}} \ge d^{\text{inter}}\}}{2B + 1},
 \end{equation}
-which gives a direct significance check for separability before using divergence-derived control signals in pricing.
+which gives a direct significance check for separability before using divergence-derived centroid control signals in pricing.
 
 \begin{definition}[Kullback-Leibler Divergence for Transition Distributions]
 Let $P_e$ and $Q_e$ be categorical distributions over destination states following event $e$, derived from human and agent trajectories respectively. The KL divergence between these distributions is:
@@ -346,9 +352,6 @@ To scale this to catalog-level pricing, we expand the base event transition matr
   \end{figure}
 
 
-\subsection{Second-Stage Classification}
-After contamination, we run a second classification stage. We remap events into a semantically aligned feature space, apply richer feature engineering, and retrain to obtain cleaner label probabilities across the full dataset. This classifier is then used directly in the reinforcement-learning reward structure.
-
 \subsection{Distributionally Robust Reinforcement Learning (DR-RL)}
 
 We formulate pricing as a Stackelberg game: the platform (leader) sets prices $p_t$, and the population (follower) responds through trajectories and demand. A useful intuition is that the platform behaves like a distorted mirror at a 45-degree angle: what it mirrors is population demand into an estimated demand proxy, and that proxy drives revenue.
@@ -383,6 +386,44 @@ For the current engine baseline, we use a compact inner-robust approximation by
 and we evaluate a small fixed grid in $\mathcal{A}_{\epsilon_\alpha}(\alpha_0)$ per step, selecting the worst-case candidate for the learner.
 % A proper Wasserstein ball implementation over the full demand distribution (rather than a scalar alpha interval) would use the POT library (Python Optimal Transport): compute W_2 between the empirical reference P_hat and each candidate Q using ot.emd2() or ot.sliced_wasserstein_distance() for scalability, then accept only candidates within epsilon. In practice the inner minimization becomes: candidates = [G(alpha) for alpha in linspace]; dists = [ot.emd2(p_hat, q, M) for q in candidates]; worst = candidates[argmin(reward[dists <= epsilon])]. The current grid-on-alpha approximation is a computationally cheap substitute; moving to a true Wasserstein ball would tighten the worst-case guarantee but requires specifying the ground metric M over the demand space.
 
+
+\subsubsection{Environment Setup for Dynamic Pricing}
+The complete pricing-demand-trajectory loop is illustrated in Figure~\ref{fig:oracle_flow}. The Oracle maps historical price and demand state to a new price vector, which is exposed to a distribution of demand curves. Each product generates trajectories weighted by behavioral kernels $\tau_\theta$, producing a full transition matrix $\tau'$ over sessions. Sampled trajectories $\{\tau_k\}$ are aggregated through the demand proxy function $\hat{Q}(\cdot)$ to yield the next demand vector, which feeds back into the Oracle.
+
+\begin{figure}[ht]
+\centering
+\[\begin{gathered}
+\text{Oracle}(\vec{p}_{t-1},\vec{\hat{q}})\to
+\begin{pmatrix}
+p_1\\
+p_2\\
+\vdots\\
+p_N
+\end{pmatrix}
+\xrightarrow{d_i \sim \mathcal{N}_{\vec{p}}}
+\begin{pmatrix}d_1\\ d_2\\ \vdots \\ d_N\end{pmatrix}
+\xrightarrow{\vec{d}\times \tau_\theta \to \tau^\prime}
+\begin{bmatrix}
+0.01 & 0.02 & \cdots & 0.3 \\
+0.41 & 0.24 & \cdots & 0.0 \\
+\vdots & \vdots & \ddots & \vdots \\
+0.51 & 0.09 & \cdots & 0.1 \\
+\end{bmatrix}
+\xrightarrow{\tau_k \sim \tau^\prime}
+\{\tau_k\}_{k=0}^K \to \hat{Q}(\tau_k)
+\\
+\to \begin{pmatrix}
+\hat{q}_1 \\
+\hat{q}_2 \\
+\vdots \\
+\hat{q}_N
+\end{pmatrix}
+\to \text{Oracle}(\cdot)
+\end{gathered}\]
+\caption{Oracle-based pricing loop: historical price and demand state map to a new price vector; each product samples demand curves from $\mathcal{N}_{\vec{p}}$; trajectories are generated by mixing demand with behavioral kernels $\tau_\theta$ into transition matrix $\tau'$; sampled trajectories $\{\tau_k\}$ aggregate through proxy $\hat{Q}(\cdot)$ to yield updated demand $\vec{\hat{q}}$, closing the feedback loop.}
+\label{fig:oracle_flow}
+\end{figure}
+
 \subsubsection{The Min-Max Objective}
 The robust policy $\pi^*$ is obtained by solving the maximin problem:
 \begin{equation}
diff --git a/paper/src/chapters/04-results.tex b/paper/src/chapters/04-results.tex
index b244efd..f541a55 100644
--- a/paper/src/chapters/04-results.tex
+++ b/paper/src/chapters/04-results.tex
@@ -6,9 +6,11 @@
     \label{fig:supra_heatmap}
 \end{figure}
 
+
+
 \subsection{Behavioral Analysis}
 
-The transition-kernel analysis is evaluated with both between-class divergence and an intra-class bootstrap null baseline. This allows us to separate real behavioral differences from finite-sample estimation noise.
+The transition-kernel analysis is evaluated with both between-class divergence and an intra-class bootstrap null baseline. This allows us to separate real behavioral differences from finite-sample estimation noise and bias.
 
 \begin{table}[ht]
 \centering
@@ -25,7 +27,9 @@ Agent intra-class split & 1.2065 & 1.2607 & 0.2177 & 4.2345 \\
 \end{tabular}
 \end{table}
 
-For this run ($n_H=11$, $n_A=7$, $B=100$), the pooled lift ratio is $2.84\times$ and the empirical one-sided p-value is $0.0149$, both computed as defined in Section~\ref{sec:tpe}. This places the between-class divergence clearly above the intra-class null and supports the use of divergence-derived contamination signals in downstream pricing control.
+For this run ($n_H=11$, $n_A=7$, $B=100$), the empirical p-value is $0.0149$, computed as defined in Section~\ref{sec:tpe}. This places the between-class divergence clearly above the intra-class null and supports the use of divergence-derived contamination signals in downstream pricing control.
+
+% TODO: instead could we do a simple t test to see the difference in the means in some way? That way we can yield a P value
 
 
 \subsection{Experimental Outcomes}
diff --git a/paper/src/chapters/05-discussion.tex b/paper/src/chapters/05-discussion.tex
index 6cd6362..51f6600 100644
--- a/paper/src/chapters/05-discussion.tex
+++ b/paper/src/chapters/05-discussion.tex
@@ -1,18 +1,20 @@
 \section{Discussion}
 
+
+
 \subsection{Transition to Agentic Market Microstructure}
 
 Our analysis of the interaction dynamics between the platform and non-human actors suggests that the current static pricing models are insufficient for an agent-mediated economy. If we assume a transition toward a direct revelation mechanism, where actors must reveal their true valuation of a good through bidding dynamics, we inevitably introduce significant stochasticity into the pricing system. Unlike traditional e-commerce where prices are relatively sticky, such a mechanism implies a high volatility characteristic of financial equity markets (without the fungability however).
 
-However, ecommerce commodities differ fundamentally from financial securities: they possess a hard floor defined by unit economics and reservation prices. The market might react enthusiastically to an iPhone priced at \$1, such a transaction is not permissible. The platform must establish an initial valuation anchor ($P_{0}$) defined by the marginal cost plus a target margin, around which the market price is permitted to fluctuate. We propose the introduction of GenAI Agents as Institutional Market Makers.
-
-This is also under the assumption of expected transactional capabilities being given to AI Agents.
+However, e-commerce commodities differ fundamentally from financial securities: they possess a hard floor defined by unit economics and reservation prices. Although the market might react enthusiastically to an iPhone priced at \$1, such a transaction is not permissible. The platform must establish an initial valuation anchor ($P_{0}$) defined by the marginal cost plus a target margin, around which the market price is permitted to fluctuate. We float the introduction of GenAI Agents as Institutional Market Makers. As the arms race for greater autonomy of agentic systems grows, the commercial viability of AI agents has the potential to extend to everyday users interacting directly with them rather than with e-commerce platforms. This also assumes that AI agents are granted the expected transactional capabilities.
 
 
 
 \subsection{Risk Assessment and Limitations}
 
-Acknowledge risks and constraints and data sizes.
+This technology has a more bitter side: ethical concerns arise from deploying black-box solutions that set prices based on behavioral attributes. Approaches such as universal behavioral profile modeling (UBPM), used in recommendation systems, are already widely deployed.
+
+A system like this is exposed to strong drift, given the rapid advance of agentic systems and shifting user preferences. Our intent in adding the UX term to the reward shaping process was to further address the risk of degraded user experience. Looking deeper at the underlying methodology, reinforcement learning comes with its own complications, such as reward hacking and a frequent lack of interpretability, which is critical in systems that strongly affect a company's revenue.
 
 \subsection{Implications of Findings}
 
diff --git a/paper/src/chapters/06-conclusion.tex b/paper/src/chapters/06-conclusion.tex
index c698e82..67cf0c6 100644
--- a/paper/src/chapters/06-conclusion.tex
+++ b/paper/src/chapters/06-conclusion.tex
@@ -1,8 +1,11 @@
 \section{Conclusion}
 
+For our troubles, we now conclude that...
+
 \subsection{Summary of contributions}
-Restate the thesis and key findings with validation of research objectives.
+The author's contribution was not without the advice of many experienced experts in the field. We thank (NAME), the director of innovation at Microsoft, for the initial critical discussion on the topic of dynamic pricing systems and the spark which has led to this work. We thank Eugene Bykovets for pointing out the parallels in blockchain systems and the complexity of anonymous interaction and understanding of intent. Importantly, we acknowledge the contributions of Alverto Martin, my academic advisor, for the support and for taking on the challenge of this ambitious work. Many breakthroughs were thanks to numerous discussions with my peers on the topics covered here.
+We also thank the head of innovation at Amadeus for insight into the industry split on the topic of collapsing margins. Finally, we acknowledge the use of generative AI technologies for in-depth research, rapid prototyping, and surfacing of key topics and niches.
 
 \subsection{Future Works and Next Steps}
 
-Identify the research gaps here and potential business implications and setup of business + Proposed extensions and a long term agenda.
+During the eight months of research dedicated to this work, a plethora of opportunities and industry gaps were identified; sadly, most of them could not be addressed directly.
diff --git a/paper/src/main.tex b/paper/src/main.tex
index c350741..45400ff 100644
--- a/paper/src/main.tex
+++ b/paper/src/main.tex
@@ -45,6 +45,13 @@ These behavioral signals serve as inputs for a Distributionally Robust Reinforce
 \begin{description}
 \item[Agent $A$] An actor of non-human nature, powered by an LLM.
 \item[Human $H$] An individual human with some job to be done.
+\item[Actor $\theta$] An entity of class Agent or Human that has the capability to carry out actions on a web platform.
+\item[Platform] Any web-based platform which serves an interface to a collection of items that can be purchased, each at some price $p_i$.
+\item[Behavioral Model] A mathematical model predicting what action comes after a series of prior actions.
+\item[LLM] Large Language Model served by some provider with the abstracted capability of tool calling.
+\item[TPU] Tensor Processing Unit, a chip architecture developed by Google for accelerating machine-learning workloads.
+\item[Trajectory] A sequence of unspecified length recording the states of some object over time.
+% TODO: maybe define other things in a similarly succinct manner
 \end{description}
 % \input{../build/concatenated_code}