From 02328b20f2917650a46bdff993b8dd4fe16cc37f Mon Sep 17 00:00:00 2001
From: Daniel Rosel <daniel@alves.world>
Date: Thu, 9 Apr 2026 10:17:53 +0200
Subject: [PATCH] feat: adding clarity and rewording

---
 paper/src/chapters/03-methodology.tex       | 21 +++++++++------------
 paper/src/chapters/04-results.tex           |  4 ++--
 paper/src/chapters/05-discussion.tex        |  2 +-
 paper/src/main-genpop.tex                   |  9 +++++++++
 paper/src/main.tex                          | 14 ++++++++++++++
 paper/src/mirrors/genpop/03-methodology.tex |  4 ++--
 6 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/paper/src/chapters/03-methodology.tex b/paper/src/chapters/03-methodology.tex
index e98e27f..fce3c71 100644
--- a/paper/src/chapters/03-methodology.tex
+++ b/paper/src/chapters/03-methodology.tex
@@ -221,9 +221,9 @@ To speak to realism, user interviews reported that the platform architecture mir
 The dynamic pricing mechanism elicited immediate behavioral adjustments. Participants were sensitive to price volatility: sudden boosts triggered urgency and faster booking attempts, while large listing-to-final discrepancies triggered deeper comparison behavior. The responses match what one expects from live commerce: sharp reactions to volatility and to list--checkout gaps, which supports external validity despite the lab setting.
 
 
-\subsubsection{Design of Training Factorial Study}
+\subsubsection{Design of Training Sweeps}
 
-The simulator has multiple configurable factors. We design a multi-factor study across five axes derived from the sweep configurations: (1) RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}; 4 levels), (2) contamination ratio $\alpha$ sampled from $[0.1, 0.6]$ at four representative levels, (3) robustness radius $\epsilon_\alpha \in \{0.0, 0.15, 0.3\}$ (3 levels), (4) COI penalty weight $\lambda_\text{coi}$ at two reference levels, and (5) pricing action granularity (two discretization settings for \texttt{action\_levels}); giving a grid of $4\times4\times3\times2\times2 = 192$ configurations. Behavioral distinguishability is assessed with a two-sample Mann--Whitney test on per-session divergence gap scores at cohort sizes $n_H=13$ and $n_A=16$.
+The simulator has multiple configurable factors. Training runs are driven by Weights \& Biases sweep definitions versioned with the codebase, mixing random and grid schedules rather than a single full factorial. For the contamination ratio $\alpha$, exploratory sweeps draw $\alpha$ uniformly on $[0.1,0.6]$; some sweeps use the narrower interval $[0.1,0.5]$. Grid sweeps fix explicit level sets, for example $\alpha\in\{0.1,0.2,0.3,0.4,0.6,0.8\}$ (six levels, including $0.8$ beyond the typical exploratory upper endpoint) or five levels $\{0.1,0.2,0.3,0.4,0.6\}$. Auxiliary schedules also include $\alpha=0$ alongside positive values. Robustness radius $\epsilon_\alpha$, COI penalty $\lambda_\text{coi}$, RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}), and the discretization of the price action grid vary by sweep. Broad random search may use uniform $\epsilon_\alpha\in[0,0.3]$ and $\lambda_\text{coi}\in[0.05,0.6]$; tighter grids may fix $\epsilon_\alpha=0.2$ and restrict $\lambda_\text{coi}$ to $\{0.15,0.30\}$. Behavioral distinguishability is assessed with a two-sample Mann--Whitney test on per-session divergence gap scores at cohort sizes $n_H=13$ and $n_A=16$.
 While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.
 
 Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160\,PFLOPS of aggregate compute (derivation in Appendix~\ref{app:compute_budget}), which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.
@@ -261,8 +261,9 @@ v4 & 64 (32 + 32) & us-central2-b & 32 Spot + 32 On-demand \\
 \end{tabular}
 \end{table}
 
-For connections from Madrid, we prioritize the europe-west4 allocation for latency-sensitive runs with the benefit of having the most grouped chips within a single region. This regional grouping is important for the deployment of our Kubernetes cluster which cannot span multiple regions. All sweep metadata, model checkpoints, and reward traces are logged in Weights \& Biases. % TODO: cite this (from bib)
-Hardware specifications are from the official Google Cloud TPU documentation \parencite{noauthor_tpu_2026,noauthor_tpu_2025-1,noauthor_tpu_2025}.
+For connections from Madrid, we prioritize the europe-west4 allocation, both for latency and because it groups the most chips within a single region. This regional grouping is important for the deployment of our Kubernetes cluster, which cannot span multiple regions. All sweep metadata, model checkpoints, and reward traces are logged in Weights \& Biases. Hardware specifications are from the official Google Cloud TPU documentation \parencite{noauthor_tpu_2026,noauthor_tpu_2025-1,noauthor_tpu_2025}.
+% TODO: cite this (from bib)
+
 
 Training images follow Docker layer caching: dependency layers are separate from the copy of application source so routine code edits do not invalidate the entire build; only changes to the training entrypoint or dependencies force a full rebuild.
 
@@ -395,18 +396,14 @@ The session-level control signal injected into pricing is then
 
 This turns distinguishability into an operational control input in the engine. On a per-customer or use-case basis, a similar data collection and fitting process should be repeated to obtain domain-specific behavior kernels.
 
-In implementation we keep an alternating game-history buffer and advance it each epoch with two transitions: the platform publishes a price vector (leader move), then the environment returns trajectory-derived demand (follower move). The codebase names this structure \textit{Limbo}; the appendix lists it under the same label for readers who inspect the repository.
+In implementation we keep an alternating game-history buffer and advance it each epoch with two transitions: first the platform publishes a price vector (leader move), then the environment returns trajectory-derived demand (follower move). We call this structure \textit{Limbo}.
 
 To avoid notation drift, we separate two COI objects used for different purposes:
 \begin{align}
-\text{COI}_{\text{level}}(\pi) &= \mathbb{E}[P]-\underline{p} \quad \text{(global reporting KPI)} \\
+\text{COI}_{\text{level}}(\pi) &= \mathbb{E}[P]-\underline{p}\\
 \text{COI}_{\text{leak}}(p,\tau') &= f(\tau')\cdot \text{InfoValue}(p,\tau') \quad \text{(local control penalty)}
 \end{align}
-where $\text{COI}_{\text{level}}$ is evaluated at policy level and $\text{COI}_{\text{leak}}$ is evaluated per observed quote during training. We connect local leakage to expected global erosion with the operational assumption
-\begin{equation}
-\mathbb{E}[\Delta\text{COI}_{\text{level},t} \mid \tau_t'] \approx -\kappa\,\text{COI}_{\text{leak}}(p_t,\tau_t') + \xi_t,
-\end{equation}
-where $\kappa>0$ and $\xi_t$ is residual noise. This keeps theorem-level COI erosion (global, asymptotic) distinct from training-time leakage control (local surrogate).
+where $\text{COI}_{\text{level}}$ is evaluated at policy level and $\text{COI}_{\text{leak}}$ is evaluated per observed quote during training. The $\text{InfoValue}$ term is defined later, in the discussion of the reward structure.
 
 % Mention discretized action space and the clipping and over shotting in continuous action spaces
 % Also talk about catastrophic economics, we add termination on bankrupcy or zero demand so market collaps
@@ -481,7 +478,7 @@ In practice, we parameterize this with a session-level leakage term:
 \begin{equation}
 \text{COI}_{\text{leak}}(p,\tau') = f(\tau')\cdot \text{InfoValue}(p,\tau')
 \end{equation}
-where $f(\tau')$ is the weak agent probability and $\text{InfoValue}$ is implemented either as a constant query-tax surrogate or as a revelation surrogate $-\log\pi(p\mid\tau')$.
+where $f(\tau')$ is the weak agent probability and $\text{InfoValue}$ is implemented either as a constant query-tax surrogate or as a revelation surrogate $-\log\pi(p\mid\tau')$. In the latter case, leakage is \emph{contamination-weighted surprisal}: $f(\tau')$ scales how much we treat the session as agentic, and $-\log\pi(p\mid\tau')$ scores how unexpected the realized quote is under the policy's own distribution over prices. Appendix~\ref{app:revelation_log} records why the logarithm is the conventional choice for that second factor.
 
 The inner minimization selects the contamination candidate that makes the penalized reward smallest, so the outer policy update faces the worst plausible leakage scenario inside the ambiguity set rather than an average case.
 
diff --git a/paper/src/chapters/04-results.tex b/paper/src/chapters/04-results.tex
index 2dce74b..1dc220d 100644
--- a/paper/src/chapters/04-results.tex
+++ b/paper/src/chapters/04-results.tex
@@ -2,7 +2,7 @@
 \begin{figure}[ht]
     \centering
     \input{chapters/figures/supra/supra.tex}
-    \caption{Evolution of price distributions over experiment steps. The heatmap illustrates the density of price offerings. This is an early baseline simulation which demonstrates supra-competitive price-setting in deep learning agents such as SAC as can be clearly seen by the high density at the highest available price.}
+    \caption{Evolution of price distributions over experiment steps. The heatmap illustrates the density of price offerings. This is an early baseline simulation that demonstrates supra-competitive price-setting in deep learning agents such as Soft Actor-Critic (SAC), visible as the high density at the highest available price.}
     \label{fig:supra_heatmap}
 \end{figure}
 
@@ -44,7 +44,7 @@ The contamination--revenue slope is estimated on a controlled cohort (single swe
 
 \begin{table}[ht]
 \centering
-\caption{Slope verification table for contamination versus revenue (OLS-style report).}
+\caption{Slope verification table for contamination versus revenue.}
 \label{tab:contamination_slope_table}
 \begin{tabular}{@{}lrrrrr@{}}
 \toprule
diff --git a/paper/src/chapters/05-discussion.tex b/paper/src/chapters/05-discussion.tex
index fa5ad55..ad7ff47 100644
--- a/paper/src/chapters/05-discussion.tex
+++ b/paper/src/chapters/05-discussion.tex
@@ -6,7 +6,7 @@
 
 Our analysis of interaction dynamics between the platform and non-human actors suggests that static posted-price models are a weak match for an economy in which software agents mediate search and purchase. If one pushes toward direct-revelation or auction-like pricing, volatility rises: prices behave more like traded claims than like sticky retail quotes, though without the fungibility of securities.
 
-E-commerce goods differ from financial assets in a hard way: unit economics and reservation values set a floor. The market might ``want'' an iPhone at \$1; the platform cannot honor that. Pricing therefore needs an anchor $P_{0}$ (cost plus target margin) around which offers may move. In that setting, large language model (LLM) agents resemble institutional liquidity providers: they quote, probe, and clear subsets of flow. As autonomy of agentic systems increases, end users may delegate browsing and checkout to assistants rather than to retailer sites directly, which shifts where demand signals originate. The scenario presumes agents eventually hold delegated payment authority; until then, our results bound a near-term reconnaissance-heavy regime.
+E-commerce goods differ from financial assets in a hard way: unit economics and reservation values set a floor. The market might ``want'' an iPhone at \$1, but the floor rules that price out. Pricing therefore needs an anchor $P_{0}$ (cost plus target margin) around which offers may move. In that setting, large language model (LLM) agents resemble institutional liquidity providers: they quote, probe, and clear subsets of flow. As autonomy of agentic systems increases, end users may delegate browsing and checkout to assistants rather than to retailer sites directly, which shifts where demand signals originate. The scenario presumes agents eventually hold delegated payment authority; until then, our results bound a near-term reconnaissance-heavy regime.
 
 \subsection{Risk Assessment and Limitations}
 \label{sec:limitations_risks}
diff --git a/paper/src/main-genpop.tex b/paper/src/main-genpop.tex
index 1c8fb1e..4a53f55 100644
--- a/paper/src/main-genpop.tex
+++ b/paper/src/main-genpop.tex
@@ -91,4 +91,13 @@ The textbook definition $D_{\mathrm{KL}}(P\parallel Q)=\sum_k P(k)\log(P(k)/Q(k)
 
 In code we do the boring fix: add a tiny floor $\varepsilon$ to both the numerator and denominator inside the log so nothing is exactly zero, which turns the sum into a finite, smoothed surrogate rather than a literal KL to raw counts. We also skip source states that do not exist at all in the reference kernel, because there is nowhere honest to compare against. This keeps the pipeline running and the divergence scores on a comparable scale, at the cost that the number is regularized KL-ish behavior, not a purist information-theoretic quantity---which is acceptable here because we only use the gap between human-anchored and agent-anchored scores as a weak separability signal, not as a calibrated physical constant.
 
+\section{Why the logarithm appears in the revelation surrogate}
+\label{app:revelation_log}
+
+Recall that $\text{COI}_{\text{leak}}(p,\tau') = f(\tau')\cdot\text{InfoValue}(p,\tau')$. The query-tax surrogate fixes $\text{InfoValue}$ to a positive constant: every suspected reconnaissance quote is penalized equally, which tracks the erosion story where independent query volume drives COI to zero. The revelation surrogate instead sets $\text{InfoValue}(p,\tau') = -\log \pi(p\mid\tau')$, where $\pi(\cdot\mid\tau')$ is the pricing policy's distribution over quoted prices in context $\tau'$ (after whatever discretization or binning the engine uses).
+
+For an outcome with probability $q$, the quantity $-\log q$ is \emph{surprisal}: likely draws are unsurprising, rare draws are highly surprising. That matches the informal ``surprise'' people talk about in recommender systems when they formalize novelty as low predicted probability---here the model is our own policy. The log is the standard information-theoretic way to turn ``how probable was this draw?'' into a penalty that grows sharply in the tails. In the reconnaissance reading, a price from a thin slice of the policy's support is more identifying than a typical quote.
+
+So the revelation form is \emph{contamination-weighted surprisal}: $f(\tau')$ scales how agent-like we judge the session, and $-\log\pi(p\mid\tau')$ scales how informative that price is relative to $\pi(\cdot\mid\tau')$. In code you still floor $\pi(p\mid\tau')$ away from zero so tail bins do not explode the penalty, same spirit as Appendix~\ref{app:kl_zeros}.
+
 \end{document}
diff --git a/paper/src/main.tex b/paper/src/main.tex
index c743f68..ec79886 100644
--- a/paper/src/main.tex
+++ b/paper/src/main.tex
@@ -123,6 +123,20 @@ The textbook definition $D_{\mathrm{KL}}(P\parallel Q)=\sum_k P(k)\log(P(k)/Q(k)
 In code we do the basic fix: add a tiny floor $\varepsilon$ to both the numerator and denominator inside the log so nothing is exactly zero, which turns the sum into a finite, smoothed surrogate rather than a literal KL to raw counts. We also skip source states that do not exist at all in the reference kernel, because there is nowhere honest to compare against. This keeps the pipeline running and the divergence scores on a comparable scale, at the cost that the number is regularized KL behavior, not a purist information-theoretic quantity, which is acceptable here because we only use the gap between human-anchored and agent-anchored scores as a weak separability signal.
 
 
+\section{Why the logarithm appears in the revelation surrogate}
+\label{app:revelation_log}
+
+Recall that $\text{COI}_{\text{leak}}(p,\tau') = f(\tau')\cdot\text{InfoValue}(p,\tau')$. The query-tax surrogate fixes $\text{InfoValue}$ to a positive constant: every suspected reconnaissance quote is penalized equally, which tracks the erosion theorem where independent query volume drives COI to zero. The revelation surrogate instead sets
+\begin{equation}
+\text{InfoValue}(p,\tau') = -\log \pi(p\mid\tau'),
+\end{equation}
+where $\pi(\cdot\mid\tau')$ is the pricing policy's distribution over quoted prices in context $\tau'$ (after whatever discretization or binning the engine uses).
+
+For an outcome that occurs with probability $q$, the quantity $-\log q$ is the usual \emph{surprisal}: likely draws have small surprisal, rare draws have large surprisal. That is the same ``surprise'' people import into recommender systems when they formalize novelty as low predicted probability under a model---here the model is our own policy. The log is not decorative: it is the standard information-theoretic coding of ``how unexpected was this draw under $\pi$?'' In the reconnaissance reading, a quote from a thin slice of the policy's support is more identifying than a modal quote, because it pins down what the rule is willing to do in places where little mass sits.
+
+Put together, the revelation form is \emph{contamination-weighted surprisal}: $f(\tau')$ scales how agent-like we judge the session, and $-\log\pi(p\mid\tau')$ scales how informative that realized price is relative to $\pi(\cdot\mid\tau')$. In implementation we still floor $\pi(p\mid\tau')$ away from zero so that tail bins do not produce an unbounded penalty. As in Appendix~\ref{app:kl_zeros}, the result is a regularized surrogate, not a literal infinite penalty.
+
+
 % \input{../build/concatenated_code}
 
 \end{document}
diff --git a/paper/src/mirrors/genpop/03-methodology.tex b/paper/src/mirrors/genpop/03-methodology.tex
index 505aa20..d6595e1 100644
--- a/paper/src/mirrors/genpop/03-methodology.tex
+++ b/paper/src/mirrors/genpop/03-methodology.tex
@@ -130,9 +130,9 @@ To speak to realism, user interviews reported that the platform architecture mir
 
 The dynamic pricing mechanism elicited immediate behavioral adjustments. Participants were sensitive to price volatility: sudden boosts triggered urgency and faster booking attempts, while large listing-to-final discrepancies triggered deeper comparison behavior. This is comforting because the controlled setup still produces commercially relevant interaction data.
 
-\subsubsection{Design of Training Factorial Study}
+\subsubsection{Design of Training Sweeps}
 
-The simulator has multiple configurable factors. We design a multi-factor study across five axes derived from the sweep configurations: (1) RL algorithm (PPO, A2C, DQN, Q-table; 4 levels), (2) contamination ratio sampled at four representative levels between 0.1 and 0.6, (3) robustness radius (3 levels), (4) COI penalty weight at two reference levels, and (5) pricing action granularity (two discretization settings for action levels); giving a grid of 192 configurations. Behavioral distinguishability is assessed with a two-sample Mann--Whitney test on per-session divergence gap scores at cohort sizes $n_H=13$ and $n_A=16$.
+The simulator has multiple configurable factors. Training runs are driven by Weights \& Biases sweep definitions versioned with the codebase, mixing random and grid schedules rather than a single full factorial. For the contamination ratio $\alpha$, exploratory sweeps draw $\alpha$ uniformly on $[0.1,0.6]$; some sweeps use the narrower interval $[0.1,0.5]$. Grid sweeps fix explicit level sets, for example $\alpha\in\{0.1,0.2,0.3,0.4,0.6,0.8\}$ (six levels, including $0.8$ beyond the typical exploratory upper endpoint) or five levels $\{0.1,0.2,0.3,0.4,0.6\}$. Auxiliary schedules also include $\alpha=0$ alongside positive values. Robustness radius $\epsilon_\alpha$, COI penalty $\lambda_\text{coi}$, RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}), and the discretization of the price action grid vary by sweep. Broad random search may use uniform $\epsilon_\alpha\in[0,0.3]$ and $\lambda_\text{coi}\in[0.05,0.6]$; tighter grids may fix $\epsilon_\alpha=0.2$ and restrict $\lambda_\text{coi}$ to $\{0.15,0.30\}$. Behavioral distinguishability is assessed with a two-sample Mann--Whitney test on per-session divergence gap scores at cohort sizes $n_H=13$ and $n_A=16$.
 
 While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.