chore: refactoring, proper citation and updating on data and refs and apendices

2026-03-15 21:15:23 +01:00
parent 0521a63937
commit 375445f260
5 changed files with 82 additions and 118 deletions

View File

@@ -645,3 +645,26 @@ What might be more surprising is that even when we adjust the temperature down t
 year = {2025},
 file = {Snapshot:/home/velocitatem/Zotero/storage/U5JG4CNM/defeating-nondeterminism-in-llm-inference.html:text/html},
 }
+@misc{moritz_ray_2018,
+title = {Ray: {A} {Distributed} {Framework} for {Emerging} {AI} {Applications}},
+shorttitle = {Ray},
+url = {http://arxiv.org/abs/1712.05889},
+doi = {10.48550/arXiv.1712.05889},
+abstract = {The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray---a distributed system to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system's control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.},
+urldate = {2026-03-13},
+publisher = {arXiv},
+author = {Moritz, Philipp and Nishihara, Robert and Wang, Stephanie and Tumanov, Alexey and Liaw, Richard and Liang, Eric and Elibol, Melih and Yang, Zongheng and Paul, William and Jordan, Michael I. and Stoica, Ion},
+month = sep,
+year = {2018},
+note = {arXiv:1712.05889 [cs]},
+keywords = {Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing},
+file = {Preprint PDF:/home/velocitatem/Zotero/storage/SUTDF5BP/Moritz et al. - 2018 - Ray A Distributed Framework for Emerging AI Applications.pdf:application/pdf;Snapshot:/home/velocitatem/Zotero/storage/5GV2DUAA/1712.html:text/html},
+}
+@misc{biewald_experiment_2020,
+title = {Experiment {Tracking} with {Weights} and {Biases}},
+url = {https://www.wandb.com/},
+author = {Biewald, Lukas},
+year = {2020},
+}

View File

@@ -246,7 +246,8 @@ v4 & 64 (32 + 32) & us-central2-b & 32 Spot + 32 On-demand \\
 \end{tabular}
 \end{table}
-For connections from Madrid, we prioritize the europe-west4 allocation for latency-sensitive runs with the benefit of having the most grouped chips within a single region. This regional grouping is important for the deployment of our Kubernetes cluster which cannot span multiple regions. All sweep metadata, model checkpoints, and reward traces are logged in Weights \& Biases. Hardware specifications are from the official Google Cloud TPU documentation \parencite{noauthor_tpu_2026,noauthor_tpu_2025-1,noauthor_tpu_2025}.
+For connections from Madrid, we prioritize the europe-west4 allocation for latency-sensitive runs with the benefit of having the most grouped chips within a single region. This regional grouping is important for the deployment of our Kubernetes cluster which cannot span multiple regions. All sweep metadata, model checkpoints, and reward traces are logged in Weights \& Biases. % TODO: cite this (from bib)
+Hardware specifications are from the official Google Cloud TPU documentation \parencite{noauthor_tpu_2026,noauthor_tpu_2025-1,noauthor_tpu_2025}.
 Design of training processes: we build the Docker image with layer caching in mind. To speed up rebuilds, we place the most volatile steps towards the end of the image build. In practice, this means that dependency installations are isolated, so edits to source code do not trigger their rebuild. Only when we update the entry point for training a sweep does Docker also rebuild the source-code copy stage.
@@ -388,8 +389,10 @@ The complete pricing-demand-trajectory loop is illustrated in Figure~\ref{fig:or
 \begin{figure}[ht]
 \centering
-\[
-\text{Oracle}(\vec{p}_{t-1},\vec{\hat{q}})\to
+{\setlength{\arraycolsep}{4pt}%
+\resizebox{0.98\linewidth}{!}{$
+\begin{aligned}
+&\text{Oracle}(\vec{p}_{t-1},\vec{\hat{q}})\to
 \begin{pmatrix}
 p_0\\
 p_1\\
@@ -398,14 +401,15 @@ p_N
 \end{pmatrix}
 \underrightarrow{d_i \sim \mathcal{N}_{\vec{p}}}
 \begin{pmatrix}d_0\\ d_1\\ \cdots \\ d_N\end{pmatrix}
-\underrightarrow{\vec{d}\times \tau_\theta \to \tau^\prime}
+\underrightarrow{\vec{d}\otimes \tau_\theta}
 \begin{bmatrix}
 0.01 & 0.02 & \cdots & 0.3 \\
 0.41 & 0.24 & \cdots & 0.0 \\
 \cdots & \cdots & \cdots & \cdots \\
 0.51 & 0.09 & \cdots & 0.1 \\
 \end{bmatrix}
-\underrightarrow{\tau_k \sim \tau^\prime}
+\\
+&\underrightarrow{\tau_k \sim \tau^\prime}
 \{\tau_k\}_{k=0}^K \to \hat{Q}(\tau_k)
 \to \begin{pmatrix}
 \hat{q}_0 \\
@@ -414,8 +418,10 @@ p_N
 \hat{q}_N \\
 \end{pmatrix}
 \to \text{Oracle}(\cdot)
-\]
-\caption{Oracle-based pricing loop: historical price and demand state map to a new price vector; each product samples demand curves from $\mathcal{N}_{\vec{p}}$; trajectories are generated by mixing demand with behavioral kernels $\tau_\theta$ into transition matrix $\tau'$; sampled trajectories $\{\tau_k\}$ aggregate through proxy $Q(\cdot)$ to yield updated demand $\vec{\hat{q}}$, closing the feedback loop.}
+\end{aligned}
+$}%
+}
+\caption{Oracle-based pricing loop: historical price and demand state map to a new price vector; each product samples demand curves from $\mathcal{N}_{\vec{p}}$; trajectories are generated via the Kronecker product $\vec{d}\otimes\tau_\theta$ into transition matrix $\tau'$; sampled trajectories $\{\tau_k\}$ aggregate through proxy $Q(\cdot)$ to yield updated demand $\vec{\hat{q}}$, closing the feedback loop.}
 \label{fig:oracle_flow}
 \end{figure}
@@ -498,7 +504,7 @@ The algorithm operates in discrete epochs indexed by $t$. At each epoch, the pla
 \subsection{Parallelization Strategy}
-To avoid preemption of compute mid-training, we settle on a v4-generation, 40-chip compute node with 5 parallel workers. The login node creates an orchestration node with Ray, and we distribute Ray compute nodes across the other workers.
+To avoid preemption of compute mid-training, we settle on a v4-generation, 40-chip compute node with 5 parallel workers. The login node creates an orchestration node with Ray \parencite{moritz_ray_2018}, and we distribute Ray compute nodes across the other workers.
 \subsubsection{Computational Cost Analysis of the Simulation Step}
 The per-step cost of Algorithm~\ref{alg:phantom_loop_clean} is not uniform across its components. To inform hardware provisioning and to identify where algorithmic improvements are most impactful, we profile the hot path of the engine using Python's \texttt{cProfile} instrumentation over 20 environment steps under two configurations: a baseline with the robustness inner loop disabled ($K=1$, $\epsilon_\alpha=0$) and a standard robust setting ($K=5$, $\epsilon_\alpha=0.2$). Both runs use $M=10$ sessions per market call and $N=3$ products.
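A profiling run of the kind described in this paragraph can be sketched with the Python standard library alone. The snippet is only an illustration under stated assumptions, not the project's harness: `simulate_step` and `profile_config` are hypothetical names, and the loop body is a toy stand-in for one environment step that merely mirrors the configuration symbols $K$, $M$, $N$, and $\epsilon_\alpha$ used in the text.

```python
import cProfile
import io
import pstats
import random


def simulate_step(K, M, N, eps_alpha, rng):
    """Hypothetical stand-in for one environment step (not the real engine)."""
    worst = float("inf")
    for _ in range(K):  # robustness inner loop over K candidates
        shift = rng.uniform(-eps_alpha, eps_alpha)
        total = 0.0
        for _ in range(M):  # M sessions per market call
            for _ in range(N):  # N products
                total += rng.random() + shift
        worst = min(worst, total)  # keep the worst-case candidate
    return worst


def profile_config(K, eps_alpha, steps=20, M=10, N=3, seed=0):
    """Profile `steps` environment steps and return a pstats.Stats object."""
    rng = random.Random(seed)
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(steps):
        simulate_step(K, M, N, eps_alpha, rng)
    profiler.disable()
    # Collect the report in memory instead of writing to stdout directly.
    return pstats.Stats(profiler, stream=io.StringIO())


baseline = profile_config(K=1, eps_alpha=0.0)  # inner loop disabled
robust = profile_config(K=5, eps_alpha=0.2)    # standard robust setting

# Show the five most expensive entries by cumulative time.
robust.sort_stats("cumulative").print_stats(5)
print(robust.stream.getvalue())
```

Comparing `baseline.total_calls` with `robust.total_calls` gives a first, coarse view of how much work the inner loop adds before reading the per-function breakdown.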

View File

@@ -57,11 +57,7 @@ At pair level (same seed, tier, and contamination), robust exceeds non-robust in
 \subsubsection{The Impact of Contamination on Revenue}
-A linear slope test on run-level data ($n=95$) shows a strong negative association between contamination and mean revenue. The fitted model is
-\[
-\widehat{\text{revenue}} = 326{,}878.57 - 60{,}631.95\,\alpha,
-\]
-with $t(93)=-8.2148$, $p=1.20\times 10^{-12}$, $R^2=0.4205$, and a 95\% confidence interval for the slope of $[-75{,}288.76,\,-45{,}975.13]$. In practical terms, a $+0.1$ increase in $\alpha$ corresponds to an average decrease of about $6{,}063$ revenue units. The full derivation (sample moments, least-squares coefficients, residual variance, standard error, test statistic, and confidence interval) is reported in Appendix~\ref{app:alpha_revenue_slope}.
+A linear slope test on run-level data ($n=95$) shows a strong negative association between contamination and mean revenue. The fitted model mapping $\alpha \to \text{revenue}$ results in $t(93)=-8.2148$, $p=1.20\times 10^{-12}$, $R^2=0.4205$, and a 95\% confidence interval for the slope of $[-75{,}288.76,\,-45{,}975.13]$. In practical terms, a $+0.1$ increase in $\alpha$ corresponds to an average decrease of about $6{,}063$ revenue units. Appendix~\ref{app:alpha_revenue_slope} gives a compact verification of these values using standard Python test routines.
 \subsection{Interpretation and Insights}

Binary file not shown (size before: 84 KiB; size after: 324 KiB).

View File

@@ -46,15 +46,44 @@ These behavioral signals serve as inputs for a Distributionally Robust Reinforce
 \appendix
 \section{Terminology}
 \begin{description}
-\item[Agent $A$] An actor of non-human nature, powered by an LLM.
-\item[Human $H$] An individual human with some job to be done.
-\item[Actor $\theta$] Defines a type of class which is either Agent or Human and has the capability to carry out actions on a web platform.
-\item[Platform] Any web-based platform which serves an interface to a collection of items that can be purchased, each at some price $p_i$.
-\item[Behavioral Model] A mathematical model predicting what action comes after a series of prior actions.
-\item[LLM] Large Language Model served by some provider with the abstracted capability of tool calling.
-\item[TPU] Tensor Processing Unit which is a unique kind of chip architecture developed by Google.
-\item[Trajectory] Defined as a series of unspecified length, collecting data on states of some object over time.
-% TODO: maybe define other things in a similar succient manner
+\item[Agent $A$] A non-human actor, typically an LLM-driven system that executes web actions toward a goal.
+\item[Human $H$] A human participant interacting with the platform to complete a task.
+\item[Actor Type $\theta$] A latent class parameter describing whether a session is generated by a human or an agent profile.
+\item[Platform] A web interface exposing purchasable items and their offered prices.
+\item[Session $s$] A bounded interaction record tied to one actor and one session identifier.
+\item[Event $e_{s,k}$] A single interaction tuple in a session, including action, item target, and timestamp.
+\item[Trajectory $\tau_s$] The ordered sequence of events generated within a session.
+\item[Demand Proxy $\hat{q}_{t,i}$] A weighted aggregate of observed actions used as an operational substitute for latent demand.
+\item[Action Weight Function $\omega(a)$] A mapping from action type to signal strength in the demand proxy.
+\item[True Demand $d(p;\theta)$] The latent purchase response as a function of price and actor type.
+\item[Contamination $\alpha$] The proportion of agent-generated traffic in the session mixture.
+\item[Non-stationary Noise $\epsilon_t$] Time-varying residual variation not explained by the actor mixture.
+\item[Pricing Policy $\pi(\tau)$] A function mapping observed interaction history to an offered price.
+\item[Cost of Information (COI)] The expected premium above the minimum viable price induced by the pricing policy.
+\item[COI Leakage] A per-quote penalty term modeling information revealed to reconnaissance behavior.
+\item[First-Order Statistic $p_{(1)}$] The minimum observed price among multiple independent queries.
+\item[Transition Kernel $\mathcal{T}$] A Markov transition matrix over behavioral states or actions.
+\item[Separability] The degree to which human and agent sessions can be distinguished from behavior alone.
+\item[KL Divergence $D_{KL}$] A relative-entropy measure used to compare session transition structure against class prototypes.
+\item[Divergence Scores $\Delta_H,\Delta_A$] Session-level distances to human and agent transition centroids.
+\item[Weak Agent Probability $f(\tau)$] A session-level score estimating the likelihood that a trajectory is agent-generated.
+\item[Contamination Generator $\mathcal{G}(\alpha)$] A simulator component that injects synthetic agent trajectories to reach a target mixture level.
+\item[Stackelberg Game] A leader-follower formulation where the platform sets prices and demand responds.
+\item[Ambiguity Set $\mathcal{U}_{\epsilon}$] A set of plausible demand distributions considered under distributional uncertainty.
+\item[Wasserstein Ball] A distance-bounded neighborhood around an empirical distribution used in robust optimization.
+\item[DR-RL] Distributionally Robust Reinforcement Learning for policies trained against worst-case distributional shifts.
+\item[Nominal Contamination $\alpha_0$] The baseline contamination level around which robust candidates are evaluated.
+\item[Robustness Radius $\epsilon_\alpha$] The local interval width used for inner minimization over contamination scenarios.
+\item[Query-Tax Surrogate] A constant leakage proxy assigning fixed penalty to suspected reconnaissance queries.
+\item[Revelation Surrogate] A leakage proxy based on $-\log\pi(p\mid\tau)$ to penalize highly informative quotes.
+\item[Limbo Stack] The alternating game-history buffer that stores leader price moves and follower demand responses.
+\item[UX Index] A bounded user-experience metric tracked to evaluate policy side effects on legitimate users.
+\item[Look-to-Book Ratio] The ratio of search-like interactions to completed purchases, used as an operational contamination indicator.
+\item[Hybrid Kappa-Lambda Architecture] A data design combining streaming ingestion with offline and batch learning loops.
+\item[MDP / POMDP] Sequential decision models with full observability (MDP) or partial observability (POMDP).
+\item[Behavioral Model] A model predicting what action is likely to follow from prior actions.
+\item[LLM] Large Language Model served through an inference provider with tool-use capability.
+\item[TPU] Tensor Processing Unit, a specialized accelerator architecture developed by Google.
 \end{description}
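To make the demand-proxy entry in the terminology list concrete, the following Python sketch aggregates weighted actions per item. It is a minimal illustration under assumptions: the glossary only states that $\omega(a)$ maps an action type to a signal strength and that events carry an action, an item target, and a timestamp, so the action names, the weight values, the plain-sum aggregation, and the names `Event`, `OMEGA`, and `demand_proxy` are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    """One interaction tuple: action, item target, timestamp."""
    action: str
    item: int
    timestamp: float


# Hypothetical action weights omega(a); the real values are not given in the text.
OMEGA = {"view": 0.1, "add_to_cart": 0.5, "purchase": 1.0}


def demand_proxy(trajectories, omega=OMEGA):
    """Sum omega(action) over all events, grouped by item (assumed aggregation)."""
    q_hat = defaultdict(float)
    for trajectory in trajectories:
        for event in trajectory:
            # Unknown actions contribute no signal.
            q_hat[event.item] += omega.get(event.action, 0.0)
    return dict(q_hat)


# Two toy sessions: one completes a purchase, one only browses.
sessions = [
    [Event("view", 0, 0.0), Event("add_to_cart", 0, 1.5), Event("purchase", 0, 3.0)],
    [Event("view", 1, 0.0), Event("view", 0, 2.0)],
]
q_hat = demand_proxy(sessions)
print(q_hat)
```

A sum is only one possible "weighted aggregate"; a time-windowed or normalized variant would fit the same definition.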
 \section{Aggregate Compute Budget Derivation}
@@ -81,109 +110,19 @@ v4 & 64 & 275 & $64 \times 275 = 17{,}600$ \\
 Converting to petaFLOPS: $160{,}320\;\text{TFLOPS} = 160.32\;\text{PFLOPS} \approx 160\;\text{PFLOPS}$. This is the theoretical peak under sustained BF16 arithmetic; realized throughput depends on memory bandwidth utilization and inter-chip communication overhead, but the figure serves as a useful upper bound for provisioning decisions.
-\section{Full Slope-Test Derivation: Revenue vs. Contamination}
+\section{Slope-Test Verification: Revenue vs. Contamination}
 \label{app:alpha_revenue_slope}
-This appendix gives the full ordinary least squares computation for the linear effect of contamination on mean revenue. Let
+This appendix provides a compact verification of the slope result reported in the main results section. Using the same run-level pairs $x_i=\texttt{study/alpha}_i$ and $y_i=\texttt{eval/revenue\_mean}_i$ ($n=95$), we re-checked the ordinary least squares slope test in Python with standard test routines (SciPy two-sided $t$ test for the slope).
 \[
-x_i = \texttt{study/alpha}_i, \qquad y_i = \texttt{eval/revenue\_mean}_i,
+\widehat{y}=326{,}878.57-60{,}631.95\,x,
 \]
-and fit
 \[
-y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,\dots,n.
-\]
-The slope test is
-\[
-H_0: \beta_1 = 0 \qquad \text{vs.} \qquad H_1: \beta_1 \neq 0.
+t(93)=-8.2148,\qquad p=1.2038\times 10^{-12},\qquad R^2=0.4205,\qquad 95\%\,\text{CI}_{\beta_1}=[-75{,}288.76,\,-45{,}975.13].
 \]
-\subsection{Sample moments and least-squares coefficients}
+The Python verification reproduces the reported coefficients and inference values, confirming that the slope-test results are correct under standard methods.
-From the data:
-\[
-n=95, \qquad \bar{x}=0.3810526316, \qquad \bar{y}=303{,}774.6096.
-\]
-Define
-\[
-S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad S_{xy}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}).
-\]
-Numerically,
-\[
-S_{xx}=7.0508947368, \qquad S_{xy}=-427{,}509.4691.
-\]
-The least-squares slope and intercept are
-\[
-\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-427{,}509.4691}{7.0508947368} = -60{,}631.9460,
-\]
-\[
-\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 303{,}774.6096 - (-60{,}631.9460)(0.3810526316) = 326{,}878.5722.
-\]
-So the fitted line is
-\[
-\hat{y} = 326{,}878.5722 - 60{,}631.9460\,x.
-\]
-\subsection{Residual variance and standard error of the slope}
-For each observation, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and $e_i = y_i - \hat{y}_i$. The residual sum of squares is
-\[
-\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = 35{,}721{,}896{,}352.27375.
-\]
-With $df=n-2=93$,
-\[
-\mathrm{MSE} = \frac{\mathrm{SSE}}{n-2} = \frac{35{,}721{,}896{,}352.27375}{93} = 384{,}106{,}412.3900.
-\]
-The slope standard error is
-\[
-SE(\hat{\beta}_1) = \sqrt{\frac{\mathrm{MSE}}{S_{xx}}} = \sqrt{\frac{384{,}106{,}412.3900}{7.0508947368}} = 7{,}380.8038.
-\]
-\subsection{t-statistic, p-value, and confidence interval}
-Under $H_0: \beta_1=0$,
-\[
-t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{-60{,}631.9460}{7{,}380.8038} = -8.2148,
-\]
-with $df=93$. The two-sided p-value is
-\[
-p = 2\,\Pr\left(T_{93} \ge |t|\right) = 1.2038\times 10^{-12}.
-\]
-The 95\% confidence interval is
-\[
-\hat{\beta}_1 \pm t_{0.975,93}\,SE(\hat{\beta}_1)
-= -60{,}631.9460 \pm (1.9858)(7{,}380.8038)
-= [-75{,}288.7597,\,-45{,}975.1324].
-\]
-\subsection{Effect size and fit statistics}
-The sample correlation is $r=-0.64846$, so
-\[
-R^2 = r^2 = 0.4205.
-\]
-Hence, 42.05\% of the variation in \texttt{eval/revenue\_mean} is explained by a linear trend in \texttt{study/alpha}.
-The slope interpretation is direct:
-\[
-\hat{\beta}_1 = -60{,}631.9460 \quad \Rightarrow \quad \Delta y \approx -6{,}063.19 \text{ for } \Delta x = +0.1.
-\]
-From $\alpha=0$ to $\alpha=0.8$, the fitted drop is
-\[
-0.8\times (-60{,}631.9460) = -48{,}505.5568,
-\]
-so the model predicts roughly $48{,}506$ lower revenue units on average.
-\subsection{Conclusion of the slope test}
-The estimated model is
-\[
-\hat{y}=326{,}878.57-60{,}631.95\,x,
-\]
-with
-\[
-t(93)=-8.2148, \qquad p=1.2038\times 10^{-12}, \qquad 95\%\,\text{CI}=[-75{,}288.76,\,-45{,}975.13].
-\]
-The slope is therefore strongly negative and statistically different from zero.
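As a sanity check on the arithmetic in the slope-test derivation, the quoted summary statistics can be pushed back through the same formulas in a few lines of Python. This is a hedged sketch rather than the authors' verification script: the function name `slope_test_from_moments` is hypothetical, all inputs are the values printed in the derivation, and the critical value $1.9858$ is taken from the text instead of being recomputed. The $p$-value is not reproduced here because it requires a Student-$t$ distribution routine, which the standard library does not provide.

```python
import math


def slope_test_from_moments(n, x_bar, y_bar, s_xx, s_xy, sse, t_crit):
    """Recompute OLS slope-test quantities from summary statistics."""
    slope = s_xy / s_xx                  # beta_1 = S_xy / S_xx
    intercept = y_bar - slope * x_bar    # beta_0 = y_bar - beta_1 * x_bar
    mse = sse / (n - 2)                  # residual variance, df = n - 2
    se_slope = math.sqrt(mse / s_xx)     # SE(beta_1)
    t_stat = slope / se_slope            # test statistic under H0: beta_1 = 0
    half_width = t_crit * se_slope       # confidence-interval half width
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        "t_stat": t_stat,
        "ci": (slope - half_width, slope + half_width),
    }


# Inputs copied from the derivation; t_crit is the rounded value quoted there.
result = slope_test_from_moments(
    n=95,
    x_bar=0.3810526316,
    y_bar=303774.6096,
    s_xx=7.0508947368,
    s_xy=-427509.4691,
    sse=35721896352.27375,
    t_crit=1.9858,
)
effect_per_tenth = 0.1 * result["slope"]  # change in revenue for +0.1 in alpha
print(result, effect_per_tenth)
```

Because the critical value is rounded to four decimals, the recomputed interval endpoints can differ from the reported ones in the second decimal place.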
 % \input{../build/concatenated_code}