updating computation power graph

2026-07-16 01:53:37 +00:00 · 2026-03-08 14:22:54 +01:00
parent 17c128cbc0
commit 28dbcacd95
6 changed files with 142 additions and 114 deletions
--- a/paper/src/chapters/03-methodology.tex
+++ b/paper/src/chapters/03-methodology.tex
@@ -210,8 +210,7 @@ The simulator has multiple configurable factors. We design a multi-factor study
 % Power analysis plan: apply a two-sample Mann-Whitney U (or permutation test) on per-session (delta_H - delta_A) divergence scores comparing the human and agent groups. Compute minimum detectable effect size at alpha=0.05, power=0.8, given n=18 per group. Bootstrap confidence intervals on mean KL are a cleaner complement given the non-normality of divergence distributions.
 While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.

-% TODO: cite in the apendix the math to get to 160 petaflops of compute
-Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160 PFLOPS of aggregate compute, which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.
+Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160\,PFLOPS of aggregate compute (derivation in Appendix~\ref{app:compute_budget}), which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.

 \begin{table}[ht]
 \centering
--- a/paper/src/main.tex
+++ b/paper/src/main.tex
@@ -53,6 +53,31 @@ These behavioral signals serve as inputs for a Distributionally Robust Reinforce
 \item[Trajectory] Defined as a series of unspecified length, collecting data on states of some object over time.
 % TODO: maybe define other things in a similar succient manner
 \end{description}
+
+\section{Aggregate Compute Budget Derivation}
+\label{app:compute_budget}
+
+The claimed peak throughput of approximately 160\,PFLOPS follows from multiplying the per-chip BF16 peak (from official Google Cloud TPU documentation) by the number of chips in each allocation tier and summing across generations.
+
+\begin{table}[ht]
+\centering
+\caption{Per-generation contribution to aggregate BF16 throughput.}
+\label{tab:compute_derivation}
+\begin{tabular}{@{}lrrr@{}}
+\toprule
+\textbf{TPU Gen.} & \textbf{Chips} & \textbf{Peak BF16/chip (TFLOPS)} & \textbf{Subtotal (TFLOPS)} \\
+\midrule
+v6e (Trillium) & 128 & 918 & $128 \times 918 = 117{,}504$ \\
+v5e            & 128 & 197 & $128 \times 197 = 25{,}216$  \\
+v4             &  64 & 275 & $64  \times 275 = 17{,}600$  \\
+\midrule
+\textbf{Total} & \textbf{320} & & $\mathbf{160{,}320}$ \\
+\bottomrule
+\end{tabular}
+\end{table}
+
+Converting to petaFLOPS: $160{,}320\;\text{TFLOPS} = 160.32\;\text{PFLOPS} \approx 160\;\text{PFLOPS}$. This is the theoretical peak under sustained BF16 arithmetic; realized throughput depends on memory bandwidth utilization and inter-chip communication overhead, but the figure serves as a useful upper bound for provisioning decisions.
+
 % \input{../build/concatenated_code}

 \end{document}