mirror of
https://github.com/velocitatem/PHANTOM.git
synced 2026-05-31 16:43:36 +00:00
updating computation power graph
This commit is contained in:
@@ -210,8 +210,7 @@ The simulator has multiple configurable factors. We design a multi-factor study
|
||||
% Power analysis plan: apply a two-sample Mann-Whitney U (or permutation test) on per-session (delta_H - delta_A) divergence scores comparing the human and agent groups. Compute minimum detectable effect size at alpha=0.05, power=0.8, given n=18 per group. Bootstrap confidence intervals on mean KL are a cleaner complement given the non-normality of divergence distributions.
|
||||
While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.
|
||||
|
||||
% TODO: cite in the apendix the math to get to 160 petaflops of compute
|
||||
Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160 PFLOPS of aggregate compute, which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.
|
||||
Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160\,PFLOPS of aggregate compute (derivation in Appendix~\ref{app:compute_budget}), which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.
|
||||
|
||||
\begin{table}[ht]
|
||||
\centering
|
||||
|
||||
@@ -53,6 +53,31 @@ These behavioral signals serve as inputs for a Distributionally Robust Reinforce
|
||||
\item[Trajectory] Defined as a series of unspecified length, collecting data on states of some object over time.
|
||||
% TODO: maybe define other things in a similar succient manner
|
||||
\end{description}
|
||||
|
||||
\section{Aggregate Compute Budget Derivation}
|
||||
\label{app:compute_budget}
|
||||
|
||||
The claimed peak throughput of approximately 160\,PFLOPS follows from multiplying the per-chip BF16 peak (from official Google Cloud TPU documentation) by the number of chips in each allocation tier and summing across generations.
|
||||
|
||||
\begin{table}[ht]
|
||||
\centering
|
||||
\caption{Per-generation contribution to aggregate BF16 throughput.}
|
||||
\label{tab:compute_derivation}
|
||||
\begin{tabular}{@{}lrrr@{}}
|
||||
\toprule
|
||||
\textbf{TPU Gen.} & \textbf{Chips} & \textbf{Peak BF16/chip (TFLOPS)} & \textbf{Subtotal (TFLOPS)} \\
|
||||
\midrule
|
||||
v6e (Trillium) & 128 & 918 & $128 \times 918 = 117{,}504$ \\
|
||||
v5e & 128 & 197 & $128 \times 197 = 25{,}216$ \\
|
||||
v4 & 64 & 275 & $64 \times 275 = 17{,}600$ \\
|
||||
\midrule
|
||||
\textbf{Total} & \textbf{320} & & $\mathbf{160{,}320}$ \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\end{table}
|
||||
|
||||
Converting to petaFLOPS: $160{,}320\;\text{TFLOPS} = 160.32\;\text{PFLOPS} \approx 160\;\text{PFLOPS}$. This is the theoretical peak under sustained BF16 arithmetic; realized throughput depends on memory bandwidth utilization and inter-chip communication overhead, but the figure serves as a useful upper bound for provisioning decisions.
|
||||
|
||||
% \input{../build/concatenated_code}
|
||||
|
||||
\end{document}
|
||||
|
||||
Reference in New Issue
Block a user