adding missing ideas and appendix kl

2026-07-15 17:43:36 +00:00 · 2026-04-08 21:46:54 +02:00
parent cc823ec63c
commit e72e3c81c1
5 changed files with 16 additions and 11 deletions
--- a/paper/src/auto/main.el
+++ b/paper/src/auto/main.el
@@ -21,6 +21,6 @@
   (LaTeX-add-labels
    "app:compute_budget"
    "tab:compute_derivation"
-    "app:whoclicked_card"))
+    "app:kl_zeros"))
 :latex)
--- a/paper/src/chapters/03-methodology.tex
+++ b/paper/src/chapters/03-methodology.tex
@@ -223,8 +223,7 @@ The dynamic pricing mechanism elicited immediate behavioral adjustments. Partici
 \subsubsection{Design of Training Factorial Study}
-The simulator has multiple configurable factors. We design a multi-factor study across five axes derived from the sweep configurations: (1) RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}; 4 levels), (2) contamination ratio $\alpha$ sampled from $[0.1, 0.6]$ at four representative levels, (3) robustness radius $\epsilon_\alpha \in \{0.0, 0.15, 0.3\}$ (3 levels), (4) COI penalty weight $\lambda_\text{coi}$ at two reference levels, and (5) pricing action granularity (two discretization settings for \texttt{action\_levels}); giving a grid of $4\times4\times3\times2\times2 = 192$ configurations. Statistical power for the behavioral comparisons is determined by a two-sample test over per-session KL divergence scores; a formal power analysis with minimum detectable effect size at $n_H=13$, $n_A=16$ is reported in the results.
+The simulator has multiple configurable factors. We design a multi-factor study across five axes derived from the sweep configurations: (1) RL algorithm (\texttt{ppo}, \texttt{a2c}, \texttt{dqn}, \texttt{qtable}; 4 levels), (2) contamination ratio $\alpha$ at four representative levels in $[0.1, 0.6]$, (3) robustness radius $\epsilon_\alpha \in \{0.0, 0.15, 0.3\}$ (3 levels), (4) COI penalty weight $\lambda_{\text{coi}}$ at two reference levels, and (5) pricing action granularity (two discretization settings for \texttt{action\_levels}), giving a full grid of $4\times4\times3\times2\times2 = 192$ configurations. Behavioral distinguishability between the human and agent cohorts is assessed with a two-sample Mann--Whitney $U$ test on per-session divergence gap scores, with cohort sizes $n_H=13$ (human) and $n_A=16$ (agent).
 % Power analysis plan: apply a two-sample Mann-Whitney U (or permutation test) on per-session (delta_H - delta_A) divergence scores comparing the human and agent groups. Compute minimum detectable effect size at alpha=0.05, power=0.8, given n_H=13 and n_A=16. Bootstrap confidence intervals on mean KL are a cleaner complement given the non-normality of divergence distributions.
 While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.
 Our training budget is provisioned through TPU Research Cloud and spans 384 chips across TPU v4, v5e, and v6e generations, with a spot-heavy allocation plus an on-demand reserve. At peak BF16 throughput this corresponds to approximately 160\,PFLOPS of aggregate compute (derivation in Appendix~\ref{app:compute_budget}), which makes repeated seeds, ablations, and sensitivity sweeps feasible within practical wall-clock limits. We allocate v6e capacity to the highest-intensity policy training jobs, use v5e for wider hyperparameter exploration where throughput-per-dollar is favorable, and reserve on-demand v4 capacity for runs that should not be interrupted.
@@ -504,6 +503,7 @@ Practical implementation of browser agents is a strongly evolving field with nea
 As part of reward engineering, we keep a UX factor ($UX\in[0,1]$) as an auxiliary evaluation axis. In code, the UX index is implemented as a volatility penalty on relative price changes, with an extra upward-volatility component weighted by $0.5$ and scaled by $\eta_{\text{ux}}$ and an information-budget term. We also keep a separate supra-competitive penalty tied to persistent price excess above a competitive anchor, which punishes high-price behavior even when volatility is low.
 We measure volatility as mean absolute relative price movement, $v_t=\frac{1}{N}\sum_{i=1}^N\bigl|(p_{t,i}-p_{t-1,i})/\max(p_{t-1,i},1)\bigr|$.
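The volatility formula above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's code: the function name `mean_abs_relative_move` and the example prices are hypothetical, and only the formula for $v_t$ is taken from the text.

```python
from typing import Sequence


def mean_abs_relative_move(prev_prices: Sequence[float], prices: Sequence[float]) -> float:
    """Mean absolute relative price movement v_t over N products (hypothetical helper)."""
    if len(prev_prices) != len(prices) or not prices:
        raise ValueError("price vectors must be non-empty and equally long")
    total = 0.0
    for p_prev, p_now in zip(prev_prices, prices):
        # The denominator is clamped at 1, as in the formula, so that
        # near-zero previous prices do not inflate the relative change.
        total += abs((p_now - p_prev) / max(p_prev, 1.0))
    return total / len(prices)


if __name__ == "__main__":
    # Illustrative prices only; the third product shows the clamp in action.
    print(mean_abs_relative_move([10.0, 20.0, 0.5], [11.0, 18.0, 1.0]))
```

The clamp means that a move from 0.5 to 1.0 counts as an absolute change of 0.5 rather than a 100\% relative change.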
 \begin{figure}[ht]
  \centering
--- a/paper/src/main-genpop.tex
+++ b/paper/src/main-genpop.tex
@@ -84,4 +84,11 @@ v4             &  64 & 275 & $64  \times 275 = 17{,}600$  \\
 Converting to petaFLOPS: 160,320 TFLOPS equals approximately 160 PFLOPS. This is the theoretical peak under sustained arithmetic operations; realized throughput depends on memory bandwidth utilization and inter-chip communication overhead, but the figure serves as a useful upper bound for provisioning decisions.
 \section{KL divergence when the reference has zeros}
 \label{app:kl_zeros}
 The textbook definition $D_{\mathrm{KL}}(P\parallel Q)=\sum_k P(k)\log(P(k)/Q(k))$ cannot be used as-is when the empirical reference assigns $Q(k)=0$ to an outcome the session distribution still visits: if $P(k)>0$ and $Q(k)=0$, that term is infinite. With only 29 sessions the estimated transition rows are very sparse, so transitions ``never seen in the prototype'' occur frequently.
 In code we apply the standard fix: add a small floor $\varepsilon$ to both the numerator and the denominator inside the logarithm so that neither is exactly zero, which turns the sum into a finite, smoothed surrogate rather than a literal KL divergence on raw counts. We also skip source states that are absent from the reference kernel, because there is no reference row to compare against. This keeps the pipeline running and the divergence scores on a comparable scale, at the cost that the result is a regularized KL-like score rather than an exact information-theoretic quantity. That trade-off is acceptable here because we use only the gap between human-anchored and agent-anchored scores as a weak separability signal, not as a calibrated measurement.
 \end{document}
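The floored divergence described in the `app:kl_zeros` appendix can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the names `smoothed_row_kl` and `smoothed_kernel_kl`, the nested-dict kernel layout, the default `eps`, and the choice to average over shared source states are all hypothetical. Only the two rules from the text are taken as given: add $\varepsilon$ to numerator and denominator inside the log, and skip source states missing from the reference kernel.

```python
import math
from typing import Dict

# Assumed layout: kernel[source_state][next_state] = transition probability.
Kernel = Dict[str, Dict[str, float]]


def smoothed_row_kl(p_row: Dict[str, float], q_row: Dict[str, float], eps: float = 1e-8) -> float:
    """Sum of P(k) * log((P(k) + eps) / (Q(k) + eps)) over outcomes with P(k) > 0."""
    total = 0.0
    for k, p in p_row.items():
        if p <= 0.0:
            continue  # zero-weight terms contribute nothing
        q = q_row.get(k, 0.0)  # unseen in the reference -> Q(k) = 0, floored by eps
        total += p * math.log((p + eps) / (q + eps))
    return total


def smoothed_kernel_kl(session: Kernel, reference: Kernel, eps: float = 1e-8) -> float:
    """Average the row scores over source states present in both kernels (assumed aggregation)."""
    shared = [s for s in session if s in reference]
    if not shared:
        return float("nan")  # nothing to compare against
    return sum(smoothed_row_kl(session[s], reference[s], eps) for s in shared) / len(shared)


if __name__ == "__main__":
    session = {"cart": {"pay": 0.5, "exit": 0.5}, "promo": {"cart": 1.0}}
    reference = {"cart": {"pay": 1.0}}  # "exit" never observed; "promo" row missing
    print(smoothed_kernel_kl(session, reference))  # finite despite Q("exit") = 0
```

The score grows roughly like $-\log\varepsilon$ for every unit of mass the session places on unseen transitions, so the chosen floor sets the scale of the penalty and should be held fixed across the scores being compared.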
--- a/paper/src/main.tex
+++ b/paper/src/main.tex
@@ -111,15 +111,13 @@ v4             &  64 & 275 & $64  \times 275 = 17{,}600$  \\
 Converting to petaFLOPS: $160{,}320\;\text{TFLOPS} = 160.32\;\text{PFLOPS} \approx 160\;\text{PFLOPS}$. This is the theoretical peak under sustained BF16 arithmetic; realized throughput depends on memory bandwidth utilization and inter-chip communication overhead, but the figure serves as a useful upper bound for provisioning decisions.
-\section{whoclickedit Dataset Card}
+\section{KL divergence when the reference has zeros}
-\label{app:whoclicked_card}
+\label{app:kl_zeros}
-For transparency and reproducibility, this appendix includes the full dataset card used for the public release of the \texttt{whoclickedit} dataset.
+The textbook definition $D_{\mathrm{KL}}(P\parallel Q)=\sum_k P(k)\log(P(k)/Q(k))$ cannot be used as-is when the empirical reference assigns $Q(k)=0$ to an outcome the session distribution still visits: if $P(k)>0$ and $Q(k)=0$, that term is infinite. With only 29 sessions the estimated transition rows are very sparse.
 In code we apply the standard fix: add a small floor $\varepsilon$ to both the numerator and the denominator inside the logarithm so that neither is exactly zero, which turns the sum into a finite, smoothed surrogate rather than a literal KL divergence on raw counts. We also skip source states that are absent from the reference kernel, because there is no reference row to compare against. This keeps the pipeline running and the divergence scores on a comparable scale, at the cost that the result is a regularized KL-like score rather than an exact information-theoretic quantity. That trade-off is acceptable here because we use only the gap between human-anchored and agent-anchored scores as a weak separability signal.
 \lstinputlisting[
  caption={whoclickedit dataset card (README snapshot)},
  label={lst:whoclicked_dataset_card}
 ]{chapters/auto/whoclicked_dataset_card.md}
 % \input{../build/concatenated_code}
--- a/paper/src/mirrors/genpop/03-methodology.tex
+++ b/paper/src/mirrors/genpop/03-methodology.tex
@@ -132,7 +132,7 @@ The dynamic pricing mechanism elicited immediate behavioral adjustments. Partici
 \subsubsection{Design of Training Factorial Study}
-The simulator has multiple configurable factors. We design a multi-factor study across five axes derived from the sweep configurations: (1) RL algorithm (PPO, A2C, DQN, Q-table; 4 levels), (2) contamination ratio sampled at four representative levels between 0.1 and 0.6, (3) robustness radius (3 levels), (4) COI penalty weight at two reference levels, and (5) pricing action granularity (two discretization settings for action levels); giving a grid of 192 configurations. Statistical power for the behavioral comparisons is determined by a two-sample test over per-session divergence scores.
+The simulator has multiple configurable factors. We design a multi-factor study across five axes derived from the sweep configurations: (1) RL algorithm (PPO, A2C, DQN, Q-table; 4 levels), (2) contamination ratio at four representative levels between 0.1 and 0.6, (3) robustness radius (3 levels), (4) COI penalty weight at two reference levels, and (5) pricing action granularity (two discretization settings for action levels), giving a full grid of 192 configurations. Behavioral distinguishability between the human and agent cohorts is assessed with a two-sample Mann--Whitney $U$ test on per-session divergence gap scores, with cohort sizes $n_H=13$ (human) and $n_A=16$ (agent).
 While this scale is generally expensive for reinforcement learning, we execute it on a large TPU cluster to make the sweep tractable.
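The two-sample comparison on per-session gap scores mentioned in the methodology paragraph can be sketched with the standard library alone. This is an illustrative sketch, not the analysis code used for the paper: the function names, the Monte Carlo permutation approach to the p-value, the resample count, and the synthetic scores in the example are all assumptions. Only the Mann--Whitney $U$ statistic and the cohort sizes $n_H=13$, $n_A=16$ come from the text.

```python
import random
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic for x: count of pairs with x_i > y_j, ties counted as 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def permutation_p_value(x: Sequence[float], y: Sequence[float],
                        n_resamples: int = 5000, seed: int = 0) -> float:
    """Two-sided Monte Carlo permutation p-value for the U statistic."""
    rng = random.Random(seed)
    n_x = len(x)
    center = n_x * len(y) / 2.0  # expected U when group labels are exchangeable
    observed = abs(mann_whitney_u(x, y) - center)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of sessions
        stat = abs(mann_whitney_u(pooled[:n_x], pooled[n_x:]) - center)
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    gen = random.Random(1)
    # Synthetic gap scores for illustration only; not data from the study.
    human = [gen.gauss(0.4, 0.3) for _ in range(13)]
    agent = [gen.gauss(0.0, 0.3) for _ in range(16)]
    print(mann_whitney_u(human, agent), permutation_p_value(human, agent))
```

A permutation p-value avoids the normal approximation, which is a reasonable precaution at these cohort sizes. An exact enumeration of all relabellings would be much more expensive than the sampled version shown here.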