updating appendix

2026-07-15 17:43:36 +00:00 · 2026-04-10 11:29:21 +02:00
parent 03b4996bea
commit d36a34ead9
3 changed files with 1 additions and 1 deletions
--- a/paper/src/chapters/mdp_agent.pdf
+++ b/paper/src/chapters/mdp_agent.pdf
--- a/paper/src/chapters/mdp_human.pdf
+++ b/paper/src/chapters/mdp_human.pdf
--- a/paper/src/main.tex
+++ b/paper/src/main.tex
@@ -125,7 +125,7 @@ The textbook definition $D_{\mathrm{KL}}(P\parallel Q)=\sum_k P(k)\log(P(k)/Q(k)
 In code we do the basic fix: add a tiny floor $\varepsilon$ to both the numerator and denominator inside the log so nothing is exactly zero, which turns the sum into a finite, smoothed surrogate rather than a literal KL to raw counts. We also skip source states that do not exist at all in the reference kernel, because there is nowhere honest to compare against. This keeps the pipeline running and the divergence scores on a comparable scale, at the cost that the number is regularized KL behavior, not a purist information-theoretic quantity, which is acceptable here because we only use the gap between human-anchored and agent-anchored scores as a weak separability signal.


-\section{Why the logarithm appears in the revelation surrogate}
+\section{Expanding the Intuition of Information Value in the Reward}
 \label{app:revelation_log}

 Leakage is $\text{COI}_{\text{leak}} = f(\tau')\cdot\text{InfoValue}$. The query-tax form fixes $\text{InfoValue}=c>0$. The revelation form sets $\text{InfoValue}(p,\tau')=-\log\pi(p\mid\tau')$, with $\pi(\cdot\mid\tau')$ the policy distribution over quoted prices in context $\tau'$ (discretized as in the engine).