constrainative proposals

2026-07-16 01:53:37 +00:00 · 2026-01-19 19:03:04 +01:00
parent a0ddde32df
commit 0800ee189c
1 changed files with 22 additions and 2 deletions
--- a/paper/src/chapters/03-methodology.tex
+++ b/paper/src/chapters/03-methodology.tex
@@ -225,6 +225,18 @@ To develop a robust pricing learner, we require a simulation environment capable
 \subsubsection{GOFAI-Based Separability}
 We employ Good Old-Fashioned AI (GOFAI) heuristics to generate initial weak labels for separability. We define a set of rule-based predicates $\phi_j: \tau \to \{0, 1\}$ to partition the dataset $\mathcal{D}$ into high-confidence sets $\mathcal{D}_H$ and $\mathcal{D}_A$. We construct distinct MDPs per each behavioral profile of humans and agents and from those we establish $D_{KL}$. From initial findings we compute a KL divergence of $\approx 2.0236$ across transition probabilities between states which can be seen in \ref{fig:human_mdp_viz} and \ref{fig:agent_mdp_viz}.
 \begin{definition}[Kullback-Leibler Divergence for Transition Distributions]
 Let $P_e$ and $Q_e$ be categorical distributions over destination states following event $e$, derived from human and agent trajectories respectively. The KL divergence between these distributions is:
 \begin{equation}
  D_{\mathrm{KL}}(P_e \parallel Q_e) = \sum_{k \in \mathcal{S}_e} P_e(k) \log \frac{P_e(k)}{Q_e(k)}
 \end{equation}
 where $\mathcal{S}_e$ denotes the set of destination events that follow $e$ in the human trajectories.
 \end{definition}
 To obtain this statistic we aggregate state transitions by their triggering event $e$ and treat the normalized outgoing probabilities as the categorical distributions $P_e$ (human) and $Q_e$ (agent). The computation intersects the event labels observed in both datasets, then iterates over each label and accumulates the log-ratio score. In practice this is implemented exactly as in \texttt{sim/rl/behavior_loader/models.py}: for each destination $k$ we multiply the human probability by the log of the probability ratio and add the result to the running sum. Large contributions (including the case where $Q_e(k)$ is near zero) point to intents, such as rapid checkout or repeated navigation, that the agent policy fails to reproduce and therefore drive the contamination analysis.
 With this divergence we train a contrastive learning method to estimate a weak probability of a given trajectory being an agent $f(\cdot) \to [0,1]$ which we can use as a leverage for a weighted sum. This is a first attempt at a more informed separability.
 \subsubsection{Transition Probability Estimation}
 \label{sec:tpe}
@@ -249,13 +261,21 @@ where $N(s, s')$ is the count of observed transitions. This allows us to constru
    \caption{Markov Decision Process visualization illustrating the behavioral transition dynamics for \textbf{agent} behavior profiles. The state space and transition probabilities are learned from observed session trajectories to enable generative contamination.}
    \label{fig:agent_mdp_viz}
  \end{figure}
 \subsection{Stronger Classification}
 We re-map the current event schema semantically to the event schema of another dataset. Our contaminated dataset is then used in another classifier where we can now also apply better feature engineering on other features while assigning correct lables to the entire dataset so the new dataset can be contaminated with $\mathcal{g}$ under some different contamination ration $\alpha$.
 This new classified can then be used in the reinforcement learning reward structure.
 \subsection{Distributionally Robust Reinforcement Learning (DR-RL)}
 We formulate the pricing problem as a Stackelberg Game where the Platform (Leader) sets prices $p_t$ and the Aggregate Demand (Follower) responds. However, the exact mixing parameter $\alpha$ and the demand distribution shift are non-stationary and unknown in online settings. Relying on a simple error term $\epsilon$ is insufficient. Instead, we adopt a Distributionally Robust Optimization (DRO) objective. To formulate the entire dependency chain from the trajctory $\tau^\prime$ which is a newly observed trajectory observed by the platform and generated by an unknown actor type (sampled over a behavioral profile defined in section \ref{sec:tpe}). As part of the dynamic pricing we need a mapping of demand parameterized by a trajectory and a price $\hat{Q}(p, \tau^\prime)$. For an observed trajectory we compute a new $\hat{\mathcal{T}}^\prime$ and using a baseline controlled observations of both $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$ we can compute during inference time the following:
 \begin{align}
-  \Delta_H = D_{KL}(\hat{\mathcal{T}}^\prime \vert\vert \bar{\mathcal{T}}_H) \\
+  \Delta_H = D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_H) \\
-  \Delta_A = D_{KL}(\hat{\mathcal{T}}^\prime \vert\vert \bar{\mathcal{T}}_A)
+  \Delta_A = D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_A)
 \end{align}
 This creates two centroid-like heuristics which can on a per-session granularity basis guide our mixing paramtere $\alpha$.