updating with newly built algorithm

2026-07-15 17:43:36 +00:00 · 2026-01-24 12:20:40 +01:00
parent 7d55a0ee4c
commit 88cb1251ea
1 changed files with 47 additions and 2 deletions
--- a/paper/src/chapters/03-methodology.tex
+++ b/paper/src/chapters/03-methodology.tex
@@ -27,6 +27,7 @@ where $\omega: \mathcal{A} \to \mathbb{R}_+$ assigns weights to actions based on
 \subsubsection{Actor Types and Demand Curves}
 We formalize the heterogeneity of actors by introducing a type space $\Theta$. An actor of class $Y_s$ is further parameterized by a type $\theta \sim \mathcal{D}_{Y}$. This type determines the actor's demand response function $d(p; \theta)$, sampled from a distribution of possible demand curves. The total observed demand is a stochastic process governed by the naively defined mixture:
 \begin{equation}
 \label{eq:mixture_demand}
 Q(p) = (1-\alpha) \cdot \mathbb{E}_{\theta \sim \mathcal{D}_H}[d(p; \theta)] + \alpha \cdot \mathbb{E}_{\theta \sim \mathcal{D}_A}[d(p; \theta)] + \epsilon_t
 \end{equation}
 where $\alpha \in [0, 1]$ represents the contamination parameter (proportion of agents) and $\epsilon_t$ is non-stationary market noise.
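 As a purely illustrative numerical instance of Eq.~\ref{eq:mixture_demand} (the values are assumed for exposition and are not estimates from data), take $\alpha = 0.2$, $\mathbb{E}_{\theta \sim \mathcal{D}_H}[d(p; \theta)] = 100$ and $\mathbb{E}_{\theta \sim \mathcal{D}_A}[d(p; \theta)] = 40$ at some fixed price $p$. Setting the noise term aside, the mixture gives
 \begin{equation*}
 Q(p) - \epsilon_t = (1 - 0.2) \cdot 100 + 0.2 \cdot 40 = 80 + 8 = 88 ,
 \end{equation*}
 so a 20\% agent share moves expected observed demand from the purely human level of $100$ to $88$, even though no human actor changed behavior.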
@@ -274,8 +275,10 @@ This new classified can then be used in the reinforcement learning reward struct
 We formulate the pricing problem as a Stackelberg game in which the Platform (Leader) sets prices $p_t$ and the Aggregate Demand (Follower) responds. However, the exact mixing parameter $\alpha$ and the shift in the demand distribution are non-stationary and unknown in online settings, so relying on a simple error term $\epsilon$ is insufficient. Instead, we adopt a Distributionally Robust Optimization (DRO) objective. We first formulate the full dependency chain, starting from a trajectory $\tau^\prime$ that is newly observed by the platform and generated by an unknown actor type (sampled over a behavioral profile defined in Section~\ref{sec:tpe}). Dynamic pricing requires a mapping of demand parameterized by a trajectory and a price, $\hat{Q}(p, \tau^\prime)$. For an observed trajectory we compute a new $\hat{\mathcal{T}}^\prime$ and, using baseline controlled observations of both $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$, compute the following at inference time:
 \begin{align}
-  \Delta_H = D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_H) \\
+  \label{eq:delta_H}
-  \Delta_A = D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_A)
+  \Delta_H &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_H) \\
  \label{eq:delta_A}
  \Delta_A &= D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_A)
 \end{align}
 This creates two centroid-like heuristics which can, at per-session granularity, guide our mixing parameter $\alpha$.
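 To make these quantities concrete, one possible instantiation (an assumption made here for illustration; the actual construction of $\hat{\mathcal{T}}^\prime$ may differ) treats each kernel as a row-stochastic matrix over a finite state set $\mathcal{S}$ and averages the row-wise divergences under a hypothetical state-visitation weight $\mu$ estimated from $\tau^\prime$:
 \begin{equation*}
 D_{KL}(\hat{\mathcal{T}}^\prime \parallel \bar{\mathcal{T}}_Y) = \sum_{s \in \mathcal{S}} \mu(s) \sum_{s^\prime \in \mathcal{S}} \hat{\mathcal{T}}^\prime(s^\prime \mid s) \log \frac{\hat{\mathcal{T}}^\prime(s^\prime \mid s)}{\bar{\mathcal{T}}_Y(s^\prime \mid s)}, \qquad Y \in \{H, A\}.
 \end{equation*}
 Under this form, a small $\Delta_H$ together with a large $\Delta_A$ indicates a session that transitions like the human baseline, and vice versa. The divergence is finite only if $\bar{\mathcal{T}}_Y(s^\prime \mid s) > 0$ wherever $\hat{\mathcal{T}}^\prime(s^\prime \mid s) > 0$, so the baseline kernels may need smoothing before they are used.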
@@ -290,6 +293,7 @@ This set captures all distributions that are statistically close to our observed
 \subsubsection{The Min-Max Objective}
 The robust policy $\pi^*$ is obtained by solving the maximin problem:
 \begin{equation}
 \label{eq:robust_policy}
 \pi^* = \arg \max_{\pi} \min_{Q \in \mathcal{U}_\epsilon} \mathbb{E}_{d \sim Q} \left[ R(p, d) - \lambda \cdot \text{COI}(p) \right]
 \end{equation}
 where $R(p, d)$ is the revenue function and $\lambda$ weighs the penalty for information leakage (COI). We previously defined $\text{COI}$; however, to connect this concept to the reward structure we need a parametrized version, $\text{COI}(p)$, which informs us of the leakage of said structure at price $p$.
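 The effect of the inner minimization can be seen in a toy instance (all numbers are assumed for illustration). Suppose the policy chooses between two prices $p_1, p_2$, the ambiguity set contains only two demand distributions, $\mathcal{U}_\epsilon = \{Q_1, Q_2\}$, and the penalized objective $\mathbb{E}_{d \sim Q}[R(p,d) - \lambda \cdot \text{COI}(p)]$ takes the values
 \begin{equation*}
 \begin{array}{c|cc|c}
  & Q_1 & Q_2 & \min_{Q} \\ \hline
 p_1 & 10 & 4 & 4 \\
 p_2 & 7 & 6 & 6
 \end{array}
 \end{equation*}
 A policy that trusts the nominal distribution $Q_1$ selects $p_1$ (value $10$), whereas the maximin rule of Eq.~\ref{eq:robust_policy} selects $p_2$, whose worst case ($6$) exceeds that of $p_1$ ($4$).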
@@ -312,6 +316,47 @@ As part of our reward engineering we think about the UX factor ($UX \in [0,1]$)
 We also need to consider a taxation-like policy for agents, drawing on strategy-proof mechanism design, specifically the Vickrey-Clarke-Groves (VCG) payment rule. We link this rule to our setting and prove that it creates an incentive under which truth-telling becomes the dominant strategy.
 \subsubsection{Pricing Mechanism Summary}
 We now present the complete pricing mechanism that integrates the behavioral separability, contamination estimation, and robust optimization components developed in the preceding sections. Algorithm~\ref{alg:phantom_loop_clean} formalizes the defensive pricing loop as a Stackelberg game where the platform (leader) sets prices and the aggregate demand (follower) responds through observed session trajectories.
 \begin{algorithm}[t]
 \caption{PHANTOM defensive pricing loop (bachelor-thesis level)}
 \label{alg:phantom_loop_clean}
 \DontPrintSemicolon
 \SetKwInOut{Input}{Input}\SetKwInOut{Output}{Output}
 \Input{catalog size \(N\); costs \(c\); reference prices \(p^{ref}\); behavior models \(\bar T_H,\bar T_A\);
 action weights \(\omega\); penalty \(\lambda\); horizon \(T\); sessions per step \(M\)}
 \Output{price/demand trajectory \(\{(p_t,\hat Q_t,\hat\alpha_t)\}_{t=0}^{T-1}\)}
 Initialize contamination estimate \(\hat\alpha \leftarrow 0.2\)\;
 \For{\(t \leftarrow 0\) \KwTo \(T-1\)}{
  set \(p_t \leftarrow \pi(\cdot) \) %c + (1 - \kappa \hat\alpha)\,(p^{ref}-c)\)\;
  and clip \(p_t\) to a feasible range (e.g., near cost up to a max margin)\;
  \(\hat Q_t \leftarrow 0\), \(\mathcal S_t \leftarrow \emptyset\); \tcp{Observe sessions and compute demand proxy (Eq.~2)}
  \For{\(m \leftarrow 1\) \KwTo \(M\)}{
    sample a session trajectory \(\tau_m\) using \(\bar T_H\) or \(\bar T_A\)\;
    \(\hat Q_t \leftarrow \hat Q_t + \sum_{k}\omega(a_{m,k})\)\;
    \(\mathcal S_t \leftarrow \mathcal S_t \cup \{\tau_m\}\)\;
  }
  \tcp{Estimate contamination from behavioral separability}
  compute \(\hat\alpha \leftarrow \frac{1}{M}\sum_{\tau\in\mathcal S_t} \Big[\sigma\big(\beta(\Delta_H(\tau)-\Delta_A(\tau))\big)\Big]\)\;
  compute \(J_t \leftarrow \text{Revenue}(p_t,\hat Q_t) - \lambda\cdot \text{COILeak}(\hat\alpha)\)\;
 }
 \end{algorithm}
 The algorithm operates in discrete epochs indexed by $t$. At each epoch, the platform publishes prices (leader move), observes the resulting session trajectories (follower response), and updates its contamination estimate based on behavioral divergence from the learned human and agent transition kernels $\bar{\mathcal{T}}_H$ and $\bar{\mathcal{T}}_A$. The history buffer $\mathcal{L}$ (termed ``Limbo'' in our implementation) enforces the alternating Stackelberg structure by maintaining the temporal sequence of price publications and demand observations.
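 As a worked instance of the contamination update (values assumed for illustration), let $M = 3$, $\beta = 2$, and let the three sessions have divergence gaps $\Delta_H(\tau) - \Delta_A(\tau)$ of $0.8$, $-0.5$ and $0$. Taking $\sigma$ to be the logistic function $\sigma(x) = 1/(1 + e^{-x})$, which is the usual reading of the symbol and is assumed here,
 \begin{equation*}
 \hat\alpha = \tfrac{1}{3}\big[\sigma(1.6) + \sigma(-1.0) + \sigma(0)\big] \approx \tfrac{1}{3}\,(0.832 + 0.269 + 0.500) \approx 0.534 .
 \end{equation*}
 A session that lies closer to the agent kernel than to the human kernel (positive gap) contributes more than $0.5$, an ambiguous session contributes exactly $0.5$, and a larger $\beta$ pushes each nonzero-gap contribution toward $0$ or $1$.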
 %The defensive price update in Line 24 implements a contamination-aware margin shrinkage: as the estimated agent contamination $\hat{\alpha}_t$ increases, the margin $(p^{\mathrm{ref}} - c)$ is proportionally reduced by factor $\kappa \in [0,1]$, with projection $\Pi_{\mathcal{P}}$ ensuring prices remain within the feasible set $\mathcal{P}$. In subsequent experiments, this heuristic update is replaced by the DR-RL policy $\pi^*$ from Eq.~\ref{eq:robust_policy}, which optimizes against the Wasserstein ambiguity set $\mathcal{U}_\epsilon$ rather than relying on a fixed margin adjustment rule.
 \section{Heuristics as part of neuro-inspired steering systems}
 Steve Burns, superior colliculus (face heuristics): we create this sort of part of the `brain' plus amortized inference.