# PHANTOM: setup for operators and partners

This guide walks a team from **business context** (what you sell, how you price, what traffic you worry about) through a **running PHANTOM stack**, **behavioral kernels and contamination**, and **RL training / benchmarking**. The math lives in the thesis PDF; here we tie operations to that math without re-deriving it. References to the thesis use **chapter numbers** only (build the PDF locally if you need line-level citations).

**Thesis (PDF):** [thesis-latest.pdf](https://pub-d5b94a3c29fd40c6b3881946e463fdb7.r2.dev/thesis-latest.pdf)

---

## 1. Who this is for / prerequisites

**Audience:** Engineers and researchers who run Docker, a Next.js app, and Python tooling; product or risk stakeholders who define experiment goals and acceptable UX tradeoffs.

**Skills:** Docker Compose, Node/npm, Python 3.8+, basic Kafka/Redis mental model.

**Decide up front:**

- **Vertical vs demo:** The repo ships `hotel` and `airline` storefront modes (`STORE_MODE`). Anything beyond that is custom integration work.
- **Data residency:** Event streams and training artifacts default to paths under the repo (overridable via `PHANTOM_*` env vars in `lib/config.py`). Decide where logs and models may live before you point production-like traffic at the stack.
- **Experiment governance:** Who may run human vs agent sessions, how sessions are labeled or weak-labeled for research, and retention policy for interaction logs.

### Theoretical implications

The formal model assumes each session is generated by a latent **actor class** $Y \in \{H,A\}$ (human vs agent). Your deployment choices implicitly assert **which sessions are valid for estimating human vs agent behavior** and whether experimental conditions are stable. If you mix exploratory QA traffic with labeled experiments without recording that fact, you blur the empirical partitions $D_H$ and $D_A$ that the methodology needs for transition kernels and contamination studies.
See the **Introduction** (research questions) and **Methodology**, Problem Formalization, in the thesis PDF.

---

## 2. Business fit framing

**What PHANTOM is for:** Studying how **automated browsing and transaction orchestration** interact with **session-based pricing**: behavior generates a demand proxy $\hat{q}$; pricing policies map interaction history to prices; **Cost of Information (COI)** is the premium the platform can sustain above a floor when information is scarce. Agent-mediated **reconnaissance in one session** and **purchase in another** undermines that asymmetry; the thesis proves a **COI erosion** mechanism under many independent price queries.

**What you must supply:**

- A **product catalog** path: defaults assume Supabase-backed product data (`NEXT_PUBLIC_SUPABASE_URL`, `NEXT_PUBLIC_SUPABASE_ANON_KEY`).
- A plan for **interaction and price events** reaching the ingestion path (backend → Kafka) or an adapter you maintain.
- Clear **experiment goals:** e.g. compare human vs agent KPIs under the same task, measure margin under varying contamination $\alpha$.

### Theoretical implications

Aggregate demand in the thesis is a **mixture** over human and agent types with contamination $\alpha$ plus noise $\epsilon_t$; see the mixture demand discussion in **Chapter 3 (Methodology)**. COI is defined as $\mathbb{E}[P]-\underline{p}$; the **COI framework** and theorem in the same chapter explain why saturated agent querying collapses extractable premium. Your business scenario determines which **actions** enter $\hat{q}$ and how interpretable $\alpha$ is for your traffic.

---

## 3. Environment and secrets

**Bootstrap files (from repo root):**

```bash
npm install
cp .env.example .env
cp .env.sweep.example .env.sweep
```

**Core `.env` (platform + web + docker):** See [`.env.example`](.env.example).
You must also set the variables called out in [`README.md`](README.md) for a full stack: `NEXT_PUBLIC_SUPABASE_URL`, `NEXT_PUBLIC_SUPABASE_ANON_KEY`, `AIRFLOW_FERNET_KEY`, `AIRFLOW_SECRET_KEY` (and provider ports per your compose file).

**Training / sweeps (`.env.sweep`):** Used by `make train`, `make benchmark`, and sweep agents. Typically `WANDB_API_KEY`, optional `WANDB_ENTITY` / `WANDB_PROJECT`, `GITHUB_TOKEN` for bootstrap flows, `SWEEP_ID` for W&B sweep workers. See [`.env.sweep.example`](.env.sweep.example).

**Security:** Never commit real `.env` or `.env.sweep` files. Rotate keys if they leak.

### Theoretical implications

Splitting **online platform credentials** (ingestion, catalog, Kafka) from **offline training credentials** (W&B, cloud TPUs, GitHub tokens for workers) mirrors the **hybrid Kappa–Lambda** data loop in the thesis: streaming observation vs batch / long-running training jobs. That split is named in the **Terminology** appendix of the thesis PDF.

---

## 4. Bring-up (commands)

Aligned with [`README.md`](README.md):

```bash
npm install
cp .env.example .env
cp .env.sweep.example .env.sweep
# edit .env: Supabase, Airflow keys, etc.
make platform.up
make web.dev
```

**Sanity checks:**

| Endpoint | Role |
| --- | --- |
| `http://localhost:3000` | Next.js storefront |
| `http://localhost:5000/health` | Backend ingest API |
| `http://localhost:5001/health` | Pricing provider |
| `http://localhost:8085` | Airflow UI (default compose port) |
| `http://localhost:8084` or configured `REDPANDA_CONSOLE_PORT` | Kafka console (see your `.env`) |

**Optional tests:** `make test.backend` (with venv/tooling as in Makefile); `make test.e2e` requires backend, web, and Airflow up per README.

### Theoretical implications

A correctly wired stack logs **trajectories** $\tau_s$ (sequences of events) and **price exposure** together.
**Chapter 3** defines events $e_{s,k}=(a,i,t)$ and proxies $\hat{q}$ from weighted actions. Without joint logging of behavior and quotes, you cannot recover the objects the theory reasons about (Problem Formalization).

---

## 5. Service map

```mermaid
flowchart LR
  U[Human / Agent Browser] --> W[Next.js Web App]
  W -->|Price requests| P[Pricing Provider]
  W -->|Interaction events| B[Backend Ingest API]
  B --> K[Kafka]
  K --> A[Airflow + Worker Jobs]
  A --> R[Redis Model Registry]
  P -->|Session/global prices| W
  E[Research Engine + Experiments] --> A
  E --> R
```

**Ports (typical; confirm in `docker-compose` and `.env`):** `BACKEND_PORT` (5000), `PROVIDER_PORT` (5001), `KAFKA_PORT`, `REDIS_PORT`, Airflow `AIRFLOW_WEBSERVER_PORT` (8085 default), Redpanda console.

### Theoretical implications

The platform **observes** behavioral proxies and quoted prices, not the latent demand curve $d(p\mid\theta)$. The distinction between $\hat{q}$ and true demand is explicit in **Chapter 3**. Misattributing proxy noise to “true” elasticity breaks both estimation and any causal story about COI.

---

## 6. Tailoring to your business

**Storefront mode:** `STORE_MODE=hotel` or `airline` (see [`web/src/lib/config.ts`](web/src/lib/config.ts) and env). This switches catalog and UI, not the core ingestion pattern.

**API base / environment:** `NEXT_PUBLIC_API_BASE`, `NEXT_PUBLIC_APP_ENV` (validated in `config.ts`).

**Paths for data and runs:** Override with `PHANTOM_DATA_DIR`, `PHANTOM_SIM_RUNS_DIR`, `PHANTOM_MODEL_REGISTRY_DIR`, `PHANTOM_COLLECTED_DATA_DIR`, etc. ([`lib/config.py`](lib/config.py)).

**Honest scope:** A new vertical (custom product ontology, checkout rules, pricing rules) means **new UI, events, and possibly new reward features** in the engine. Budget engineering time; the repo is a research platform, not a turnkey SaaS skin for arbitrary catalogs without code changes.
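
Before launching a run, it can help to confirm which path overrides are actually in effect. The sketch below only reads the process environment; it does not import or reproduce `lib/config.py`, and the helper name `phantom_overrides` is hypothetical.

```python
import os


def phantom_overrides(environ=None, prefix="PHANTOM_"):
    """Return the PHANTOM_* variables set in the environment, sorted by name.

    Hypothetical helper for illustration; not part of the repo.
    """
    env = os.environ if environ is None else environ
    return dict(sorted((k, v) for k, v in env.items() if k.startswith(prefix)))


if __name__ == "__main__":
    overrides = phantom_overrides()
    if not overrides:
        print("No PHANTOM_* overrides set; defaults from lib/config.py apply.")
    for name, value in overrides.items():
        print(f"{name}={value}")
```

Recording this output alongside a run makes it easier to tell later which data and model directories a result actually came from.
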
### Theoretical implications

Transition kernels $\hat{\mathcal{T}}_H,\hat{\mathcal{T}}_A$ are estimated on a **finite action / state space** derived from your instrumentation. Changing catalog depth or event taxonomy changes the MDP state space; old kernel estimates are not portable. See the transition kernel discussion in **Chapter 3**.

---

## 7. Data collection and experiments

**Flow:** Browser → backend → **Kafka** → downstream consumers (Airflow DAGs, notebooks, ETL under `experiments/`). Ensure **session identity**, **item identifiers**, and **action types** are consistent enough to build trajectories.

**Weak labels:** The thesis discusses partitioning data into human vs agent subsets for MLE transition counts. In production you may only have heuristic labels; document bias explicitly.

### Theoretical implications

Distinguishability (sub-question SQ1 in the **Introduction**) asks whether $H$ vs $A$ is identifiable from behavior alone. Your labeling and experimental design determine whether $\Delta_H,\Delta_A$ and $f(\tau)$ are meaningful or dominated by noise. Symbols appear in the **Terminology** appendix ($\Delta_H,\Delta_A$, $f(\tau)$, contamination generator $\mathcal{G}(\alpha)$).

---

## 8. Transition kernels and agent scoring (theory → practice)

**Theory:** Sessions yield trajectories $\tau_s$. For each actor class $y\in\{H,A\}$, the thesis estimates a **Markov transition kernel** by counting transitions and normalizing (MLE):

$$
\hat{P}(s' \mid s) = \frac{N(s,s')}{\sum_k N(s,k)}
$$

Human and agent prototypes $\hat{\mathcal{T}}_H,\hat{\mathcal{T}}_A$ support comparing an empirical kernel from a partial trajectory to prototypes (e.g. KL-style divergences $\Delta_H,\Delta_A$) and mapping to a **weak agent probability** $f(\tau)$. See **Chapter 3** and the **Terminology** appendix.

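
As a minimal sketch of the counting estimator above (not the code in `engine/lib/coi.py`), the following Python builds $\hat{P}(s'\mid s)$ from a list of trajectories and compares a partial trajectory against two reference kernels with a smoothed KL-style divergence. The action names, the smoothing floor `eps`, and the logistic map from $\Delta_H-\Delta_A$ to a score are assumptions made for illustration; the thesis and `lib/agent_probability.py` define the actual mapping.

```python
import math
from collections import Counter, defaultdict


def mle_kernel(trajectories):
    """Estimate P(s' | s) = N(s, s') / sum_k N(s, k) from state sequences."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            counts[s][s_next] += 1
    return {
        s: {s_next: n / sum(row.values()) for s_next, n in row.items()}
        for s, row in counts.items()
    }


def kl_to_reference(empirical, reference, eps=1e-6):
    """Sum over source states of KL(empirical row || reference row).

    eps is an illustrative floor so transitions unseen in the reference
    do not produce infinities; it is not taken from the thesis.
    """
    total = 0.0
    for s, row in empirical.items():
        ref_row = reference.get(s, {})
        for s_next, p in row.items():
            q = max(ref_row.get(s_next, 0.0), eps)
            total += p * math.log(p / q)
    return total


def agent_score(partial_traj, kernel_h, kernel_a):
    """Hypothetical logistic map of Delta_H - Delta_A to a score in (0, 1)."""
    emp = mle_kernel([partial_traj])
    delta_h = kl_to_reference(emp, kernel_h)
    delta_a = kl_to_reference(emp, kernel_a)
    # tanh form of the logistic function; avoids overflow for large gaps.
    return 0.5 * (1.0 + math.tanh((delta_h - delta_a) / 2.0))


# Toy sessions with hypothetical action names.
human = [["view", "view", "cart", "buy"], ["view", "cart", "view", "buy"]]
agent = [["view", "price", "price", "price"], ["price", "price", "view", "price"]]
kernel_h, kernel_a = mle_kernel(human), mle_kernel(agent)
score = agent_score(["view", "price", "price"], kernel_h, kernel_a)
```

A trajectory that is far from the human prototype and close to the agent prototype pushes the score toward 1. With only a handful of transitions the estimate is dominated by the smoothing floor, which is one reason short partial trajectories yield only a weak label.
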

**Code:** [`engine/lib/coi.py`](engine/lib/coi.py) (`compute_agent_probability`: empirical transition counts vs human/agent reference dicts, KL-style terms, mapped via [`lib/agent_probability.py`](lib/agent_probability.py)).

**Optional narrative:** [`blog/02-behavioral-fingerprinting.md`](blog/02-behavioral-fingerprinting.md) walks a concrete study design (not required for operators).

### Theoretical implications

If reference kernels are fit on **stale** or **mislabeled** partitions, $\Delta_H-\Delta_A$ is not interpretable as distinguishability. Ground claims in SQ1 (**Introduction**) and the kernel subsection of **Chapter 3**.

---

## 9. Contamination generator $\mathcal{G}(\alpha)$

**Theory:** Given clean trajectories, $\mathcal{G}(\alpha)$ injects synthetic agent trajectories until the effective mixture reaches contamination $\alpha\in[0,1]$, defining training scenarios for robust policies (**Chapter 3**). Catalog-scale block expansion of kernels is discussed there with validation caveats; treat large product spaces as **research-grade** until your team signs off.

**Code:** [`engine/engine.py`](engine/engine.py): `MarketEngine` mixes human/agent demand, uses `get_adjusted_transitions` / `sample_behavior_from_transitions`, and `alpha` when combining actor types and building demand proxies (`estimate_demand`). This is the **simulator** path, not a drop-in replacement for your production database.

### Theoretical implications

$\alpha$ in mixture $Q(p)$ is **agentic demand contribution** in the formal model, not necessarily “bot share of page views” unless your instrumentation equates them. Mismeasuring $\alpha$ biases robust objectives tied to a fixed contamination level.

---

## 10. Training and evaluation: local workflow

**Environment:** Python venv via Nx (`make install` / `nx run research:install`). Training commands load `.env.sweep`.
```bash
make train LOCAL_TRAIN_ARGS='--algo ppo --total-timesteps 50000'
make benchmark LOCAL_BENCHMARK_ARGS='--tiers static,surge,linear,qtable,ppo --alpha-values 0.0,0.3 --episodes 3 --no-wandb'
make benchmark.simple
```

Entrypoints: [`engine/train.py`](engine/train.py), [`engine/benchmark.py`](engine/benchmark.py), [`engine/spec.py`](engine/spec.py) (Nx wraps these; see `project.json` / research targets).

**Artifacts:** [`lib/config.py`](lib/config.py): `PHANTOM_SIM_RUNS_DIR` (default `sim/rl/runs`), `PHANTOM_MODEL_REGISTRY_DIR`, etc.

**TensorBoard (optional):** [`docker-compose.yml`](docker-compose.yml) includes `tensorboard-rl` on host port **6007** (`./sim/rl/runs`) and `tensorboard-ml` on **6006** (`./experiments/ml/runs`).

### Theoretical implications

Local runs instantiate the **offline defense gym**: policies trained on simulator-induced distributions approximate the DR-RL narrative in **Chapter 3**, but hyperparameters ($\lambda$ on COI leakage, $\eta$ on UX, robust radius) change the effective ambiguity set. Cross-check `engine/` against the thesis before claiming figure-for-figure replication.

---

## 11. Training and evaluation: remote / scaled deployment

For **research at scale** (cloud quota and secrets required):

| Mechanism | Role |
| --- | --- |
| [`submit_ray_job.sh`](submit_ray_job.sh) | Ray jobs with `.env` injected; `RAY_MODE=single\|distributed\|benchmark\|sweep`. Set the script’s `ROOT` to your clone path. |
| `make tpu.ray.bootstrap` / `tpu.ray.*` | TPU Ray bootstrap (`TPU_CONF`, e.g. `tpu_orchestration/configs/v4_spot_us.conf`). |
| `make train.agent` / `make benchmark.agent` | W&B sweeps: `SWEEP_ID` in `.env.sweep`. |
| `make train.bootstrap` | Worker bootstrap: `REPO_URL`, `SWEEP_ID`, `GITHUB_TOKEN`. |
| `make docker.train.publish` | Trainer image (`TRAIN_IMAGE_REF` in Makefile). |

See `submit_ray_job.sh` for env vars (`WANDB_*`, `PHANTOM_*` TPU toggles).

### Theoretical implications

Distributed training does not change the **definitions** of the Stackelberg game or Wasserstein ambiguity; it changes compute and variance of empirical estimates. Align random seeds and data protocol across nodes or split results explicitly; otherwise you mix distributions in a way a single empirical law $\hat{P}_N$ in the thesis does not describe.

---

## 12. Evaluation, artifacts, and audit trail

**Benchmarks:** `make benchmark*` sweeps tiers and $\alpha$; CLI includes robustness knobs (see default `BENCHMARK_ARGS` in `submit_ray_job.sh`: `--robust-radius`, `--lambda-coi`, `--eta-ux`, etc.).

**Audit trail:** Store `git` SHA, CLI argv, non-secret `.env.sweep` keys, and W&B run IDs with published tables. For scientific claims, cite **Chapters 4–5 (Results, Discussion)** in the thesis PDF.

### Theoretical implications

Evaluation quality equals **simulator fidelity** plus **contamination modeling**. Separate theorem statements (assumption-based) from empirical curves (`engine`-dependent).

---

## 13. Operational suggestions

- **Staging:** Non-production namespaces; separate Kafka topics and Supabase projects where possible.
- **Rate limits / abuse:** Protect ingest endpoints; respect participant privacy.
- **Human vs agent sessions:** Comparable cohorts; record experimental condition in metadata.
- **Contracts:** `tests/e2e/` encodes minimal flows; use when APIs change.

### Theoretical implications

Non-stationary noise $\epsilon_t$ and drifting $\alpha$ confound benchmark interpretation. **Chapter 3** discusses mixture identification: isolate treatments when possible and document confounders when not.

---

## 14. Roadmap / gaps (honesty)

**Relatively turnkey:** Local dockerized stack, demo verticals, engine benchmarks, documented env and paths.

**Typically custom:** Production catalog without Supabase, identity/fraud layers, legal review of logging, Kafka/Airflow SLAs, hardening the pricing provider for real money.

**Thesis vs code:** The PDF is the **spec**; not every robustness term or large-catalog kernel construction is production-verified (see caveats in **Chapter 3**).

### Theoretical implications

Theorems in the thesis can be **stronger** than what observational firm logs support. The COI result assumes a clean experimental reading of the pricing policy; live market data may only support weaker claims.

---

## 15. Theory and thesis cross-references (quick index)

Use the **PDF table of contents** with these anchors:

| Topic | Thesis location |
| --- | --- |
| Research questions (margin, distinguishability, contamination, mitigation) | **Introduction** |
| Sessions, events, $\hat{q}$, mixture $Q(p)$, $\alpha$ | **Chapter 3**: Problem Formalization, mixture demand |
| COI definition and erosion theorem | **Chapter 3**: COI framework |
| Transition kernels, MLE, $\mathcal{G}(\alpha)$ | **Chapter 3** |
| DR-RL, ambiguity sets, Stackelberg | **Chapter 3** |
| Symbol glossary (COI leakage, $f(\tau)$, UX, surrogates) | **Appendix: Terminology** |
| Empirical results and limitations | **Chapters 4–5** |

---

## 16. Quick file index (code)

| File | Role |
| --- | --- |
| [`engine/lib/coi.py`](engine/lib/coi.py) | KL-style trajectory comparison; agent probability. |
| [`engine/engine.py`](engine/engine.py) | `MarketEngine`, mixture, demand proxy path. |
| [`lib/agent_probability.py`](lib/agent_probability.py) | Divergence → probability score. |
| [`lib/config.py`](lib/config.py) | Paths and ports for artifacts. |
| [`engine/train.py`](engine/train.py), [`engine/benchmark.py`](engine/benchmark.py) | CLI entrypoints. |
| [`tpu_orchestration/`](tpu_orchestration/) | TPU configs and helpers. |

You do **not** need a running storefront for many **offline** benchmarks if the research Python environment is installed; you **do** need aligned instrumentation to connect production trajectories to kernel estimation.
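
To make "aligned instrumentation" concrete, here is a minimal sketch of turning a flat event log into per-session trajectories suitable for transition counting. It assumes each event carries a session id plus the $(a, i, t)$ triple from Chapter 3; the field names (`session_id`, `action`, `item_id`, `ts`) are hypothetical and will differ from whatever your ingest schema uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical event record: session id plus (action, item, time)."""
    session_id: str
    action: str
    item_id: str
    ts: float


def build_trajectories(events: Iterable[Event]) -> Dict[str, List[Event]]:
    """Group events by session and order each group by timestamp."""
    by_session = defaultdict(list)
    for ev in events:
        by_session[ev.session_id].append(ev)
    return {sid: sorted(evs, key=lambda e: e.ts) for sid, evs in by_session.items()}


# Events may arrive out of order from the stream; sorting restores the sequence.
log = [
    Event("s1", "cart", "room-12", 3.0),
    Event("s2", "view", "room-07", 1.5),
    Event("s1", "view", "room-12", 1.0),
]
trajectories = build_trajectories(log)
actions_s1 = [ev.action for ev in trajectories["s1"]]
```

If session ids are unstable or timestamps come from unsynchronized clocks, the grouping and ordering steps silently produce wrong transitions, which is why consistent session identity matters before any kernel is fit.
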