adding the markdown to auto

2026-07-15 17:43:36 +00:00 · 2026-03-16 15:30:09 +01:00
parent 2adb4f07b4
commit 43b952cf2b
1 changed files with 125 additions and 0 deletions
--- a/paper/src/chapters/auto/whoclicked_dataset_card.md
+++ b/paper/src/chapters/auto/whoclicked_dataset_card.md
@@ -0,0 +1,125 @@
 ---
 pretty_name: whoclickedit
 license: mit
 language:
 - en
 task_categories:
 - tabular-classification
 task_ids:
 - tabular-multi-class-classification
 tags:
 - e-commerce
 - dynamic-pricing
 - behavioral-telemetry
 - human-vs-agent
 - session-data
 size_categories:
 - 1K<n<10K
 ---
 # Dataset Card for whoclickedit
 ## Dataset Summary
 whoclickedit is an event-level behavioral dataset for analyzing human versus agent interaction in dynamic pricing experiments.
 It merges interaction logs and price quote logs into one flat CSV (`whoclicked.csv`) with explicit labels for actor type.
 ## Dataset Snapshot
 - Rows: `3838`
 - Columns: `42`
 - Time range (UTC): `2025-12-05T09:43:31.301000+00:00` to `2026-02-28T19:32:06.444000+00:00`
 - Unique sessions by actor:
   - `agent`: 7
   - `human`: 25
 - Rows by actor:
   - `agent`: 3076
   - `human`: 762
 - Rows by record type:
   - `price_log`: 3331
   - `interaction`: 507
 - Rows by actor x record type:
   - `agent` / `interaction`: 197
   - `agent` / `price_log`: 2879
   - `human` / `interaction`: 310
   - `human` / `price_log`: 452
 - Store modes:
   - `hotel`: 3592
   - `airline`: 196
   - `shop`: 50
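
 The snapshot counts above can be recomputed from the CSV. The following is a minimal sketch using only the Python standard library; the helper names `load_rows` and `snapshot` are illustrative and not part of the PHANTOM code base, and the column names are taken from the Data Fields section.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read the flat CSV into a list of row dicts (all values stay strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def snapshot(rows):
    """Summarize row dicts along the same dimensions as the snapshot list."""
    rows = list(rows)
    sessions = {}
    for row in rows:
        # Collect distinct session IDs per actor type.
        sessions.setdefault(row["actor_type"], set()).add(row["sessionId"])
    return {
        "rows": len(rows),
        "sessions_by_actor": {actor: len(ids) for actor, ids in sessions.items()},
        "rows_by_actor": Counter(r["actor_type"] for r in rows),
        "rows_by_record_type": Counter(r["record_type"] for r in rows),
        "rows_by_actor_and_record_type": Counter(
            (r["actor_type"], r["record_type"]) for r in rows
        ),
        "rows_by_store_mode": Counter(r["storeMode"] for r in rows),
    }
```

 Calling `snapshot(load_rows("whoclicked.csv"))` should reproduce the figures listed above if the local file matches this release.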
 ## Source and Processing
 Data is collected from two local roots in the PHANTOM project:
 - `experiments/collected_data` (human sessions)
 - `experiments/agents/collected_data` (agent sessions)
 Each session folder contains:
 - `int.json` (interaction events)
 - `price.json` (price quote logs)
 The ETL does the following:
 - Normalizes both Kafka-envelope and flat payload formats
 - Flattens nested metadata fields into `metadata_*` columns
 - Preserves all raw rows (no deduplication)
 - Adds labels:
   - `actor_type` in `{human, agent}`
   - `is_agent` in `{0, 1}`
   - `record_type` in `{interaction, price_log}`
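
 The ETL steps above can be sketched roughly as follows. This is not the project's actual ETL code: the function name `flatten_event` is hypothetical, and the sketch assumes that a Kafka-envelope record keeps its payload under a `value` key and that nested metadata lives under a `metadata` key. Neither assumption is confirmed by this card.

```python
def flatten_event(raw, actor_type, record_type):
    """Turn one raw event dict into a flat row with label columns.

    Assumption: a Kafka-envelope record stores the payload as a dict under
    "value"; any other record is treated as an already-flat payload.
    """
    payload = raw["value"] if isinstance(raw.get("value"), dict) else raw
    row = {}
    for key, value in payload.items():
        if key == "metadata" and isinstance(value, dict):
            # One level of nested metadata becomes metadata_<name> columns.
            for meta_key, meta_value in value.items():
                row[f"metadata_{meta_key}"] = meta_value
        else:
            row[key] = value
    # Label columns described in the list above.
    row["actor_type"] = actor_type
    row["is_agent"] = 1 if actor_type == "agent" else 0
    row["record_type"] = record_type
    return row
```

 The sketch drops envelope-level fields for brevity, whereas the released CSV keeps them as `kafka_*` columns. Because no deduplication is applied, every raw event yields exactly one output row.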
 ## Data Fields
 Core fields used for modeling:
 - `actor_type`, `is_agent`, `record_type`
 - `sessionId`, `experimentId`, `storeMode`, `ts`
 - `eventName`, `page`, `productId`, `price`, `userAgent`
 Kafka provenance fields:
 - `kafka_partition_id`, `kafka_offset`, `kafka_timestamp_ms`, `kafka_compression`
 - `kafka_is_transactional`, `kafka_headers`, `kafka_key_*`, `kafka_value_*`
 Flattened metadata fields currently present:
 - `metadata_cabinClass`
 - `metadata_dateIndex`
 - `metadata_dwellTime`
 - `metadata_elementText`
 - `metadata_fareRule`
 - `metadata_flightType`
 - `metadata_itemCount`
 - `metadata_nights`
 - `metadata_price`
 - `metadata_referrer`
 - `metadata_roomType`
 - `metadata_total`
 - `metadata_type`
 Top interaction events:
 - `page_view`: 236
 - `learn_more_about_item`: 88
 - `view_item_page`: 85
 - `add_item_to_cart`: 46
 - `hover_over_title`: 23
 - `checkout_start`: 19
 - `hover_over_paragraph`: 6
 - `remove_item`: 4
 ## Intended Uses
 - Human-vs-agent traffic classification
 - Session-level behavioral modeling
 - Dynamic pricing robustness analysis under agent-mediated reconnaissance
 ## Out-of-Scope Uses
 - Identity inference or user-level profiling
 - Credit, employment, insurance, or legal decision making
 ## Data Splits
 No official train/validation/test split is provided in the current release.
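 Users should split by session (keeping each `sessionId` entirely within one split) or by time (training only on events earlier than those used for evaluation) to avoid leakage.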
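
 One possible session-aware split is sketched below. It is an illustration, not an official split: the function name `session_split`, the default `test_fraction`, and the choice to sample sessions separately per `actor_type` are all assumptions made here. Rows are expected as dicts with `sessionId` and `actor_type` keys, for example as read by `csv.DictReader`.

```python
import random


def session_split(rows, test_fraction=0.2, seed=0):
    """Split rows so that every session lands entirely in train or in test.

    Sessions are sampled separately for each actor type so that both
    actors appear in the test set even when one has few sessions.
    """
    rows = list(rows)
    sessions_by_actor = {}
    for row in rows:
        sessions_by_actor.setdefault(row["actor_type"], set()).add(row["sessionId"])

    rng = random.Random(seed)
    test_ids = set()
    for actor in sorted(sessions_by_actor):
        ids = sorted(sessions_by_actor[actor])  # sort first for reproducibility
        rng.shuffle(ids)
        n_test = max(1, round(len(ids) * test_fraction))
        test_ids.update(ids[:n_test])

    train = [r for r in rows if r["sessionId"] not in test_ids]
    test = [r for r in rows if r["sessionId"] in test_ids]
    return train, test
```

 Splitting on session IDs rather than rows matters here because a single session contributes many correlated rows, especially `price_log` records. An actor type with only one session would end up entirely in the test set under this sketch.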
 ## Privacy and Sensitive Content
 - `userAgent` and `metadata_referrer` can be quasi-identifying in small samples; this release covers only 32 sessions.
 - Use care before publishing derived artifacts that can re-identify participants.
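
 One way to reduce exposure before sharing derived artifacts is to replace the quasi-identifying columns with salted hashes. The sketch below is an assumption about a reasonable approach, not a step the PHANTOM pipeline is known to perform; `pseudonymize` and `QUASI_IDENTIFIERS` are hypothetical names.

```python
import hashlib

# Columns this card flags as potentially quasi-identifying.
QUASI_IDENTIFIERS = ("userAgent", "metadata_referrer")


def pseudonymize(row, salt, columns=QUASI_IDENTIFIERS):
    """Return a copy of the row with the given columns replaced by salted hashes.

    Equal inputs map to equal tokens, so the columns remain usable as
    categorical features. Empty or missing values are left untouched.
    """
    cleaned = dict(row)
    for column in columns:
        value = cleaned.get(column)
        if value:
            digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
            cleaned[column] = digest[:16]
    return cleaned
```

 Hashing only pseudonymizes the values. Rare combinations of the remaining columns can still single out a session, so the salt should stay private and the output should still be reviewed before release.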
 ## Limitations
 - Data is generated in a controlled experiment platform, not a full production marketplace.
 - Agent traffic reflects only the tasking and browser automation setup configured for the current release (7 agent sessions).
 - Coverage is heavily skewed toward `hotel` (3592 rows) over `airline` (196 rows) and `shop` (50 rows) in the current release.
 ## Citation
 If you use this dataset, cite the PHANTOM thesis project and link this dataset page.