chore: updating dataset card with relevant updates and data

2026-07-15 17:43:36 +00:00 · 2026-03-23 13:28:31 +01:00
parent f70c51f223
commit 8706072966
2 changed files with 276 additions and 139 deletions
--- a/paper/src/chapters/auto/whoclicked_dataset_card.md
+++ b/paper/src/chapters/auto/whoclicked_dataset_card.md
@@ -17,64 +17,107 @@ size_categories:
 - 1K<n<10K
 ---
-# Dataset Card for whoclickedit
+<img align="right" width="280" src="https://raw.githubusercontent.com/velocitatem/PHANTOM/main/docs/static/images/banner.svg" alt="PHANTOM research banner" />
-## Dataset Summary
+# [whoclickedit](https://huggingface.co/datasets/velocitatem/whoclickedit)
 whoclickedit is an event-level behavioral dataset for human versus agent interaction analysis in dynamic pricing experiments.
 It merges interaction logs and price quote logs into one flat CSV (`whoclicked.csv`) with explicit labels for actor type.
-## Dataset Snapshot
+[![Dataset on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/dataset-on-hf-sm.svg)](https://huggingface.co/datasets/velocitatem/whoclickedit)
- Rows: `3838`
+![Rows](https://img.shields.io/badge/Rows-3874-0A9396?style=flat-square)
- Columns: `42`
+![Columns](https://img.shields.io/badge/Columns-42-005F73?style=flat-square)
- Time range (UTC): `2025-12-05T09:43:31.301000+00:00` to `2026-02-28T19:32:06.444000+00:00`
+![Sessions](https://img.shields.io/badge/Sessions-36-1D3557?style=flat-square)
- Unique sessions by actor:
+![Human rows](https://img.shields.io/badge/Human%20rows-798-2A9D8F?style=flat-square)
- `agent`: 7
+![Agent rows](https://img.shields.io/badge/Agent%20rows-3076-E76F51?style=flat-square)
- `human`: 25
+![License](https://img.shields.io/badge/License-MIT-111827?style=flat-square)
- Rows by actor:
+
- `agent`: 3076
+> **Event-level behavior data for dynamic pricing research.**
- `human`: 762
+> This dataset captures how humans and automated agents browse, query prices, and move through the PHANTOM storefronts during controlled experiments.
- Rows by record type:
+
- `price_log`: 3331
+## What this dataset gives you
- `interaction`: 507
+
- Rows by actor x record type:
+- A single flat file (`whoclicked.csv`) with both interaction and price-log events.
- `agent` / `interaction`: 197
+- Explicit labels for actor origin: `actor_type` and `is_agent`.
- `agent` / `price_log`: 2879
+- Provenance fields from Kafka envelopes when available.
- `human` / `interaction`: 310
+- Metadata flattened into feature-ready `metadata_*` columns.
- `human` / `price_log`: 452
+
- Store modes:
+## Snapshot
- `hotel`: 3592
+
- `airline`: 196
+| Metric | Value |
- `shop`: 50
+| --- | --- |
 | Rows | `3874` |
 | Columns | `42` |
 | Time range (UTC) | `2025-12-05T09:43:31.301000+00:00` -> `2026-03-23T12:08:30.151000+00:00` |
 | Unique sessions | `36` |
 ## Composition
 ### Rows by actor
 | Actor | Rows | Share |
 | --- | --- | --- |
 | `human` | 798 | 20.6% |
 | `agent` | 3076 | 79.4% |
 ### Rows by actor and record type
 | Actor | Record type | Rows |
 | --- | --- | --- |
 | `agent` | `interaction` | 197 |
 | `agent` | `price_log` | 2879 |
 | `human` | `interaction` | 328 |
 | `human` | `price_log` | 470 |
 ### Store mode coverage
 | Store mode | Rows |
 | --- | --- |
 | `hotel` | 3628 |
 | `airline` | 196 |
 | `shop` | 50 |
 ### Top interaction events
 | Interaction event | Count |
 | --- | --- |
 | `page_view` | 246 |
 | `learn_more_about_item` | 91 |
 | `view_item_page` | 88 |
 | `add_item_to_cart` | 47 |
 | `hover_over_title` | 23 |
 | `checkout_start` | 20 |
 | `hover_over_paragraph` | 6 |
 | `remove_item` | 4 |
-## Source and Processing
+## Collection pipeline
-Data is collected from two local roots in the PHANTOM project:
+Data is sourced from two roots inside PHANTOM:
 - `experiments/collected_data` (human sessions)
 - `experiments/agents/collected_data` (agent sessions)
-Each session folder contains:
+Each session directory contains:
 - `int.json` (interaction events)
 - `price.json` (price quote logs)
-The ETL does the following:
+- `int.json`: user interaction events
- Normalizes both Kafka-envelope and flat payload formats
+- `price.json`: price quote observations
- Flattens nested metadata fields into `metadata_*` columns
+
- Preserves all raw rows (no deduplication)
+ETL behavior:
- Adds labels:
+
-  - `actor_type` in `{human, agent}`
+1. Accepts both Kafka-envelope records and flat payload records.
-  - `is_agent` in `{0, 1}`
+2. Flattens nested JSON to a tabular schema.
-  - `record_type` in `{interaction, price_log}`
+3. Preserves row-level provenance (`source_session_dir`, `source_row_index`, topic fields).
 4. Adds modeling labels (`actor_type`, `is_agent`, `record_type`).
-## Data Fields
+## Schema highlights
-Core fields used for modeling:
+Core modeling fields:
 - `actor_type`, `is_agent`, `record_type`
 - `sessionId`, `experimentId`, `storeMode`, `ts`
 - `eventName`, `page`, `productId`, `price`, `userAgent`
 Kafka provenance fields:
 - `kafka_partition_id`, `kafka_offset`, `kafka_timestamp_ms`, `kafka_compression`
 - `kafka_is_transactional`, `kafka_headers`, `kafka_key_*`, `kafka_value_*`
-Flattened metadata fields currently present:
+<details>
 <summary>Metadata columns in this release</summary>
 - `metadata_cabinClass`
 - `metadata_dateIndex`
 - `metadata_dwellTime`
@@ -89,37 +132,34 @@ Flattened metadata fields currently present:
 - `metadata_total`
 - `metadata_type`
-Top interaction events:
+</details>
 - `page_view`: 236
 - `learn_more_about_item`: 88
 - `view_item_page`: 85
 - `add_item_to_cart`: 46
 - `hover_over_title`: 23
 - `checkout_start`: 19
 - `hover_over_paragraph`: 6
 - `remove_item`: 4
-## Intended Uses
+## Quick start
 - Human-vs-agent traffic classification
 - Session-level behavioral modeling
 - Dynamic pricing robustness analysis under agent-mediated reconnaissance
-## Out-of-Scope Uses
+```python
- Identity inference or user-level profiling
+from datasets import load_dataset
 - Credit, employment, insurance, or legal decision making
-## Data Splits
+ds = load_dataset("velocitatem/whoclickedit")
-No official train/validation/test split is provided in the current release.
+```
 Users should create time-aware or session-aware splits to avoid leakage.
-## Privacy and Sensitive Content
+Recommended split strategy:
 - `userAgent` and referrer metadata can be quasi-identifying in small samples.
 - Use care before publishing derived artifacts that can re-identify participants.
-## Limitations
+- Prefer session-aware or time-aware splits.
- Data is generated in a controlled experiment platform, not a full production marketplace.
+- Do not split rows from the same `sessionId` across train and test.
- Agent traffic currently reflects the configured tasking and browser automation setup.
+
- Coverage is stronger for `hotel` than `airline` in the current release.
+## Intended use
 - Human-vs-agent behavior classification.
 - Session-level telemetry modeling for dynamic pricing defenses.
 - Robustness experiments under agent-mediated reconnaissance.
 ## Safety and limitations
 - `userAgent` and referrer metadata can be quasi-identifying in very small samples.
 - Data comes from a controlled research platform, not a full production marketplace.
 - Current release has stronger coverage for `hotel` flows than `airline` flows.
 ## Citation
-If you use this dataset, cite the PHANTOM thesis project and link this dataset page.
+
 If you use this dataset, cite the PHANTOM thesis project and link this page:
 `https://huggingface.co/datasets/velocitatem/whoclickedit`
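
Note on the "Recommended split strategy" bullets in the card above: keeping every `sessionId` on one side of the split can be sketched roughly as follows. This is a minimal illustration and not part of the commit; `session_aware_split`, the 20% test fraction, and the toy rows are assumptions made here, and only the `sessionId` and `actor_type` column names come from the card.

```python
import random
from collections import defaultdict


def session_aware_split(rows, test_fraction=0.2, seed=0, key="sessionId"):
    """Split rows so that no session appears in both train and test.

    Hypothetical helper for illustration; not part of the PHANTOM codebase.
    """
    # Group rows by session so whole sessions move together.
    by_session = defaultdict(list)
    for row in rows:
        by_session[row[key]].append(row)

    # Shuffle session ids (not individual rows) with a fixed seed.
    session_ids = sorted(by_session)
    random.Random(seed).shuffle(session_ids)

    # Hold out at least one session for testing.
    n_test = max(1, round(len(session_ids) * test_fraction))
    test_ids = set(session_ids[:n_test])

    train = [r for sid in session_ids[n_test:] for r in by_session[sid]]
    test = [r for sid in session_ids[:n_test] for r in by_session[sid]]
    return train, test, test_ids


# Toy rows standing in for records loaded from whoclicked.csv.
rows = [
    {"sessionId": f"s{i % 5}", "actor_type": "agent" if i % 5 == 0 else "human"}
    for i in range(20)
]
train, test, test_ids = session_aware_split(rows)

# No session id is shared between the two sides.
train_sessions = {r["sessionId"] for r in train}
test_sessions = {r["sessionId"] for r in test}
print(train_sessions.isdisjoint(test_sessions))
```

With only a few dozen sessions in the release, a purely random session split may leave one actor type missing from the test side; shuffling human and agent session ids separately is one possible refinement.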
--- a/scripts/whoclicked_card.py
+++ b/scripts/whoclicked_card.py
@@ -8,6 +8,7 @@ import os
 import sys
 from pathlib import Path
 from typing import Any
 from urllib.parse import quote
 import pandas as pd
 from huggingface_hub import HfApi
@@ -93,6 +94,28 @@ def _time_range(df: pd.DataFrame) -> tuple[str, str]:
    return ts.min().isoformat(), ts.max().isoformat()
 def _badge(label: str, value: str, color: str, logo: str | None = None) -> str:
    encoded_label = quote(label, safe="")
    encoded_value = quote(value, safe="")
    base = (
        "https://img.shields.io/badge/"
        f"{encoded_label}-{encoded_value}-{color}?style=flat-square"
    )
    if logo:
        base = f"{base}&logo={quote(logo, safe='')}&logoColor=white"
    return f"![{label}]({base})"
 def _md_table(headers: list[str], rows: list[list[str]]) -> str:
    header = f"| {' | '.join(headers)} |"
    separator = f"| {' | '.join('---' for _ in headers)} |"
    if not rows:
        empty = f"| {' | '.join('n/a' for _ in headers)} |"
        return "\n".join([header, separator, empty])
    body = "\n".join(f"| {' | '.join(row)} |" for row in rows)
    return "\n".join([header, separator, body])
 def _render_card(df: pd.DataFrame) -> str:
    total_rows = len(df)
    total_cols = len(df.columns)
@@ -112,31 +135,76 @@ def _render_card(df: pd.DataFrame) -> str:
    metadata_cols = sorted(c for c in df.columns if c.startswith("metadata_"))
-    actor_lines = (
+    total_sessions = sum(session_counts.values())
-        "\n".join(f"- `{k}`: {v}" for k, v in actor_counts.items()) or "- none"
+    human_rows = actor_counts.get("human", 0)
    agent_rows = actor_counts.get("agent", 0)
    top_events = list(event_counts.items())[:10]
    snapshot_table = _md_table(
        ["Metric", "Value"],
        [
            ["Rows", f"`{total_rows}`"],
            ["Columns", f"`{total_cols}`"],
            ["Time range (UTC)", f"`{t_min}` -> `{t_max}`"],
            ["Unique sessions", f"`{total_sessions}`"],
        ],
    )
-    record_lines = (
+
-        "\n".join(f"- `{k}`: {v}" for k, v in record_counts.items()) or "- none"
+    actor_table = _md_table(
        ["Actor", "Rows", "Share"],
        [
            [
                "`human`",
                str(human_rows),
                f"{(human_rows / total_rows * 100):.1f}%" if total_rows else "0.0%",
            ],
            [
                "`agent`",
                str(agent_rows),
                f"{(agent_rows / total_rows * 100):.1f}%" if total_rows else "0.0%",
            ],
        ],
    )
-    pair_lines = (
+
-        "\n".join(
+    pair_table = _md_table(
-            f"- `{a}` / `{r}`: {n}"
+        ["Actor", "Record type", "Rows"],
-            for (a, r), n in sorted(
+        [
            [f"`{actor}`", f"`{record}`", str(n)]
            for (actor, record), n in sorted(
                by_actor_record.items(), key=lambda x: (x[0][0], x[0][1])
            )
        ],
    )
-        or "- none"
+
    store_table = _md_table(
        ["Store mode", "Rows"],
        [
            [f"`{mode}`", str(n)]
            for mode, n in sorted(
                store_counts.items(), key=lambda x: x[1], reverse=True
            )
-    store_lines = (
-        "\n".join(f"- `{k}`: {v}" for k, v in store_counts.items()) or "- none"
+        ],
    )
-    session_lines = (
+
-        "\n".join(f"- `{k}`: {v}" for k, v in session_counts.items()) or "- none"
+    event_table = _md_table(
        ["Interaction event", "Count"],
        [[f"`{name}`", str(n)] for name, n in top_events],
    )
-    top_events = list(event_counts.items())[:10]
+
    event_lines = "\n".join(f"- `{k}`: {v}" for k, v in top_events) or "- none"
    metadata_lines = "\n".join(f"- `{c}`" for c in metadata_cols) or "- none"
    dataset_badge = (
        "[![Dataset on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/"
        "dataset-on-hf-sm.svg)](https://huggingface.co/datasets/velocitatem/whoclickedit)"
    )
    rows_badge = _badge("Rows", str(total_rows), "0A9396")
    cols_badge = _badge("Columns", str(total_cols), "005F73")
    sessions_badge = _badge("Sessions", str(total_sessions), "1D3557")
    human_badge = _badge("Human rows", str(human_rows), "2A9D8F")
    agent_badge = _badge("Agent rows", str(agent_rows), "E76F51")
    license_badge = _badge("License", "MIT", "111827")
    return f"""---
 pretty_name: whoclickedit
 license: mit
@@ -156,85 +224,114 @@ size_categories:
 - {size_cat}
 ---
-# Dataset Card for whoclickedit
+<img align="right" width="280" src="https://raw.githubusercontent.com/velocitatem/PHANTOM/main/docs/static/images/banner.svg" alt="PHANTOM research banner" />
-## Dataset Summary
+# [whoclickedit](https://huggingface.co/datasets/velocitatem/whoclickedit)
 whoclickedit is an event-level behavioral dataset for human versus agent interaction analysis in dynamic pricing experiments.
 It merges interaction logs and price quote logs into one flat CSV (`whoclicked.csv`) with explicit labels for actor type.
-## Dataset Snapshot
+{dataset_badge}
- Rows: `{total_rows}`
+{rows_badge}
- Columns: `{total_cols}`
+{cols_badge}
- Time range (UTC): `{t_min}` to `{t_max}`
+{sessions_badge}
- Unique sessions by actor:
+{human_badge}
-{session_lines}
+{agent_badge}
- Rows by actor:
+{license_badge}
-{actor_lines}
+
- Rows by record type:
+> **Event-level behavior data for dynamic pricing research.**
-{record_lines}
+> This dataset captures how humans and automated agents browse, query prices, and move through the PHANTOM storefronts during controlled experiments.
- Rows by actor x record type:
+
-{pair_lines}
+## What this dataset gives you
- Store modes:
+
-{store_lines}
+- A single flat file (`whoclicked.csv`) with both interaction and price-log events.
 - Explicit labels for actor origin: `actor_type` and `is_agent`.
 - Provenance fields from Kafka envelopes when available.
 - Metadata flattened into feature-ready `metadata_*` columns.
 ## Snapshot
 {snapshot_table}
 ## Composition
 ### Rows by actor
 {actor_table}
 ### Rows by actor and record type
 {pair_table}
 ### Store mode coverage
 {store_table}
 ### Top interaction events
 {event_table}
-## Source and Processing
+## Collection pipeline
-Data is collected from two local roots in the PHANTOM project:
+Data is sourced from two roots inside PHANTOM:
 - `experiments/collected_data` (human sessions)
 - `experiments/agents/collected_data` (agent sessions)
-Each session folder contains:
+Each session directory contains:
 - `int.json` (interaction events)
 - `price.json` (price quote logs)
-The ETL does the following:
+- `int.json`: user interaction events
- Normalizes both Kafka-envelope and flat payload formats
+- `price.json`: price quote observations
- Flattens nested metadata fields into `metadata_*` columns
+
- Preserves all raw rows (no deduplication)
+ETL behavior:
- Adds labels:
+
-  - `actor_type` in `{{human, agent}}`
+1. Accepts both Kafka-envelope records and flat payload records.
-  - `is_agent` in `{{0, 1}}`
+2. Flattens nested JSON to a tabular schema.
-  - `record_type` in `{{interaction, price_log}}`
+3. Preserves row-level provenance (`source_session_dir`, `source_row_index`, topic fields).
 4. Adds modeling labels (`actor_type`, `is_agent`, `record_type`).
-## Data Fields
+## Schema highlights
-Core fields used for modeling:
+Core modeling fields:
 - `actor_type`, `is_agent`, `record_type`
 - `sessionId`, `experimentId`, `storeMode`, `ts`
 - `eventName`, `page`, `productId`, `price`, `userAgent`
 Kafka provenance fields:
 - `kafka_partition_id`, `kafka_offset`, `kafka_timestamp_ms`, `kafka_compression`
 - `kafka_is_transactional`, `kafka_headers`, `kafka_key_*`, `kafka_value_*`
-Flattened metadata fields currently present:
+<details>
 <summary>Metadata columns in this release</summary>
 {metadata_lines}
-Top interaction events:
+</details>
 {event_lines}
-## Intended Uses
+## Quick start
 - Human-vs-agent traffic classification
 - Session-level behavioral modeling
 - Dynamic pricing robustness analysis under agent-mediated reconnaissance
-## Out-of-Scope Uses
+```python
- Identity inference or user-level profiling
+from datasets import load_dataset
 - Credit, employment, insurance, or legal decision making
-## Data Splits
+ds = load_dataset("velocitatem/whoclickedit")
-No official train/validation/test split is provided in the current release.
+```
 Users should create time-aware or session-aware splits to avoid leakage.
-## Privacy and Sensitive Content
+Recommended split strategy:
 - `userAgent` and referrer metadata can be quasi-identifying in small samples.
 - Use care before publishing derived artifacts that can re-identify participants.
-## Limitations
+- Prefer session-aware or time-aware splits.
- Data is generated in a controlled experiment platform, not a full production marketplace.
+- Do not split rows from the same `sessionId` across train and test.
- Agent traffic currently reflects the configured tasking and browser automation setup.
+
- Coverage is stronger for `hotel` than `airline` in the current release.
+## Intended use
 - Human-vs-agent behavior classification.
 - Session-level telemetry modeling for dynamic pricing defenses.
 - Robustness experiments under agent-mediated reconnaissance.
 ## Safety and limitations
 - `userAgent` and referrer metadata can be quasi-identifying in very small samples.
 - Data comes from a controlled research platform, not a full production marketplace.
 - Current release has stronger coverage for `hotel` flows than `airline` flows.
 ## Citation
-If you use this dataset, cite the PHANTOM thesis project and link this dataset page.
+
 If you use this dataset, cite the PHANTOM thesis project and link this page:
 `https://huggingface.co/datasets/velocitatem/whoclickedit`
 """