chore: update dataset card with relevant updates and data

2026-03-23 13:28:31 +01:00
parent f70c51f223
commit 8706072966
2 changed files with 276 additions and 139 deletions


@@ -17,64 +17,107 @@ size_categories:
- 1K<n<10K - 1K<n<10K
--- ---
# Dataset Card for whoclickedit <img align="right" width="280" src="https://raw.githubusercontent.com/velocitatem/PHANTOM/main/docs/static/images/banner.svg" alt="PHANTOM research banner" />
## Dataset Summary # [whoclickedit](https://huggingface.co/datasets/velocitatem/whoclickedit)
whoclickedit is an event-level behavioral dataset for human versus agent interaction analysis in dynamic pricing experiments.
It merges interaction logs and price quote logs into one flat CSV (`whoclicked.csv`) with explicit labels for actor type.
## Dataset Snapshot [![Dataset on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/dataset-on-hf-sm.svg)](https://huggingface.co/datasets/velocitatem/whoclickedit)
- Rows: `3838` ![Rows](https://img.shields.io/badge/Rows-3874-0A9396?style=flat-square)
- Columns: `42` ![Columns](https://img.shields.io/badge/Columns-42-005F73?style=flat-square)
- Time range (UTC): `2025-12-05T09:43:31.301000+00:00` to `2026-02-28T19:32:06.444000+00:00` ![Sessions](https://img.shields.io/badge/Sessions-36-1D3557?style=flat-square)
- Unique sessions by actor: ![Human rows](https://img.shields.io/badge/Human%20rows-798-2A9D8F?style=flat-square)
- `agent`: 7 ![Agent rows](https://img.shields.io/badge/Agent%20rows-3076-E76F51?style=flat-square)
- `human`: 25 ![License](https://img.shields.io/badge/License-MIT-111827?style=flat-square)
- Rows by actor:
- `agent`: 3076 > **Event-level behavior data for dynamic pricing research.**
- `human`: 762 > This dataset captures how humans and automated agents browse, query prices, and move through the PHANTOM storefronts during controlled experiments.
- Rows by record type:
- `price_log`: 3331 ## What this dataset gives you
- `interaction`: 507
- Rows by actor x record type: - A single flat file (`whoclicked.csv`) with both interaction and price-log events.
- `agent` / `interaction`: 197 - Explicit labels for actor origin: `actor_type` and `is_agent`.
- `agent` / `price_log`: 2879 - Provenance fields from Kafka envelopes when available.
- `human` / `interaction`: 310 - Metadata flattened into feature-ready `metadata_*` columns.
- `human` / `price_log`: 452
- Store modes: ## Snapshot
- `hotel`: 3592
- `airline`: 196 | Metric | Value |
- `shop`: 50 | --- | --- |
| Rows | `3874` |
| Columns | `42` |
| Time range (UTC) | `2025-12-05T09:43:31.301000+00:00` -> `2026-03-23T12:08:30.151000+00:00` |
| Unique sessions | `36` |
## Composition
### Rows by actor
| Actor | Rows | Share |
| --- | --- | --- |
| `human` | 798 | 20.6% |
| `agent` | 3076 | 79.4% |
### Rows by actor and record type
| Actor | Record type | Rows |
| --- | --- | --- |
| `agent` | `interaction` | 197 |
| `agent` | `price_log` | 2879 |
| `human` | `interaction` | 328 |
| `human` | `price_log` | 470 |
### Store mode coverage
| Store mode | Rows |
| --- | --- |
| `hotel` | 3628 |
| `airline` | 196 |
| `shop` | 50 |
### Top interaction events
| Interaction event | Count |
| --- | --- |
| `page_view` | 246 |
| `learn_more_about_item` | 91 |
| `view_item_page` | 88 |
| `add_item_to_cart` | 47 |
| `hover_over_title` | 23 |
| `checkout_start` | 20 |
| `hover_over_paragraph` | 6 |
| `remove_item` | 4 |
## Collection pipeline
Data is sourced from two roots inside PHANTOM:
## Source and Processing
Data is collected from two local roots in the PHANTOM project:
- `experiments/collected_data` (human sessions) - `experiments/collected_data` (human sessions)
- `experiments/agents/collected_data` (agent sessions) - `experiments/agents/collected_data` (agent sessions)
Each session folder contains: Each session directory contains:
- `int.json` (interaction events)
- `price.json` (price quote logs)
The ETL does the following: - `int.json`: user interaction events
- Normalizes both Kafka-envelope and flat payload formats - `price.json`: price quote observations
- Flattens nested metadata fields into `metadata_*` columns
- Preserves all raw rows (no deduplication) ETL behavior:
- Adds labels:
- `actor_type` in `{human, agent}` 1. Accepts both Kafka-envelope records and flat payload records.
- `is_agent` in `{0, 1}` 2. Flattens nested JSON to a tabular schema.
- `record_type` in `{interaction, price_log}` 3. Preserves row-level provenance (`source_session_dir`, `source_row_index`, topic fields).
4. Adds modeling labels (`actor_type`, `is_agent`, `record_type`).
## Schema highlights
Core modeling fields:
## Data Fields
Core fields used for modeling:
- `actor_type`, `is_agent`, `record_type` - `actor_type`, `is_agent`, `record_type`
- `sessionId`, `experimentId`, `storeMode`, `ts` - `sessionId`, `experimentId`, `storeMode`, `ts`
- `eventName`, `page`, `productId`, `price`, `userAgent` - `eventName`, `page`, `productId`, `price`, `userAgent`
Kafka provenance fields: Kafka provenance fields:
- `kafka_partition_id`, `kafka_offset`, `kafka_timestamp_ms`, `kafka_compression` - `kafka_partition_id`, `kafka_offset`, `kafka_timestamp_ms`, `kafka_compression`
- `kafka_is_transactional`, `kafka_headers`, `kafka_key_*`, `kafka_value_*` - `kafka_is_transactional`, `kafka_headers`, `kafka_key_*`, `kafka_value_*`
Flattened metadata fields currently present: <details>
<summary>Metadata columns in this release</summary>
- `metadata_cabinClass` - `metadata_cabinClass`
- `metadata_dateIndex` - `metadata_dateIndex`
- `metadata_dwellTime` - `metadata_dwellTime`
@@ -89,37 +132,34 @@ Flattened metadata fields currently present:
- `metadata_total` - `metadata_total`
- `metadata_type` - `metadata_type`
Top interaction events: </details>
- `page_view`: 236
- `learn_more_about_item`: 88
- `view_item_page`: 85
- `add_item_to_cart`: 46
- `hover_over_title`: 23
- `checkout_start`: 19
- `hover_over_paragraph`: 6
- `remove_item`: 4
## Intended Uses ## Quick start
- Human-vs-agent traffic classification
- Session-level behavioral modeling
- Dynamic pricing robustness analysis under agent-mediated reconnaissance
## Out-of-Scope Uses ```python
- Identity inference or user-level profiling from datasets import load_dataset
- Credit, employment, insurance, or legal decision making
## Data Splits ds = load_dataset("velocitatem/whoclickedit")
No official train/validation/test split is provided in the current release. ```
Users should create time-aware or session-aware splits to avoid leakage.
## Privacy and Sensitive Content Recommended split strategy:
- `userAgent` and referrer metadata can be quasi-identifying in small samples.
- Use care before publishing derived artifacts that can re-identify participants.
## Limitations - Prefer session-aware or time-aware splits.
- Data is generated in a controlled experiment platform, not a full production marketplace. - Do not split rows from the same `sessionId` across train and test.
- Agent traffic currently reflects the configured tasking and browser automation setup.
- Coverage is stronger for `hotel` than `airline` in the current release. ## Intended use
- Human-vs-agent behavior classification.
- Session-level telemetry modeling for dynamic pricing defenses.
- Robustness experiments under agent-mediated reconnaissance.
## Safety and limitations
- `userAgent` and referrer metadata can be quasi-identifying in very small samples.
- Data comes from a controlled research platform, not a full production marketplace.
- Current release has stronger coverage for `hotel` flows than `airline` flows.
## Citation ## Citation
If you use this dataset, cite the PHANTOM thesis project and link this dataset page.
If you use this dataset, cite the PHANTOM thesis project and link this page:
`https://huggingface.co/datasets/velocitatem/whoclickedit`
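Note on the card's "Recommended split strategy": the card asks that rows from the same `sessionId` never land on both sides of a split, but does not show how. The following is a minimal sketch of one way to do that with the standard library only. The function name `session_aware_split`, its parameters, and the toy rows are assumptions for illustration and are not part of the commit or the dataset tooling.

```python
import random
from collections import defaultdict


def session_aware_split(rows, test_fraction=0.2, seed=0, key="sessionId"):
    """Split rows into (train, test) so that no session spans both sides."""
    # Group rows by their session id.
    by_session = defaultdict(list)
    for row in rows:
        by_session[row[key]].append(row)

    # Shuffle session ids (not rows) so whole sessions move together.
    session_ids = sorted(by_session)
    random.Random(seed).shuffle(session_ids)

    # Hold out at least one session when any sessions exist.
    n_test = max(1, round(len(session_ids) * test_fraction)) if session_ids else 0

    test = [r for sid in session_ids[:n_test] for r in by_session[sid]]
    train = [r for sid in session_ids[n_test:] for r in by_session[sid]]
    return train, test


# Toy rows standing in for records from whoclicked.csv (values are made up).
rows = [
    {"sessionId": f"s{i % 5}", "is_agent": i % 2, "eventName": "page_view"}
    for i in range(20)
]
train, test = session_aware_split(rows, test_fraction=0.4, seed=42)

# No session id appears on both sides of the split.
train_ids = {r["sessionId"] for r in train}
test_ids = {r["sessionId"] for r in test}
print(train_ids.isdisjoint(test_ids))
```

Because the card reports only 36 sessions and a roughly 80/20 agent-to-human row ratio, a random session split can easily leave one actor type underrepresented in the held-out set; checking the `actor_type` mix of each side after splitting is probably worthwhile.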

View File

@@ -8,6 +8,7 @@ import os
import sys import sys
from pathlib import Path from pathlib import Path
from typing import Any from typing import Any
from urllib.parse import quote
import pandas as pd import pandas as pd
from huggingface_hub import HfApi from huggingface_hub import HfApi
@@ -93,6 +94,28 @@ def _time_range(df: pd.DataFrame) -> tuple[str, str]:
return ts.min().isoformat(), ts.max().isoformat() return ts.min().isoformat(), ts.max().isoformat()
def _badge(label: str, value: str, color: str, logo: str | None = None) -> str:
encoded_label = quote(label, safe="")
encoded_value = quote(value, safe="")
base = (
"https://img.shields.io/badge/"
f"{encoded_label}-{encoded_value}-{color}?style=flat-square"
)
if logo:
base = f"{base}&logo={quote(logo, safe='')}&logoColor=white"
return f"![{label}]({base})"
def _md_table(headers: list[str], rows: list[list[str]]) -> str:
header = f"| {' | '.join(headers)} |"
separator = f"| {' | '.join('---' for _ in headers)} |"
if not rows:
empty = f"| {' | '.join('n/a' for _ in headers)} |"
return "\n".join([header, separator, empty])
body = "\n".join(f"| {' | '.join(row)} |" for row in rows)
return "\n".join([header, separator, body])
def _render_card(df: pd.DataFrame) -> str: def _render_card(df: pd.DataFrame) -> str:
total_rows = len(df) total_rows = len(df)
total_cols = len(df.columns) total_cols = len(df.columns)
@@ -112,31 +135,76 @@ def _render_card(df: pd.DataFrame) -> str:
metadata_cols = sorted(c for c in df.columns if c.startswith("metadata_")) metadata_cols = sorted(c for c in df.columns if c.startswith("metadata_"))
actor_lines = ( total_sessions = sum(session_counts.values())
"\n".join(f"- `{k}`: {v}" for k, v in actor_counts.items()) or "- none" human_rows = actor_counts.get("human", 0)
agent_rows = actor_counts.get("agent", 0)
top_events = list(event_counts.items())[:10]
snapshot_table = _md_table(
["Metric", "Value"],
[
["Rows", f"`{total_rows}`"],
["Columns", f"`{total_cols}`"],
["Time range (UTC)", f"`{t_min}` -> `{t_max}`"],
["Unique sessions", f"`{total_sessions}`"],
],
) )
record_lines = (
"\n".join(f"- `{k}`: {v}" for k, v in record_counts.items()) or "- none" actor_table = _md_table(
["Actor", "Rows", "Share"],
[
[
"`human`",
str(human_rows),
f"{(human_rows / total_rows * 100):.1f}%" if total_rows else "0.0%",
],
[
"`agent`",
str(agent_rows),
f"{(agent_rows / total_rows * 100):.1f}%" if total_rows else "0.0%",
],
],
) )
pair_lines = (
"\n".join( pair_table = _md_table(
f"- `{a}` / `{r}`: {n}" ["Actor", "Record type", "Rows"],
for (a, r), n in sorted( [
[f"`{actor}`", f"`{record}`", str(n)]
for (actor, record), n in sorted(
by_actor_record.items(), key=lambda x: (x[0][0], x[0][1]) by_actor_record.items(), key=lambda x: (x[0][0], x[0][1])
) )
],
) )
or "- none"
store_table = _md_table(
["Store mode", "Rows"],
[
[f"`{mode}`", str(n)]
for mode, n in sorted(
store_counts.items(), key=lambda x: x[1], reverse=True
) )
store_lines = ( ],
"\n".join(f"- `{k}`: {v}" for k, v in store_counts.items()) or "- none"
) )
session_lines = (
"\n".join(f"- `{k}`: {v}" for k, v in session_counts.items()) or "- none" event_table = _md_table(
["Interaction event", "Count"],
[[f"`{name}`", str(n)] for name, n in top_events],
) )
top_events = list(event_counts.items())[:10]
event_lines = "\n".join(f"- `{k}`: {v}" for k, v in top_events) or "- none"
metadata_lines = "\n".join(f"- `{c}`" for c in metadata_cols) or "- none" metadata_lines = "\n".join(f"- `{c}`" for c in metadata_cols) or "- none"
dataset_badge = (
"[![Dataset on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/"
"dataset-on-hf-sm.svg)](https://huggingface.co/datasets/velocitatem/whoclickedit)"
)
rows_badge = _badge("Rows", str(total_rows), "0A9396")
cols_badge = _badge("Columns", str(total_cols), "005F73")
sessions_badge = _badge("Sessions", str(total_sessions), "1D3557")
human_badge = _badge("Human rows", str(human_rows), "2A9D8F")
agent_badge = _badge("Agent rows", str(agent_rows), "E76F51")
license_badge = _badge("License", "MIT", "111827")
return f"""--- return f"""---
pretty_name: whoclickedit pretty_name: whoclickedit
license: mit license: mit
@@ -156,85 +224,114 @@ size_categories:
- {size_cat} - {size_cat}
--- ---
# Dataset Card for whoclickedit <img align="right" width="280" src="https://raw.githubusercontent.com/velocitatem/PHANTOM/main/docs/static/images/banner.svg" alt="PHANTOM research banner" />
## Dataset Summary # [whoclickedit](https://huggingface.co/datasets/velocitatem/whoclickedit)
whoclickedit is an event-level behavioral dataset for human versus agent interaction analysis in dynamic pricing experiments.
It merges interaction logs and price quote logs into one flat CSV (`whoclicked.csv`) with explicit labels for actor type.
## Dataset Snapshot {dataset_badge}
- Rows: `{total_rows}` {rows_badge}
- Columns: `{total_cols}` {cols_badge}
- Time range (UTC): `{t_min}` to `{t_max}` {sessions_badge}
- Unique sessions by actor: {human_badge}
{session_lines} {agent_badge}
- Rows by actor: {license_badge}
{actor_lines}
- Rows by record type: > **Event-level behavior data for dynamic pricing research.**
{record_lines} > This dataset captures how humans and automated agents browse, query prices, and move through the PHANTOM storefronts during controlled experiments.
- Rows by actor x record type:
{pair_lines} ## What this dataset gives you
- Store modes:
{store_lines} - A single flat file (`whoclicked.csv`) with both interaction and price-log events.
- Explicit labels for actor origin: `actor_type` and `is_agent`.
- Provenance fields from Kafka envelopes when available.
- Metadata flattened into feature-ready `metadata_*` columns.
## Snapshot
{snapshot_table}
## Composition
### Rows by actor
{actor_table}
### Rows by actor and record type
{pair_table}
### Store mode coverage
{store_table}
### Top interaction events
{event_table}
## Collection pipeline
Data is sourced from two roots inside PHANTOM:
## Source and Processing
Data is collected from two local roots in the PHANTOM project:
- `experiments/collected_data` (human sessions) - `experiments/collected_data` (human sessions)
- `experiments/agents/collected_data` (agent sessions) - `experiments/agents/collected_data` (agent sessions)
Each session folder contains: Each session directory contains:
- `int.json` (interaction events)
- `price.json` (price quote logs)
The ETL does the following: - `int.json`: user interaction events
- Normalizes both Kafka-envelope and flat payload formats - `price.json`: price quote observations
- Flattens nested metadata fields into `metadata_*` columns
- Preserves all raw rows (no deduplication) ETL behavior:
- Adds labels:
- `actor_type` in `{{human, agent}}` 1. Accepts both Kafka-envelope records and flat payload records.
- `is_agent` in `{{0, 1}}` 2. Flattens nested JSON to a tabular schema.
- `record_type` in `{{interaction, price_log}}` 3. Preserves row-level provenance (`source_session_dir`, `source_row_index`, topic fields).
4. Adds modeling labels (`actor_type`, `is_agent`, `record_type`).
## Schema highlights
Core modeling fields:
## Data Fields
Core fields used for modeling:
- `actor_type`, `is_agent`, `record_type` - `actor_type`, `is_agent`, `record_type`
- `sessionId`, `experimentId`, `storeMode`, `ts` - `sessionId`, `experimentId`, `storeMode`, `ts`
- `eventName`, `page`, `productId`, `price`, `userAgent` - `eventName`, `page`, `productId`, `price`, `userAgent`
Kafka provenance fields: Kafka provenance fields:
- `kafka_partition_id`, `kafka_offset`, `kafka_timestamp_ms`, `kafka_compression` - `kafka_partition_id`, `kafka_offset`, `kafka_timestamp_ms`, `kafka_compression`
- `kafka_is_transactional`, `kafka_headers`, `kafka_key_*`, `kafka_value_*` - `kafka_is_transactional`, `kafka_headers`, `kafka_key_*`, `kafka_value_*`
Flattened metadata fields currently present: <details>
<summary>Metadata columns in this release</summary>
{metadata_lines} {metadata_lines}
Top interaction events: </details>
{event_lines}
## Intended Uses ## Quick start
- Human-vs-agent traffic classification
- Session-level behavioral modeling
- Dynamic pricing robustness analysis under agent-mediated reconnaissance
## Out-of-Scope Uses ```python
- Identity inference or user-level profiling from datasets import load_dataset
- Credit, employment, insurance, or legal decision making
## Data Splits ds = load_dataset("velocitatem/whoclickedit")
No official train/validation/test split is provided in the current release. ```
Users should create time-aware or session-aware splits to avoid leakage.
## Privacy and Sensitive Content Recommended split strategy:
- `userAgent` and referrer metadata can be quasi-identifying in small samples.
- Use care before publishing derived artifacts that can re-identify participants.
## Limitations - Prefer session-aware or time-aware splits.
- Data is generated in a controlled experiment platform, not a full production marketplace. - Do not split rows from the same `sessionId` across train and test.
- Agent traffic currently reflects the configured tasking and browser automation setup.
- Coverage is stronger for `hotel` than `airline` in the current release. ## Intended use
- Human-vs-agent behavior classification.
- Session-level telemetry modeling for dynamic pricing defenses.
- Robustness experiments under agent-mediated reconnaissance.
## Safety and limitations
- `userAgent` and referrer metadata can be quasi-identifying in very small samples.
- Data comes from a controlled research platform, not a full production marketplace.
- Current release has stronger coverage for `hotel` flows than `airline` flows.
## Citation ## Citation
If you use this dataset, cite the PHANTOM thesis project and link this dataset page.
If you use this dataset, cite the PHANTOM thesis project and link this page:
`https://huggingface.co/datasets/velocitatem/whoclickedit`
""" """