Files
PHANTOM/experiments/procesing/steps/chunk.py
Daniel Alves Rösel ad9423bf59 Airflow addition (#28)
* introducing airflow to run pipeline

* chore: updating dag with upload to registry

* introducing complete provider (non refactored and noisy)

* chore: removing old shit

* generic pricing baselines

* feature: super simple model registry (to be updated maybe third party OS software)

* chore: refactoring the providers docker config and requirements

* chore: refactored and broke down components (braking

* exporting all

* local pipeline excution working

* fix: fixing import structures from nonrelativistic

* chore: enables cross comm pickling with fully e2e pipeline compilation

* docs: what the pipeline is like now

* pipelines local running and pipeline high level definition

* cleaning old pipeline and vectorization

* leaked but fixing, not so important

* test: started with pipeline step testing

* chore: cleaning up provider of prices

* test: extra tests wit hsemantic meaning checks

* migrating pricers

* feature: introducing pricing predictors (pricers)

* chore: e2e is done with new pipeline

* extra session feature extraction

* feature: experiemntal sessin pricer and metrics(vibe)

* chore: redefined and connected pricers (#29)
2025-11-29 17:50:16 +01:00

35 lines
1.0 KiB
Python
Executable File

import pandas as pd
from procesing.steps.base import BaseContextStep
class ChunkByTimeWindowStep(BaseContextStep):
"""
Chunk dataframe into time windows.
Returns list of dicts with window metadata.
"""
def transform(self, df: pd.DataFrame):
if df.empty:
return []
df = df.copy()
ts_col = self.context.config.get('ts_col', 'ts')
window_size = self.context.window_size
# ensure datetime
if not pd.api.types.is_datetime64_any_dtype(df[ts_col]):
df[ts_col] = pd.to_datetime(df[ts_col])
df = df.sort_values(ts_col)
df['_window'] = df[ts_col].dt.floor(window_size)
chunks = []
for idx, (window_start, group) in enumerate(df.groupby('_window')):
chunks.append({
'window_start': window_start,
'window_end': window_start + pd.Timedelta(window_size),
'window_idx': idx,
'data': group.drop(columns=['_window'])
})
return chunks