Airflow addition (#28)

* introducing airflow to run pipeline * chore: updating dag with upload to registry * introducing complete provider (non refactored and noisy) * chore: removing old shit * generic pricing baselines * feature: super simple model registry (to be updated maybe third party OS software) * chore: refactoring the providers docker config and requirements * chore: refactored and broke down components (braking * exporting all * local pipeline excution working * fix: fixing import structures from nonrelativistic * chore: enables cross comm pickling with fully e2e pipeline compilation * docs: what the pipeline is like now * pipelines local running and pipeline high level definition * cleaning old pipeline and vectorization * leaked but fixing, not so important * test: started with pipeline step testing * chore: cleaning up provider of prices * test: extra tests wit hsemantic meaning checks * migrating pricers * feature: introducing pricing predictors (pricers) * chore: e2e is done with new pipeline * extra session feature extraction * feature: experiemntal sessin pricer and metrics(vibe) * chore: redefined and connected pricers (#29)
2026-07-16 01:53:37 +00:00 · 2025-11-29 17:50:16 +01:00
parent 2a0e44ab24
commit ad9423bf59
49 changed files with 3642 additions and 619 deletions
--- a/experiments/procesing/steps/chunk.py
+++ b/experiments/procesing/steps/chunk.py
@@ -0,0 +1,34 @@
+import pandas as pd
+from procesing.steps.base import BaseContextStep
+
+class ChunkByTimeWindowStep(BaseContextStep):
+    """
+    Chunk dataframe into time windows.
+    Returns list of dicts with window metadata.
+    """
+
+    def transform(self, df: pd.DataFrame):
+        if df.empty:
+            return []
+
+        df = df.copy()
+        ts_col = self.context.config.get('ts_col', 'ts')
+        window_size = self.context.window_size
+
+        # ensure datetime
+        if not pd.api.types.is_datetime64_any_dtype(df[ts_col]):
+            df[ts_col] = pd.to_datetime(df[ts_col])
+
+        df = df.sort_values(ts_col)
+        df['_window'] = df[ts_col].dt.floor(window_size)
+
+        chunks = []
+        for idx, (window_start, group) in enumerate(df.groupby('_window')):
+            chunks.append({
+                'window_start': window_start,
+                'window_end': window_start + pd.Timedelta(window_size),
+                'window_idx': idx,
+                'data': group.drop(columns=['_window'])
+            })
+
+        return chunks