Add electricity price ingestion and feature pipeline.
Introduce ENTSO-E data retrieval with layered caching, robust bidding-zone and missing-data handling, and persist model-ready features with detailed architecture/developer documentation. Made-with: Cursor
This commit is contained in:
73
electricity_price_predictor/docs/architecture.md
Normal file
73
electricity_price_predictor/docs/architecture.md
Normal file
@@ -0,0 +1,73 @@
|
||||
# Architecture Notes
|
||||
|
||||
This document is the quick architecture map. For full file-by-file implementation details, see `docs/developer_guide.md`.
|
||||
|
||||
## End-to-end data flow
|
||||
|
||||
1. `scripts/build_feature_store.py` parses CLI arguments and validates env vars.
|
||||
2. It calls `pipeline.run_feature_pipeline(...)`.
|
||||
3. `EntsoeDataService.fetch_inputs(...)` loads:
|
||||
- `day_ahead_price`
|
||||
- `load_forecast`
|
||||
- `wind_forecast`
|
||||
- `solar_forecast`
|
||||
4. Each ENTSO-E call is wrapped by `cache_to_db(...)` and either:
|
||||
- serves a hit from `entsoe_api_cache`, or
|
||||
- falls back to `electricity_market_observations` for already-known timestamps,
|
||||
- and performs API calls only for missing hourly intervals.
|
||||
5. Missing intervals returned from API are upserted into `electricity_market_observations`.
|
||||
6. The final merged result is cached in `entsoe_api_cache`.
|
||||
7. Raw merged series are upserted to `electricity_market_observations`.
|
||||
8. `features.build_feature_frame(...)` computes:
|
||||
- `residual_load`
|
||||
- lagged arrays (24 values each)
|
||||
- cyclical encodings for hour/weekday/month
|
||||
- preserves `NaN` for missing forecast-derived values.
|
||||
9. `pipeline.persist_feature_frame(...)` upserts model-ready rows to `electricity_price_features`.
|
||||
- filters out rows that violate feature-table NOT NULL constraints.
|
||||
|
||||
## Process diagram
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[build_feature_store.py CLI] --> B[run_feature_pipeline]
|
||||
B --> C[EntsoeDataService.fetch_inputs]
|
||||
C --> D{Hit in entsoe_api_cache?}
|
||||
D -->|Yes| E[Load payload from entsoe_api_cache]
|
||||
D -->|No| F[Read electricity_market_observations]
|
||||
F --> G{Missing hourly timestamps?}
|
||||
G -->|No| H[Reuse DB observation rows]
|
||||
G -->|Yes| I[Call ENTSO-E only for missing ranges]
|
||||
I --> I2{NoMatchingDataError?}
|
||||
I2 -->|Yes| I3[Use empty hourly frame for endpoint]
|
||||
I2 -->|No| I4[Normalize payload]
|
||||
I4 --> I5[Coalesce duplicate columns by first non-null]
|
||||
I3 --> J[Upsert missing rows to electricity_market_observations]
|
||||
I5 --> J
|
||||
H --> K[Build merged input DataFrame]
|
||||
J --> K
|
||||
K --> L[Store payload in entsoe_api_cache]
|
||||
E --> M[Use cached input DataFrame]
|
||||
L --> N[Upsert electricity_market_observations]
|
||||
M --> N
|
||||
N --> O[build_feature_frame]
|
||||
O --> P[Create lags + cyclical features]
|
||||
P --> P2[Preserve NaN in forecast-derived columns]
|
||||
P2 --> P3[Drop rows missing day_ahead_price or lagged_price_1..24]
|
||||
P3 --> Q[Upsert persistable subset into electricity_price_features]
|
||||
```
|
||||
|
||||
## Key design reasons
|
||||
|
||||
- DB cache avoids repeated ENTSO-E calls during iterative model work.
|
||||
- Observation-table fallback avoids re-fetching timestamps already persisted once.
|
||||
- Pickled payloads preserve exact pandas object shape and index information.
|
||||
- Feature table stores fixed-size lag arrays so one row corresponds to one prediction timestamp.
|
||||
- Missing forecasts are kept as `NaN` in analysis outputs, avoiding misleading zero-imputation.
|
||||
- Persistence layer enforces schema compatibility by skipping rows with nulls in NOT NULL feature columns.
|
||||
|
||||
## Extension points
|
||||
|
||||
- Add label/target tables (`t+1`, `t+24`, etc.).
|
||||
- Add training metadata + model registry tables.
|
||||
- Add partitioning strategy for multi-year production-scale data.
|
||||
Reference in New Issue
Block a user