Add electricity price ingestion and feature pipeline.
Introduce ENTSO-E data retrieval with layered caching, robust bidding-zone and missing-data handling, and persist model-ready features with detailed architecture/developer documentation.

Made-with: Cursor
electricity_price_predictor/docs/developer_guide.md
@@ -0,0 +1,362 @@
# Developer Guide (Deep Dive)

This guide explains each file in the module, execution order, control flow, and data/state transitions so you can reason about behavior without reading source code.

## 1) Directory map and responsibilities

### Top-level

- `requirements.txt`
  - Python dependencies for ingestion and DB persistence.
- `README.md`
  - Operator-focused setup and run commands.
- `sql/001_electricity_price_schema.sql`
  - DDL for cache, raw observations, and feature store.
- `scripts/init_db.py`
  - Applies the SQL schema to `quant_db`.
- `scripts/build_feature_store.py`
  - CLI entrypoint for data fetch + feature persistence.
- `docs/architecture.md`
  - High-level architecture summary.
- `docs/developer_guide.md`
  - This detailed developer-facing explanation.

### Python package (`src/electricity_price_predictor`)

- `__init__.py`
  - Public package exports (`get_engine`, `EntsoeDataService`, `build_feature_frame`).
- `db.py`
  - Builds DB URL from env vars and creates SQLAlchemy `Engine`.
- `cache.py`
  - Implements decorator-based DB cache with deterministic keying.
- `entsoe_api.py`
  - Wraps ENTSO-E API calls, normalizes data, and writes raw observations.
- `features.py`
  - Pure feature engineering logic (residual load, lags, cyclical encoding).
- `pipeline.py`
  - Orchestration layer for end-to-end fetch -> raw persist -> feature build -> feature persist.

## 2) Runtime execution path (step-by-step)

When you run:

```bash
PYTHONPATH=src python3 scripts/build_feature_store.py --country-code ... --start ... --end ...
```

Execution sequence:

1. **Argument parsing**
   - `build_feature_store.py` reads country code/time range/TTL.
2. **Credential/connection bootstrap**
   - checks `ENTSOE_API_KEY`.
   - calls `get_engine()` from `db.py`.
3. **Pipeline orchestration**
   - `run_feature_pipeline(...)` in `pipeline.py` starts.
4. **API service creation**
   - initializes `EntsoePandasClient`.
   - creates `EntsoeDataService(client, engine, cache_ttl_hours)`.
5. **Decorator wrapping**
   - in `EntsoeDataService.__post_init__`, API methods are wrapped by `cache_to_db(...)`.
6. **Data retrieval**
   - `fetch_inputs(...)` calls:
     - `get_day_ahead_prices(...)`
     - `get_load_forecast(...)`
     - `get_wind_solar_forecast(...)`
   - country aliases are normalized to bidding zones before queries (currently `DE -> DE_LU`, `IT -> IT_NORD`).
7. **Cache check/compute loop (per call)**
   - decorator computes a hash key from function + args.
   - if a non-expired row exists in `entsoe_api_cache`: returns its payload.
   - else: reads `electricity_market_observations` for the requested timestamps.
   - if timestamps are missing there, only the missing hourly ranges are requested from ENTSO-E.
   - `NoMatchingDataError` from ENTSO-E is converted to an empty hourly frame for that endpoint/range.
   - normalized responses coalesce duplicate semantic columns (for example multiple wind/solar columns) by taking the first non-null value per row.
   - missing rows are upserted into `electricity_market_observations`.
   - the final merged dataset is stored in `entsoe_api_cache` and returned.
8. **Raw persistence**
   - merged inputs are upserted to `electricity_market_observations`.
9. **Feature engineering**
   - `build_feature_frame(...)` computes:
     - `residual_load = load - wind - solar`
     - `lagged_price_1..24`
     - `lagged_residual_load_1..24`
     - `hour_of_day_sin/cos`, `weekday_sin/cos`, `month_sin/cos`
   - preserves source missingness as `NaN` (no 0.0 imputation).
   - drops a row only when `day_ahead_price` or any of `lagged_price_1..24` is missing (lag warmup requirement).
10. **Feature-store persistence**
    - lags are materialized into PostgreSQL arrays (`DOUBLE PRECISION[]`, length 24).
    - rows violating NOT NULL core feature constraints are filtered out before upsert.
    - persistable rows are upserted to `electricity_price_features`.
11. **CLI completion**
    - prints persisted row count.
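
The cyclical encoding named in step 9 can be sketched with the standard library alone. This is an illustration, not the code in `features.py`: the helper names `cyclical_encode` and `calendar_features` are hypothetical, and the periods (24, 7, 12) and the zero-based month are assumptions about how the real encoder is parameterized.

```python
import math
from datetime import datetime, timezone


def cyclical_encode(value: float, period: float) -> tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def calendar_features(ts: datetime) -> dict[str, float]:
    """Hypothetical helper: encode hour, weekday, and month of a UTC timestamp."""
    hour_sin, hour_cos = cyclical_encode(ts.hour, 24)
    weekday_sin, weekday_cos = cyclical_encode(ts.weekday(), 7)
    # Assumption: months are shifted to 0..11 so January sits at angle 0.
    month_sin, month_cos = cyclical_encode(ts.month - 1, 12)
    return {
        "hour_of_day_sin": hour_sin,
        "hour_of_day_cos": hour_cos,
        "weekday_sin": weekday_sin,
        "weekday_cos": weekday_cos,
        "month_sin": month_sin,
        "month_cos": month_cos,
    }


# Monday 2024-01-01 06:00 UTC: a quarter of the way around the daily circle.
features = calendar_features(datetime(2024, 1, 1, 6, tzinfo=timezone.utc))
```

Encoding each calendar field as a sine/cosine pair keeps hour 23 numerically close to hour 0, which a raw integer column would not.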

## 3) UML diagrams

### 3.1 Component diagram

```mermaid
flowchart LR
    CLI[scripts/build_feature_store.py] --> PIPE[pipeline.run_feature_pipeline]
    CLI --> DBMOD[db.get_engine]
    PIPE --> SERVICE[EntsoeDataService]
    SERVICE --> CACHEDEC[cache_to_db decorator]
    SERVICE --> ENTSOE[EntsoePandasClient]
    SERVICE --> SECONDARY[electricity_market_observations secondary cache]
    PIPE --> FEAT[features.build_feature_frame NaN-preserving]
    FEAT --> PERSIST[pipeline.persist_feature_frame null-filtered]
    CACHEDEC --> DB[(quant_db.entsoe_api_cache)]
    SECONDARY --> RAW[(quant_db.electricity_market_observations)]
    PERSIST --> STORE[(quant_db.electricity_price_features)]
```

### 3.2 Class diagram (logical)

```mermaid
classDiagram
    class EntsoeDataService {
        +client: EntsoePandasClient
        +engine: Engine
        +cache_ttl_hours: Optional[int]
        +fetch_inputs(country_code, start, end) DataFrame
        +upsert_raw_data(country_code, frame) None
        -_get_day_ahead_prices_impl(country_code, start, end) Series
        -_get_load_forecast_impl(country_code, start, end) Series
        -_get_wind_solar_forecast_impl(country_code, start, end) DataFrame
    }

    class CacheDecorator {
        +cache_to_db(engine, namespace, ttl_hours) decorator
        -_build_cache_key(function_name, args, kwargs) str
    }

    class FeatureBuilder {
        +build_feature_frame(inputs, max_lag=24) DataFrame
        -_cyclical_encode(values, period, prefix) DataFrame
    }

    class Pipeline {
        +run_feature_pipeline(engine, entsoe_api_key, country_code, start, end, cache_ttl_hours) DataFrame
        +persist_feature_frame(engine, country_code, feature_frame) None
    }

    Pipeline --> EntsoeDataService : uses
    Pipeline --> FeatureBuilder : uses
    EntsoeDataService --> CacheDecorator : wraps methods
```

### 3.3 Sequence diagram (single API method with cache)

```mermaid
sequenceDiagram
    participant Caller as fetch_inputs()
    participant Decorator as cache_to_db wrapper
    participant CacheTable as entsoe_api_cache (L1)
    participant ObsTable as electricity_market_observations (L2)
    participant API as ENTSO-E API

    Caller->>Decorator: get_day_ahead_prices(country, start, end)
    Decorator->>CacheTable: SELECT by cache_key and expires_at
    alt L1 cache hit
        CacheTable-->>Decorator: payload
        Decorator-->>Caller: unpickled pandas object
    else L1 cache miss/expired
        Decorator->>ObsTable: SELECT existing timestamps
        alt L2 fully covers range
            ObsTable-->>Decorator: pandas-compatible rows
        else L2 has gaps
            Decorator->>API: query only missing ranges
            alt API returns data
                API-->>Decorator: missing rows
                Decorator->>Decorator: normalize columns + coalesce duplicates
                Decorator->>ObsTable: UPSERT missing rows
            else NoMatchingDataError
                Decorator->>Decorator: synthesize empty hourly frame
            end
        end
        Decorator->>CacheTable: INSERT/UPSERT merged payload
        Decorator-->>Caller: fresh result
    end
```

### 3.4 State diagram (cache entry lifecycle)

```mermaid
stateDiagram-v2
    [*] --> L1Missing
    L1Missing --> L2Check: cache miss/expiry
    L2Check --> Fresh: observation table fully covers range
    L2Check --> Partial: observation table has gaps
    Partial --> Fresh: fetch missing ranges, upsert L2, upsert L1
    Fresh --> Fresh: reused before expiry
    Fresh --> Expired: TTL passes for L1 entry
    Expired --> L2Check: next call
    Fresh --> Overwritten: same key, new payload upsert
    Overwritten --> Fresh
```

### 3.5 ER diagram (database schema)

```mermaid
erDiagram
    entsoe_api_cache {
        text cache_key PK
        text namespace
        text function_name
        jsonb args_json
        bytea payload
        timestamptz created_at
        timestamptz expires_at
    }

    electricity_market_observations {
        text country_code PK
        timestamptz delivery_start PK
        float day_ahead_price
        float load_forecast
        float wind_forecast
        float solar_forecast
        timestamptz ingested_at
    }

    electricity_price_features {
        text country_code PK
        timestamptz delivery_start PK
        text feature_version PK
        float day_ahead_price
        float load_forecast
        float wind_forecast
        float solar_forecast
        float residual_load
        float[] lagged_price
        float[] lagged_residual_load
        float hour_of_day_sin
        float hour_of_day_cos
        float weekday_sin
        float weekday_cos
        float month_sin
        float month_cos
        timestamptz created_at
    }
```

## 4) How files collaborate

### 4.1 `db.py` + scripts

- Scripts never hardcode the DB URI; they call `get_engine()`.
- `get_engine()` centralizes environment-driven connectivity.
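
A minimal sketch of environment-driven URL construction, under stated assumptions: the variable names (`QUANT_DB_USER`, `QUANT_DB_PASSWORD`, `QUANT_DB_HOST`, `QUANT_DB_PORT`, `QUANT_DB_NAME`), the defaults, the `psycopg2` driver suffix, and the helper name `build_db_url` are all hypothetical; the names actually read by `db.py` may differ.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=None) -> str:
    """Assemble a PostgreSQL URL from environment-style key/value pairs."""
    env = os.environ if env is None else env
    user = env.get("QUANT_DB_USER", "postgres")          # hypothetical variable name
    password = quote_plus(env.get("QUANT_DB_PASSWORD", ""))  # escape '@', ':' and similar
    host = env.get("QUANT_DB_HOST", "localhost")
    port = env.get("QUANT_DB_PORT", "5432")
    name = env.get("QUANT_DB_NAME", "quant_db")
    # Assumption: the psycopg2 driver; adjust the scheme for another driver.
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


# Passing a plain dict instead of os.environ keeps the helper easy to test.
url = build_db_url({"QUANT_DB_USER": "quant", "QUANT_DB_PASSWORD": "p@ss"})
```

A URL of this shape is what SQLAlchemy's `create_engine(url)` accepts, so a real `get_engine()` could be a thin wrapper around such a builder.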

### 4.2 `cache.py` + `entsoe_api.py`

- `cache_to_db()` is generic and independent of ENTSO-E specifics.
- `EntsoeDataService.__post_init__` binds that generic decorator to each API-fetch method.
- Result: all expensive API calls automatically become cache-aware without changing call sites.
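
The pattern above can be illustrated with a small stand-in in which a plain dict replaces the `entsoe_api_cache` table. Everything here is a sketch under assumptions: SHA-256 as the hash, the helper names `build_cache_key` and `cache_in_memory`, and the toy `fetch_prices` function are hypothetical and not taken from `cache.py`.

```python
import functools
import hashlib
import json
import pickle


def build_cache_key(function_name, args, kwargs) -> str:
    """Deterministic key: JSON with sorted keys, then a hash (SHA-256 assumed)."""
    payload = json.dumps(
        {"function": function_name, "args": args, "kwargs": kwargs},
        sort_keys=True,
        default=str,  # stringify values JSON cannot encode, e.g. timestamps
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_in_memory(store: dict):
    """Decorator factory; `store` stands in for the database cache table."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = build_cache_key(func.__name__, args, kwargs)
            if key in store:
                return pickle.loads(store[key])  # cache hit: no call to func
            result = func(*args, **kwargs)
            store[key] = pickle.dumps(result)    # payload kept as pickled bytes
            return result
        return wrapper
    return decorator


calls = []
store = {}


@cache_in_memory(store)
def fetch_prices(country_code, start, end):
    """Toy stand-in for an expensive API call."""
    calls.append((country_code, start, end))
    return [42.0, 43.5]


first = fetch_prices("DE_LU", "2024-01-01", "2024-01-02")
second = fetch_prices("DE_LU", "2024-01-01", "2024-01-02")  # served from store
```

Because the key depends only on the function name and its arguments, call sites stay unchanged: wrapping the function is enough to make repeated identical requests skip the underlying call.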

### 4.3 `entsoe_api.py` + `features.py`

- `entsoe_api.py` guarantees a normalized timestamp index and the expected source columns.
- `features.py` assumes these columns and transforms them to model features only (no DB side effects).

### 4.4 `features.py` + `pipeline.py`

- `build_feature_frame()` returns a wide DataFrame with `lagged_*_1..24`.
- `persist_feature_frame()` converts those to PostgreSQL arrays so table rows stay compact and versioned.
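
The wide-to-array conversion can be sketched on a single row represented as a dict. This is illustrative only: the helper name `lags_to_array` is hypothetical, and mapping `NaN` to `None` (so that it becomes SQL `NULL` inside the array) is an assumption about how missing lag values are written.

```python
import math

MAX_LAG = 24


def lags_to_array(row: dict, prefix: str, max_lag: int = MAX_LAG) -> list:
    """Collect prefix_1 .. prefix_<max_lag> into one list, ordered by lag."""
    values = []
    for lag in range(1, max_lag + 1):
        value = row[f"{prefix}_{lag}"]
        # Assumption: NaN is written as NULL inside the array.
        values.append(None if value is None or math.isnan(value) else float(value))
    return values


# One wide row: 24 residual-load lag columns, one of them missing.
row = {f"lagged_residual_load_{k}": 100.0 + k for k in range(1, MAX_LAG + 1)}
row["lagged_residual_load_3"] = float("nan")

lagged_residual_load = lags_to_array(row, "lagged_residual_load")
```

Element `i` of the resulting list corresponds to lag `i + 1`, so a consumer can rebuild the wide columns from the array without extra metadata.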

## 5) Important implementation details

- **Cache keys are deterministic**
  - Built from JSON of function name + args + kwargs with stable sorting.
- **Cache payload type**
  - Pickled bytes stored in a `BYTEA` column to preserve pandas objects.
- **TTL logic**
  - `expires_at IS NULL` means the entry never expires.
  - Otherwise `expires_at` must be later than the current UTC time for the entry to be considered valid.
- **Two-layer cache order**
  - Layer 1: `entsoe_api_cache` (function-result cache).
  - Layer 2: `electricity_market_observations` (timestamp-level raw cache).
  - API calls happen only for Layer-2 gaps.
- **Upsert strategy**
  - Raw and feature tables use `ON CONFLICT ... DO UPDATE` for idempotent reruns.
  - Raw upsert uses `COALESCE(EXCLUDED.col, existing.col)` to avoid null-overwriting previously stored values during partial refreshes.
  - Feature upsert operates on a filtered persistable subset where core NOT NULL columns are present.
- **Missingness semantics**
  - Forecast and derived residual columns preserve `NaN` in memory.
  - No zero-imputation is performed for missing forecast values.
- **Bidding-zone normalization**
  - `resolve_bidding_zone_code(...)` maps common country aliases to ENTSO-E zone codes.
  - Pipeline persistence uses the resolved code, ensuring DB keys match actual queried zones.
- **Timezone handling**
  - API index is normalized to UTC to avoid DST ambiguity in lag features.
- **Feature warmup**
  - Rows missing `day_ahead_price` or any `lagged_price_1..24` are dropped because lag history is incomplete.
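
The TTL rule listed above reduces to two small functions. This is a sketch, not the code in `cache.py`: the helper names `compute_expires_at` and `is_cache_row_valid` are hypothetical, and it assumes timezone-aware UTC datetimes throughout.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_expires_at(created_at: datetime, ttl_hours: Optional[int]) -> Optional[datetime]:
    """A TTL of None yields expires_at = None, i.e. the entry never expires."""
    if ttl_hours is None:
        return None
    return created_at + timedelta(hours=ttl_hours)


def is_cache_row_valid(expires_at: Optional[datetime], now: Optional[datetime] = None) -> bool:
    """Valid when expires_at is NULL or strictly later than the current UTC time."""
    if expires_at is None:
        return True
    now = now or datetime.now(timezone.utc)
    return expires_at > now


created = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
expires = compute_expires_at(created, ttl_hours=6)

still_valid = is_cache_row_valid(expires, now=created + timedelta(hours=5))
expired = not is_cache_row_valid(expires, now=created + timedelta(hours=6))
never_expires = is_cache_row_valid(compute_expires_at(created, None), now=created + timedelta(days=365))
```

The comparison is strict, so an entry whose `expires_at` equals the current time is already treated as expired.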

## 6) Failure modes and expected behavior

- Missing `ENTSOE_API_KEY` -> CLI raises early runtime error.
- Missing required input columns -> feature builder raises `ValueError`.
- Duplicate normalized columns from ENTSO-E payloads -> coalesced before reindexing to avoid pandas duplicate-label reindex errors.
- ENTSO-E no-data responses for an endpoint/range -> transformed to empty hourly frames and merged safely.
- Empty data frame -> raw/feature persistence functions no-op safely.
- Repeated identical request -> cache hit (no API roundtrip).
- Expired L1 cache row + full L2 coverage -> no API call required.
- Expired L1 cache row + partial L2 coverage -> API called only for missing ranges.
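
The last two cases depend on finding which hours the observation table does not yet cover. One way to compute those gaps is sketched below; the helper names `expected_hours` and `missing_ranges` are hypothetical, and the real service may derive the ranges differently (for example from a pandas index).

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)


def expected_hours(start: datetime, end: datetime) -> list:
    """Hourly timestamps in the half-open range [start, end)."""
    hours = []
    current = start
    while current < end:
        hours.append(current)
        current += HOUR
    return hours


def missing_ranges(start: datetime, end: datetime, present: set) -> list:
    """Group hours absent from `present` into contiguous [range_start, range_end) pairs."""
    ranges = []
    for ts in expected_hours(start, end):
        if ts in present:
            continue
        if ranges and ranges[-1][1] == ts:
            # Extend the current gap by one hour.
            ranges[-1] = (ranges[-1][0], ts + HOUR)
        else:
            # Start a new gap.
            ranges.append((ts, ts + HOUR))
    return ranges


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = start + 6 * HOUR
present = {start, start + HOUR, start + 4 * HOUR}  # hours 0, 1, and 4 are stored

gaps = missing_ranges(start, end, present)       # hours 2-3 and hour 5 are missing
no_gaps = missing_ranges(start, end, set(expected_hours(start, end)))
```

An empty result corresponds to the "full L2 coverage" case, where no API call is needed; each returned pair corresponds to one API request in the partial-coverage case.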

## 7) Data contracts

### 7.1 In-memory features contract

Producer: `run_feature_pipeline(...)` return value (`pd.DataFrame`).

- **Index contract**
  - hourly UTC `DatetimeIndex`, sorted ascending.
  - unique timestamps expected after deduplication.
- **Column contract**
  - base: `day_ahead_price`, `load_forecast`, `wind_forecast`, `solar_forecast`
  - derived: `residual_load`
  - lag columns: `lagged_price_1..24`, `lagged_residual_load_1..24`
  - cyclical: `hour_of_day_sin/cos`, `weekday_sin/cos`, `month_sin/cos`
- **Nullability contract**
  - required non-null in returned rows: `day_ahead_price`, `lagged_price_1..24`
  - nullable: `load_forecast`, `wind_forecast`, `solar_forecast`, `residual_load`, and `lagged_residual_load_*`
  - rationale: preserve upstream missingness semantics for analysis and QC.
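
The nullability contract can be made concrete with a pure-Python sketch of lag construction and warmup dropping. It is a simplified stand-in for `build_feature_frame(...)`, which works on a DataFrame; the function name `build_lagged_rows` and the list-of-dicts representation are hypothetical.

```python
import math

NAN = float("nan")


def build_lagged_rows(prices: list, residual_load: list, max_lag: int = 24) -> list:
    """Attach lag columns, then keep only rows whose price and price lags are present."""
    rows = []
    for i in range(len(prices)):
        row = {"day_ahead_price": prices[i], "residual_load": residual_load[i]}
        for lag in range(1, max_lag + 1):
            j = i - lag
            # Before the start of the series there is no history: use NaN.
            row[f"lagged_price_{lag}"] = prices[j] if j >= 0 else NAN
            row[f"lagged_residual_load_{lag}"] = residual_load[j] if j >= 0 else NAN
        required = [row["day_ahead_price"]]
        required += [row[f"lagged_price_{k}"] for k in range(1, max_lag + 1)]
        if not any(math.isnan(v) for v in required):
            rows.append(row)  # residual-load NaNs are kept, not imputed
    return rows


prices = [float(p) for p in range(30)]                           # 30 hourly prices
residual = [NAN if i == 10 else 100.0 + i for i in range(30)]    # one missing forecast hour

rows = build_lagged_rows(prices, residual)
```

With 30 input hours and 24 lags, the first 24 rows are dropped as warmup and 6 remain; the missing residual-load value at hour 10 survives as `NaN` in the lag columns of later rows.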

### 7.2 Feature-store persistence contract

Consumer: `electricity_price_features` table.

- **Primary key contract**
  - (`country_code`, `delivery_start`, `feature_version`)
- **Schema constraint contract**
  - core numeric columns are `NOT NULL`.
  - lag arrays are `DOUBLE PRECISION[]` and expected length 24.
- **Write-time contract**
  - `persist_feature_frame(...)` filters rows that violate NOT NULL core columns before UPSERT.
  - retained rows are idempotently upserted via `ON CONFLICT ... DO UPDATE`.

### 7.3 Raw-observation contract

Consumer: `electricity_market_observations` table.

- **Primary key contract**
  - (`country_code`, `delivery_start`)
- **Merge contract**
  - upsert uses `COALESCE(EXCLUDED.col, existing.col)` to avoid null-overwriting prior known values.
- **Coverage contract**
  - secondary cache guarantees fetched payloads are aligned to expected hourly index for the requested `[start, end)` range.
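
The merge contract can be demonstrated end to end with an in-memory SQLite database standing in for PostgreSQL. This is a sketch under assumptions: the table is reduced to two value columns, timestamps are stored as text, and SQLite 3.24 or newer is required for `ON CONFLICT ... DO UPDATE`; the production DDL lives in `sql/001_electricity_price_schema.sql`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE electricity_market_observations (
        country_code    TEXT NOT NULL,
        delivery_start  TEXT NOT NULL,
        day_ahead_price REAL,
        load_forecast   REAL,
        PRIMARY KEY (country_code, delivery_start)
    )
    """
)

# New non-null values win; NULLs in the incoming row keep the stored value.
UPSERT = """
    INSERT INTO electricity_market_observations
        (country_code, delivery_start, day_ahead_price, load_forecast)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (country_code, delivery_start) DO UPDATE SET
        day_ahead_price = COALESCE(excluded.day_ahead_price,
                                   electricity_market_observations.day_ahead_price),
        load_forecast   = COALESCE(excluded.load_forecast,
                                   electricity_market_observations.load_forecast)
"""

key = ("DE_LU", "2024-01-01T00:00:00Z")
conn.execute(UPSERT, (*key, 55.0, None))       # first fetch: price only
conn.execute(UPSERT, (*key, None, 41000.0))    # partial refresh: load only
conn.execute(UPSERT, (*key, 60.0, None))       # price revision, load untouched

row = conn.execute(
    "SELECT day_ahead_price, load_forecast FROM electricity_market_observations"
).fetchone()
row_count = conn.execute(
    "SELECT COUNT(*) FROM electricity_market_observations"
).fetchone()[0]
```

After the three upserts there is still one row for the key, holding the revised price and the load value from the partial refresh, which is the idempotent, non-destructive behavior the contract describes.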

## 8) Practical debugging checklist

1. Run `scripts/init_db.py` and ensure tables exist.
2. Run one short-range fetch window (1-2 days) first.
3. Verify cache growth:
   - `SELECT namespace, function_name, COUNT(*) FROM entsoe_api_cache GROUP BY 1,2;`
4. Verify raw persistence:
   - `SELECT COUNT(*) FROM electricity_market_observations WHERE country_code = '...';`
5. Verify feature persistence:
   - check that lag arrays have length 24 and that the feature row count is lower than the raw row count by about 24 (the lag warmup rows).

## 9) Suggested next developer docs to add

- Data quality rules (acceptable missingness, clipping policy, anomaly handling).
- Training-set contract (target definition, split strategy, leakage constraints).
- Backfill/replay policy for reprocessing historical periods.