# ChiNext 50 Regime Project Starter

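This is a project skeleton for daily-frequency, regime-aware exposure control, built **specifically for the ChiNext 50** (创业板50).

Its goal is not to predict each day's direction, but to do the following as far as possible:
- lose less during sharp sell-offs and crowded periods
- rebuild exposure gradually during genuine repair phases
- keep most of the participation during main uptrends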

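## What is already in place

- `data/`: CSV/parquet readers + synthetic demo data generator
- `features/`: three feature layers (price, breadth, relative strength)
- `model/`: continuous scores, 5-state state machine, exposure mapping, and hard vetoes
- `backtest/`: backtest with approximate next-open execution, utility, event slicing
- `pipelines/`: demo pipeline + frozen-hypothesis validation
- `tests/`: minimal end-to-end test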

## Core states

- `risk_off`
- `repair`
- `trend`
- `chop`
- `euphoric_late`

## Core scores

- `trend_score`
- `breadth_score`
- `stress_score`
- `crowding_score`
- `repair_score`

Plus three path-type hazards:
- `down_hazard`
- `repair_hazard`
- `rebound_hazard`

## Running the demo

From the project root, run:

```bash
python pipelines/run_demo.py \
  --pit-csv path/to/chinext50_pit.csv \
  --output-dir outputs/demo
```

This run reads the file passed via `--pit-csv` and writes:
- `outputs/demo/daily_ledger.csv`
- `outputs/demo/event_summary.csv`
- `outputs/demo/metrics_summary.json`

## Running frozen-hypothesis validation

```bash
python pipelines/frozen_hypothesis_validation.py \
  --pit-csv path/to/chinext50_pit.csv \
  --output-dir outputs/frozen_validation
```

## Switching to real data

Your CSV/parquet file needs at least these columns:

- `date`
- `open`
- `high`
- `low`
- `close`
- `volume`

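The following columns are also recommended (the runtime entrypoints described under "Real Data Input Contract and Quality Gate" treat them as required):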
- `hs300_close`
- `star50_close`
- `csi1000_close`
- `pct_constituents_above_20dma`
- `pct_constituents_above_60dma`
- `pct_new_high_20`
- `pct_new_low_20`
- `eq_weight_ret_5`
- `weighted_ret_5`
- `top3_contribution_5`
- `corr_spike_20`
- `dispersion_20`

Run it with:

```bash
python pipelines/run_demo.py \
  --pit-csv path/to/chinext50_pit.csv \
  --output-dir outputs/real_data_demo
```

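## Important notes

- The current scaffold is **not** evidence of performance; it only wires up the loop "features -> scores -> states -> exposure -> backtest -> event diagnostics" end to end.
- Any economic effect has to be established by strict walk-forward validation after you plug in **real ChiNext 50 index/ETF history plus historical constituent breadth data**.
- In the first phase, do not expand to multiple markets or a complex readiness/portability system at the same time.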

## Real Data Input Contract and Quality Gate

The runtime pipelines now require a full point-in-time dataset and can optionally block low-quality data before feature construction.

### Required PIT columns

`date`, `open`, `high`, `low`, `close`, `volume`, `hs300_close`, `star50_close`, `csi1000_close`, `pct_constituents_above_20dma`, `pct_constituents_above_60dma`, `pct_new_high_20`, `pct_new_low_20`, `eq_weight_ret_5`, `weighted_ret_5`, `top3_contribution_5`, `top1_contribution_5`, `top10_contribution_5`, `sector_concentration_20`, `corr_spike_20`, `dispersion_20`

- Column names are normalized to lowercase with surrounding whitespace removed.
- Duplicate trading dates are rejected.
- Rows are sorted by trading date before downstream processing.

- Runtime entrypoints no longer merge sidecars on the fly.
- If required PIT columns are missing, the pipeline fails before quality gate and feature construction.
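
The contract rules above (lowercased and trimmed column names, required columns, no duplicate dates, rows sorted by date) can be illustrated with a minimal standard-library sketch. The helper name `load_pit_rows` is hypothetical and is not the project's actual loader; the sketch also assumes ISO `YYYY-MM-DD` dates so that string ordering matches date ordering.

```python
import csv
import io

# Required columns copied from the list above.
REQUIRED_PIT_COLUMNS = [
    "date", "open", "high", "low", "close", "volume",
    "hs300_close", "star50_close", "csi1000_close",
    "pct_constituents_above_20dma", "pct_constituents_above_60dma",
    "pct_new_high_20", "pct_new_low_20",
    "eq_weight_ret_5", "weighted_ret_5",
    "top3_contribution_5", "top1_contribution_5", "top10_contribution_5",
    "sector_concentration_20", "corr_spike_20", "dispersion_20",
]


def load_pit_rows(text):
    """Parse CSV text and apply the contract rules (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    # Normalize column names: lowercase, surrounding whitespace removed.
    header = [name.strip().lower() for name in next(reader)]
    missing = [c for c in REQUIRED_PIT_COLUMNS if c not in header]
    if missing:
        # Fail before any quality gate or feature construction would run.
        raise ValueError(f"missing required PIT columns: {missing}")
    rows = [
        dict(zip(header, (v.strip() for v in values)))
        for values in reader
        if values  # skip blank lines
    ]
    seen = set()
    for row in rows:
        if row["date"] in seen:
            raise ValueError(f"duplicate trading date: {row['date']}")
        seen.add(row["date"])
    # Assumes ISO dates, which sort correctly as plain strings.
    return sorted(rows, key=lambda r: r["date"])
```

Failing on missing columns before anything else mirrors the ordering described above, where the check happens ahead of the quality gate.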

### Data quality gate modes

- Non-strict (default): pipeline continues and records warnings when critical-column coverage is below threshold.
- Strict (`--strict-data`): pipeline stops only when configured `blocking_columns` are breached; non-blocking breaches remain warnings.

Coverage threshold configuration:

- Config defaults: `config/regime.yaml` -> `data_quality.default_min_coverage` and `data_quality.column_min_coverage`
- CLI override: `--min-coverage`
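
As a rough illustration of how the two gate modes differ, the sketch below computes non-null coverage per column and assigns `error` only to breached blocking columns in strict mode. The function names and the shape of the returned dictionary are assumptions made for this example; the real `data_quality_summary.json` layout may differ.

```python
import math


def is_null(value):
    """Treat None, empty strings, and NaN as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def coverage(values):
    """Share of non-null entries in a column (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(1 for v in values if not is_null(v)) / len(values)


def run_quality_gate(columns, default_min_coverage, column_min_coverage=None,
                     blocking_columns=(), strict=False):
    """Return a summary dict; key names here are illustrative only."""
    column_min_coverage = column_min_coverage or {}
    breaches = []
    for name, values in columns.items():
        # A per-column threshold overrides the default when present.
        threshold = column_min_coverage.get(name, default_min_coverage)
        cov = coverage(values)
        if cov < threshold:
            # Only blocking columns stop the pipeline, and only in strict mode.
            severity = "error" if strict and name in blocking_columns else "warning"
            breaches.append({"column": name, "coverage": cov,
                             "threshold": threshold, "severity": severity})
    passed = not any(b["severity"] == "error" for b in breaches)
    return {"mode": "strict" if strict else "non_strict",
            "passed": passed, "breaches": breaches}
```

With this shape, a caller would stop the run only when `passed` is false, and log the remaining breaches as warnings.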

### Output artifact

Each run writes `data_quality_summary.json` into the output directory.
This artifact includes gate mode, pass/fail status, breach severities (`error`/`warning`), and field-level coverage metrics.

### Example commands

```bash
python pipelines/run_demo.py \
  --pit-csv path/to/chinext50_pit.csv \
  --strict-data \
  --min-coverage 0.98 \
  --output-dir outputs/real_data_demo
```

```bash
python pipelines/frozen_hypothesis_validation.py \
  --pit-csv path/to/chinext50_pit.csv \
  --strict-data \
  --min-coverage 0.98 \
  --output-dir outputs/frozen_validation_real
```

## Build Point-In-Time (PIT) Dataset

Use `pipelines/build_pit_dataset.py` to create a reusable point-in-time table before running strategy pipelines.

### Command

```bash
python pipelines/build_pit_dataset.py \
  --market-csv path/to/chinext50_market.csv \
  --sidecar-csv path/to/chinext50_benchmark_sidecar.csv \
  --sidecar-csv path/to/chinext50_breadth_sidecar.csv \
  --output-path outputs/pit/chinext50_pit.csv
```

Optional quality controls:

- `--strict-data`: block PIT output when quality breaches occur
- `--min-coverage 0.98`: override minimum non-null coverage threshold
- `--config path/to/regime.yaml`: load custom quality defaults

### Output semantics

- Always writes `pit_quality_summary.json` in the same output directory.
- On success, writes PIT data to `--output-path` (`.csv` or `.parquet`).
- In strict failure mode, PIT file is not written, but `pit_quality_summary.json` is still written for diagnostics.
- Quality summary includes source metadata:
  - `sources.market_path`
  - `sources.sidecar_paths`
  - `sources.sidecar_count`
  - `sources.merged_row_count`
  - `pit_columns`

## Real Data Ingestion

Use `pipelines/ingest_real_data.py` to fetch/load source data, publish `raw` + `staging` layers, and output final PIT in one run.

### CSV provider (local source files)

```bash
python pipelines/ingest_real_data.py \
  --provider csv \
  --market-csv path/to/chinext50_market.csv \
  --hs300-csv path/to/hs300.csv \
  --star50-csv path/to/star50.csv \
  --csi1000-csv path/to/csi1000.csv \
  --breadth-csv path/to/chinext50_breadth.csv \
  --output-dir outputs/ingestion
```

### Akshare provider (online fetch + local breadth)

```bash
python pipelines/ingest_real_data.py \
  --provider akshare \
  --market-symbol 159915 \
  --market-symbol-type etf \
  --hs300-symbol 000300 \
  --star50-symbol 000688 \
  --csi1000-symbol 000852 \
  --start-date 2018-01-01 \
  --end-date 2026-04-09 \
  --breadth-csv path/to/chinext50_breadth.csv \
  --output-dir outputs/ingestion
```

### Akshare + Mairui fallback (recommended when Akshare is missing fields or unavailable)

```bash
python pipelines/ingest_real_data.py \
  --provider akshare \
  --market-symbol 159915 \
  --market-symbol-type etf \
  --breadth-csv path/to/chinext50_breadth.csv \
  --mairui-licence YOUR_MAIRUI_LICENCE \
  --mairui-market-code 399673.SZ \
  --mairui-hs300-code 000300.SH \
  --mairui-star50-code 000688.SH \
  --mairui-csi1000-code 000852.SH \
  --start-date 2018-01-01 \
  --end-date 2026-04-09 \
  --output-dir outputs/ingestion
```

### Mairui provider (online fetch as primary)

```bash
python pipelines/ingest_real_data.py \
  --provider mairui \
  --mairui-licence YOUR_MAIRUI_LICENCE \
  --mairui-market-code 399673.SZ \
  --mairui-market-kind index \
  --mairui-hs300-code 000300.SH \
  --mairui-star50-code 000688.SH \
  --mairui-csi1000-code 000852.SH \
  --breadth-csv path/to/chinext50_breadth.csv \
  --start-date 2018-01-01 \
  --end-date 2026-04-09 \
  --output-dir outputs/ingestion
```

If breadth fields are also served by a Mairui endpoint, you can replace `--breadth-csv` with:

- `--mairui-breadth-url https://api.mairuiapi.com/xxx/{licence}`
- optional `--mairui-breadth-map-json path/to/rename_map.json`

If you do not trust an external breadth panel (or do not have one), you can derive breadth from constituent histories:

```bash
python pipelines/ingest_real_data.py \
  --provider mairui \
  --mairui-licence YOUR_MAIRUI_LICENCE \
  --mairui-market-code 399673.SZ \
  --mairui-market-kind index \
  --mairui-hs300-code 000300.SH \
  --mairui-star50-code 000688.SH \
  --mairui-csi1000-code 000852.SH \
  --derive-breadth \
  --breadth-index-symbol 399673 \
  --breadth-min-active-constituents 20 \
  --breadth-max-constituents 50 \
  --breadth-cache-dir outputs/ingestion/raw/constituent_history \
  --output-dir outputs/ingestion
```
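
To make the idea of derived breadth concrete, here is a minimal sketch of one breadth field, the share of constituents trading above their own trailing moving average. It is not the project's derivation code: the function name is hypothetical, the way `--breadth-min-active-constituents` is applied is an assumption, and the sketch ignores historical index membership changes, which a real point-in-time derivation has to respect.

```python
def pct_above_moving_average(closes_by_symbol, window=20, min_active=20):
    """Fraction of active constituents whose close is above their own
    trailing `window`-day mean, per date.

    closes_by_symbol maps symbol -> list of closes aligned to the same
    dates, with None where the constituent has no price. A date yields
    None when fewer than `min_active` constituents have a full window.
    """
    if not closes_by_symbol:
        return []
    n_days = len(next(iter(closes_by_symbol.values())))
    result = []
    for t in range(n_days):
        active = 0
        above = 0
        for closes in closes_by_symbol.values():
            if t + 1 < window:
                continue  # not enough history yet
            window_vals = closes[t + 1 - window:t + 1]
            if any(v is None for v in window_vals):
                continue  # constituent not active over the full window
            active += 1
            if closes[t] > sum(window_vals) / window:
                above += 1
        if active > 0 and active >= min_active:
            result.append(above / active)
        else:
            result.append(None)
    return result
```

Returning `None` instead of a ratio on thin dates keeps sparse early history from producing misleading breadth readings.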

Strict mode now includes a breadth-source integrity gate. Placeholder-like breadth inputs (for example, constant `weighted_ret_5 - eq_weight_ret_5`) are blocked before PIT publish.
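
One way to picture the placeholder check is a test for a constant spread between the two return columns. The sketch below is an assumption about how such a check could look, not the gate's actual rule; the function name, the tolerance, and the choice to flag very short series are all illustrative.

```python
def spread_looks_placeholder(weighted_ret_5, eq_weight_ret_5, tolerance=1e-12):
    """Return True when `weighted_ret_5 - eq_weight_ret_5` is (nearly) constant."""
    spreads = [
        w - e
        for w, e in zip(weighted_ret_5, eq_weight_ret_5)
        if w is not None and e is not None
    ]
    if len(spreads) < 2:
        # Too little data to judge; flagged as suspicious here (assumption).
        return True
    # A genuine breadth panel should show a spread that varies over time.
    return max(spreads) - min(spreads) <= tolerance
```

A constant spread usually means one column was synthesized from the other, which is why it is a useful signal for placeholder data.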

Output structure includes:

- `outputs/ingestion/raw/*.csv`
- `outputs/ingestion/raw/breadth_integrity_summary.json`
- `outputs/ingestion/raw/breadth_derivation_summary.json` (when `--derive-breadth` is used)
- `outputs/ingestion/staging/*.csv`
- `outputs/ingestion/pit/chinext50_pit.csv`
- `outputs/ingestion/pit/pit_quality_summary.json`
- `outputs/ingestion/ingestion_manifest.json`

## Frozen Walk-Forward (Train-Select / Test-Freeze)

`pipelines/frozen_hypothesis_validation.py` now runs a strict frozen-hypothesis process:

1. Evaluate predefined candidates only on each training window.
2. Select one winner by training utility (deterministic tie-break by candidate order).
3. Freeze that winner and evaluate the paired test window without re-selection.

### Candidate configuration

Candidates can come from:

- `config/regime.yaml` -> `frozen_validation.candidates`
- optional CLI override file: `--candidates-json path/to/candidates.json`

Window row requirements:

- `frozen_validation.min_train_rows` (or `--min-train-rows`)
- `frozen_validation.min_test_rows` (or `--min-test-rows`)

If a window is too short, it is marked as skipped with an explicit status.
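
The train-select / test-freeze steps and the short-window skip can be sketched together as follows. This is an illustrative outline, not the pipeline's code: `evaluate_utility` is a caller-supplied stand-in for the real backtest utility, the status strings are hypothetical, and the default row minimums are placeholders.

```python
def run_frozen_walkforward(windows, candidates, evaluate_utility,
                           min_train_rows=180, min_test_rows=60):
    """windows: list of (train_rows, test_rows) pairs.
    candidates: ordered list of candidate ids.
    evaluate_utility(candidate, rows) -> float (caller-supplied).
    """
    board = []
    for train_rows, test_rows in windows:
        if len(train_rows) < min_train_rows or len(test_rows) < min_test_rows:
            # Hypothetical status label for windows that are too short.
            board.append({"status": "skipped_short_window",
                          "selected_candidate_id": None})
            continue
        best_id, best_utility = None, None
        for candidate in candidates:
            utility = evaluate_utility(candidate, train_rows)
            # Strict '>' keeps the earlier candidate on ties, which gives
            # a deterministic tie-break by candidate order.
            if best_utility is None or utility > best_utility:
                best_id, best_utility = candidate, utility
        # The winner is frozen: the test window is scored once, with no
        # re-selection based on test results.
        test_utility = evaluate_utility(best_id, test_rows)
        board.append({"status": "ok",
                      "selected_candidate_id": best_id,
                      "train_utility": best_utility,
                      "test_utility": test_utility})
    return board
```

Because the test window never influences which candidate is chosen, the test utility stays an out-of-sample measurement for that window.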

### Audit outputs

`frozen_validation_board.csv` now includes:

- window ranges (`train_*`, `test_*`)
- `status`
- `selected_candidate_id`
- `selected_candidate_overrides` (serialized JSON)
- prefixed train/test metrics such as `train_utility_total_score` and `test_utility_total_score`

`frozen_validation_summary.json` now includes:

- processed/skipped window counts
- positive test-utility ratio
- selected candidate distribution
- status distribution

### Example

```bash
python pipelines/frozen_hypothesis_validation.py \
  --pit-csv path/to/chinext50_pit.csv \
  --candidates-json path/to/frozen_candidates.json \
  --min-train-rows 180 \
  --min-test-rows 60 \
  --output-dir outputs/frozen_validation_real
```

## Real Walk-Forward Report

Use `pipelines/real_walkforward_report.py` to generate a review-ready bundle from full PIT input:

- `data_quality_summary.json`
- `frozen_validation_board.csv`
- `real_walkforward_summary.json`
- `real_walkforward_report.md`

```bash
python pipelines/real_walkforward_report.py \
  --pit-csv path/to/chinext50_pit.csv \
  --strict-data \
  --output-dir outputs/real_walkforward_report
```

## Event-Anchored Diagnostics

`run_demo` now outputs transition-anchor diagnostics with explicit event taxonomy:

- `crash_onset`
- `false_rebound`
- `true_repair`
- `crowded_unwind`
- `state_transition` (fallback class for other transitions)

### Event artifacts

- `event_log.csv`: per-transition anchor details (`event_date`, `from_state`, `to_state`, `event_type`, forward returns, exposure context)
- `event_summary.csv`: event-type grouped averages and counts

Classification logic is rule-based on state transitions plus forward-window confirmation signals for rebound quality.
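
The README does not spell out the classification rules, so the sketch below only shows the general shape of a rule-based classifier over state transitions with a forward-return confirmation. Every rule in it is a hypothetical stand-in chosen for illustration (including the use of a single forward return and a zero threshold), not the project's actual logic.

```python
def classify_transition(from_state, to_state, forward_return):
    """Map a state transition to an event type (illustrative rules only)."""
    if to_state == "risk_off":
        # Assumed rule: leaving a late, crowded phase is an unwind;
        # any other move into risk_off is treated as a crash onset.
        if from_state == "euphoric_late":
            return "crowded_unwind"
        return "crash_onset"
    if from_state == "risk_off" and to_state == "repair":
        # Assumed confirmation: a positive forward return marks a true
        # repair, otherwise the rebound is classified as false.
        return "true_repair" if forward_return > 0 else "false_rebound"
    # Fallback class for all other transitions.
    return "state_transition"
```

The fallback branch corresponds to the `state_transition` class, so every transition receives exactly one label.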

## Execution Layer Constraints and Tracking Diagnostics

Backtest execution now includes configurable constraints for better ETF-style realism:

- `trading.extreme_day_move_threshold`: absolute executed return threshold that triggers cost amplification
- `trading.extreme_day_cost_multiplier`: multiplier applied to base trading cost on extreme days
- `trading.gap_slippage_factor`: additive gap shock cost factor using `abs(gap_open) * turnover`

New ledger diagnostics:

- `tracking_difference`: `strategy_return_net - strategy_return_gross`
- `tracking_error_20`: 20-day rolling std of `tracking_difference`

New summary metrics:

- `tracking_diff_mean`
- `tracking_diff_abs_mean`
- `tracking_error_20_p95`
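
A minimal sketch of how these constraints and diagnostics could fit together is shown below. It follows the descriptions above, but several details are assumptions: the parameter `base_cost_rate` is a hypothetical name, the extreme-day test uses `>=`, and the rolling standard deviation uses the sample formula.

```python
import statistics


def trading_cost(turnover, executed_return, gap_open, base_cost_rate,
                 extreme_day_move_threshold, extreme_day_cost_multiplier,
                 gap_slippage_factor):
    """Per-day cost as a fraction of portfolio value (illustrative)."""
    cost = base_cost_rate * turnover
    if abs(executed_return) >= extreme_day_move_threshold:
        # Amplify the base trading cost on extreme days.
        cost *= extreme_day_cost_multiplier
    # Additive gap shock cost: abs(gap_open) * turnover, scaled by the factor.
    cost += gap_slippage_factor * abs(gap_open) * turnover
    return cost


def tracking_diagnostics(gross_returns, net_returns, window=20):
    """Return (tracking_difference, rolling tracking error) as lists."""
    diff = [net - gross for gross, net in zip(gross_returns, net_returns)]
    rolling = []
    for t in range(len(diff)):
        if t + 1 < window:
            rolling.append(None)  # not enough history for a full window
        else:
            # Sample standard deviation over the trailing window.
            rolling.append(statistics.stdev(diff[t + 1 - window:t + 1]))
    return diff, rolling
```

Since net returns are gross returns minus costs, `tracking_difference` is zero or negative in this sketch whenever costs are non-negative.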

### Execution Constraint Calibration

Use `pipelines/calibrate_execution_constraints.py` to sweep execution parameters and output a recommendation:

- `execution_calibration_grid.csv`
- `execution_calibration_recommendation.json`

```bash
python pipelines/calibrate_execution_constraints.py \
  --pit-csv path/to/chinext50_pit.csv \
  --cost-multipliers 1.0,1.25,1.5,1.75 \
  --gap-slippage-factors 0.0,0.01,0.02,0.03 \
  --output-dir outputs/execution_calibration
```

### Additional Optional Concentration Inputs

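To improve crowding diagnostics, provide the following columns. They are described here as optional, but the "Required PIT columns" list above includes all three, so the runtime entrypoints expect them: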

- `top1_contribution_5`
- `top10_contribution_5`
- `sector_concentration_20`

## Regime Lite (Small-Team Runtime)

Use `pipelines/regime_lite_run.py` for a minimal operational workflow:

- 3 states only: `risk_off`, `chop`, `trend`
- fixed base exposures: `0.0`, `0.35`, `0.80`
- daily exposure step cap: `0.20`
- explicit execution profiles:
  - `baseline`: `lag1` timing, no overlay
  - `promoted_fast_entry_hold3`: prior promoted fixed-hold reference, based on `combo_fast_hold3`
  - `promoted_fast_entry_adaptive_extend`: current preferred profile after adaptive keep-vs-replace closure, based on `combo_fast_adaptive_extend`
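
The interaction between fixed base exposures and the daily step cap can be sketched as below. The pairing of states to exposures follows the order in which they are listed above, which is an assumption, and the sketch leaves out execution timing and overlays entirely.

```python
# Assumed mapping, pairing the listed states and exposures in order.
BASE_EXPOSURE = {"risk_off": 0.0, "chop": 0.35, "trend": 0.80}
MAX_DAILY_STEP = 0.20


def step_capped_exposures(states, start=0.0):
    """Move toward each day's base exposure by at most MAX_DAILY_STEP."""
    exposures = []
    current = start
    for state in states:
        target = BASE_EXPOSURE[state]
        # Clamp the daily change to the step cap in both directions.
        delta = max(-MAX_DAILY_STEP, min(MAX_DAILY_STEP, target - current))
        # Rounding keeps floating-point noise out of the exposure path.
        current = round(current + delta, 10)
        exposures.append(current)
    return exposures
```

Starting from zero exposure, five consecutive `trend` days would step through 0.2, 0.4, 0.6, and 0.8 and then hold at 0.8, so a full entry takes four sessions under the cap.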

```bash
python pipelines/regime_lite_run.py \
  --pit-csv path/to/chinext50_pit.csv \
  --profile promoted_fast_entry_adaptive_extend \
  --output-dir outputs/regime_lite
```

Current preferred lite runtime profile:

- `promoted_fast_entry_adaptive_extend`
- promotion decision artifact: `outputs/regime_lite_promotion_20260424/regime_lite_promotion_decision.json`
- rationale: the bounded adaptive closure concluded `adaptive-replace-candidate`, selecting `combo_fast_adaptive_extend` to replace the prior fixed-hold reference while keeping `baseline` as rollback-safe reference
- rollback/reference profile: `baseline`
- inspect `promotion_decision.active_adaptive_mode` plus `regime_lite_summary.json -> execution_profile.adaptive_hold_mode` / `adaptive_hold_context` to understand the active bounded hold semantics before operating it

Converged lite operational flow:

1. Run the preferred profile with `pipelines/regime_lite_run.py --profile promoted_fast_entry_adaptive_extend`.
2. Inspect `regime_lite_runtime_health.json` for bounded status `healthy` / `review` / `hold` / `rollback_recommended`.
3. Inspect `regime_lite_post_promotion_review.json` for bounded decision `keep_promoted` / `hold_and_review` / `recommend_rollback`.
4. In post-promotion review, treat `recent_window_evidence` as the primary decision basis; `full_history_reference` is reference context only, and `segmented_diagnostics` is for bounded diagnosis rather than override.
5. If health stays `healthy` and review stays `keep_promoted`, continue normal lite operation.
6. If health moves to `review` or review moves to `hold_and_review`, pause new tuning and inspect the bounded reasons before any change.
7. If health reaches `rollback_recommended` or review reaches `recommend_rollback`, switch back to `baseline` as the rollback-safe profile and keep the lite path scoped to that runtime handoff.
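
Steps 5 to 7 of this flow reduce to a small decision function. The sketch below takes the two status strings as plain arguments because the JSON key names inside the health and review artifacts are not documented here; the returned action labels are hypothetical, and treating the `hold` health status as a pause is an assumption, since the flow above does not mention it.

```python
HEALTH_STATUSES = {"healthy", "review", "hold", "rollback_recommended"}
REVIEW_DECISIONS = {"keep_promoted", "hold_and_review", "recommend_rollback"}


def lite_next_action(health_status, review_decision):
    """Combine runtime health and post-promotion review into one action."""
    if health_status not in HEALTH_STATUSES:
        raise ValueError(f"unknown health status: {health_status}")
    if review_decision not in REVIEW_DECISIONS:
        raise ValueError(f"unknown review decision: {review_decision}")
    # Step 7: either rollback signal switches back to the baseline profile.
    if (health_status == "rollback_recommended"
            or review_decision == "recommend_rollback"):
        return "switch_to_baseline"
    # Step 5: both signals are clean, continue normal lite operation.
    if health_status == "healthy" and review_decision == "keep_promoted":
        return "continue"
    # Step 6 (and, by assumption, the `hold` status): pause tuning and inspect.
    return "pause_and_inspect"
```

Checking the rollback condition first means a rollback signal from either artifact takes precedence over a clean signal from the other.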

Artifacts:

- `regime_lite_daily_ledger.csv`
- `regime_lite_summary.json`
- `regime_lite_report.md`
- `regime_lite_runtime_health.json`
- `regime_lite_post_promotion_review.json`

### Execution Timing + Entry-Exit Experiments

Run controlled A/B experiments for:

- execution timing: `lag1` vs `fast_entry`
- entry-specific exit overlay: short trend-entry hold floor with stop guard

```bash
python pipelines/regime_lite_experiments.py \
  --pit-csv path/to/chinext50_pit.csv \
  --output-dir outputs/regime_lite_experiments
```

Artifacts:

- `regime_lite_experiment_results.csv`
- `regime_lite_experiment_summary.json`
- `regime_lite_experiment_report.html`
- `regime_lite_experiment_baseline_ledger.csv`
- `regime_lite_experiment_best_ledger.csv`
- `regime_lite_promotion_decision.json`

The experiment board now separates:

- recommendation candidate: best discovery-sample variant
- promotion status: final `promote` / `hold` / `reject` decision from deterministic holdout validation
- governance handoff target: the promoted runtime profile that must flow into bounded lite runtime health and post-promotion review

## Verification

Project health check:

```bash
python -m pytest -q
```

The repository now pins pytest collection to the main `tests/` directory, so historical deliverable bundles and backups do not pollute the default test run.