# ChiNext 50 Regime Project Starter

This is a daily-frequency, regime-aware exposure control project skeleton built **specifically for the ChiNext 50 index**. Its goal is not to predict each day's direction, but to achieve the following as far as possible:

- lose less during sharp sell-offs and crowded periods
- rebuild exposure gradually during genuine repair phases
- keep most of the participation during the main uptrend

## What is already in place

- `data/`: CSV/parquet readers plus a synthetic demo data generator
- `features/`: three feature layers (price, breadth, relative strength)
- `model/`: continuous scores, a 5-state state machine, position mapping, and hard vetoes
- `backtest/`: approximate next-open execution backtest, utility, event slicing
- `pipelines/`: demo pipeline plus frozen-hypothesis validation
- `tests/`: minimal end-to-end tests

## Core states

- `risk_off`
- `repair`
- `trend`
- `chop`
- `euphoric_late`

## Core scores

- `trend_score`
- `breadth_score`
- `stress_score`
- `crowding_score`
- `repair_score`

Plus three path-type hazards:

- `down_hazard`
- `repair_hazard`
- `rebound_hazard`

## Run the demo

From the project root:

```bash
python pipelines/run_demo.py \
  --pit-csv path/to/chinext50_pit.csv \
  --output-dir outputs/demo
```

This uses synthetic data to generate:

- `outputs/demo/daily_ledger.csv`
- `outputs/demo/event_summary.csv`
- `outputs/demo/metrics_summary.json`

## Run the frozen-hypothesis validation

```bash
python pipelines/frozen_hypothesis_validation.py \
  --pit-csv path/to/chinext50_pit.csv \
  --output-dir outputs/frozen_validation
```

## Switch to real data

Your CSV/parquet needs at least these columns:

- `date`
- `open`
- `high`
- `low`
- `close`
- `volume`

It is recommended to also provide:

- `hs300_close`
- `star50_close`
- `csi1000_close`
- `pct_constituents_above_20dma`
- `pct_constituents_above_60dma`
- `pct_new_high_20`
- `pct_new_low_20`
- `eq_weight_ret_5`
- `weighted_ret_5`
- `top3_contribution_5`
- `corr_spike_20`
- `dispersion_20`

How to run:

```bash
python pipelines/run_demo.py \
  --pit-csv path/to/chinext50_pit.csv \
  --output-dir outputs/real_data_demo
```

## Important notes

- The current scaffold is **not** proof of performance. It only wires up the closed loop "features -> scores -> states -> positions -> backtest -> event diagnostics".
- Economic effect has to be established through strict walk-forward validation after you plug in **real ChiNext 50 index/ETF history plus historical constituent breadth data**.
- In the first phase, do not expand to multiple markets or a complex readiness/portability system at the same time.

## Real Data Input Contract and Quality Gate

The runtime pipelines now require a full point-in-time dataset and can optionally block low-quality data before feature construction.
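As a rough illustration of how a coverage-based gate of this kind might behave (the two modes are described under "Data quality gate modes"), the sketch below computes non-null coverage per column and assigns `error` or `warning` severities. The function name `evaluate_coverage_gate`, the row representation (a list of dicts), and the shape of the returned summary are assumptions for illustration, not the project's actual implementation.

```python
from typing import Dict, List, Optional, Set


def evaluate_coverage_gate(
    rows: List[Dict[str, Optional[float]]],
    min_coverage: Dict[str, float],
    blocking_columns: Set[str],
    strict: bool = False,
) -> dict:
    """Hypothetical gate: compare per-column non-null coverage to a threshold."""
    breaches = []
    for column, threshold in min_coverage.items():
        # Coverage is the share of rows with a non-null value in this column.
        present = sum(1 for row in rows if row.get(column) is not None)
        coverage = present / len(rows) if rows else 0.0
        if coverage < threshold:
            # Assumption: only strict mode escalates blocking columns to errors.
            is_blocking = strict and column in blocking_columns
            breaches.append(
                {
                    "column": column,
                    "coverage": coverage,
                    "threshold": threshold,
                    "severity": "error" if is_blocking else "warning",
                }
            )
    passed = all(breach["severity"] != "error" for breach in breaches)
    return {
        "mode": "strict" if strict else "non_strict",
        "passed": passed,
        "breaches": breaches,
    }


if __name__ == "__main__":
    demo_rows = [
        {"close": 1.0, "volume": 10.0},
        {"close": None, "volume": 20.0},
        {"close": 3.0, "volume": 30.0},
        {"close": 4.0, "volume": 40.0},
    ]
    thresholds = {"close": 0.98, "volume": 0.98}
    print(evaluate_coverage_gate(demo_rows, thresholds, {"close"}, strict=True))
```

In this sketch a breach on a non-blocking column never fails the run, which mirrors the described behavior that non-blocking breaches remain warnings even in strict mode.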
### Required PIT columns

`date`, `open`, `high`, `low`, `close`, `volume`, `hs300_close`, `star50_close`, `csi1000_close`, `pct_constituents_above_20dma`, `pct_constituents_above_60dma`, `pct_new_high_20`, `pct_new_low_20`, `eq_weight_ret_5`, `weighted_ret_5`, `top3_contribution_5`, `top1_contribution_5`, `top10_contribution_5`, `sector_concentration_20`, `corr_spike_20`, `dispersion_20`

- Column names are normalized to lowercase with surrounding whitespace removed.
- Duplicate trading dates are rejected.
- Rows are sorted by trading date before downstream processing.
- Runtime entrypoints no longer merge sidecars on the fly.
- If required PIT columns are missing, the pipeline fails before quality gate and feature construction.

### Data quality gate modes

- Non-strict (default): pipeline continues and records warnings when critical-column coverage is below threshold.
- Strict (`--strict-data`): pipeline stops only when configured `blocking_columns` are breached; non-blocking breaches remain warnings.

Coverage threshold configuration:

- Config defaults: `config/regime.yaml` -> `data_quality.default_min_coverage` and `data_quality.column_min_coverage`
- CLI override: `--min-coverage`

### Output artifact

Each run writes `data_quality_summary.json` into the output directory. This artifact includes gate mode, pass/fail status, breach severities (`error`/`warning`), and field-level coverage metrics.

### Example commands

```bash
python pipelines/run_demo.py \
  --pit-csv path/to/chinext50_pit.csv \
  --strict-data \
  --min-coverage 0.98 \
  --output-dir outputs/real_data_demo
```

```bash
python pipelines/frozen_hypothesis_validation.py \
  --pit-csv path/to/chinext50_pit.csv \
  --strict-data \
  --min-coverage 0.98 \
  --output-dir outputs/frozen_validation_real
```

## Build Point-In-Time (PIT) Dataset

Use `pipelines/build_pit_dataset.py` to create a reusable point-in-time table before running strategy pipelines.
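Conceptually, building such a table amounts to a date-keyed join of the market file with each sidecar. The sketch below is not the script's code; it is a minimal illustration that assumes ISO `YYYY-MM-DD` date strings, left-join semantics on the market dates, and hypothetical helper names (`normalize_rows`, `build_pit`). It applies the same normalization rules as the input contract: lowercase and trimmed column names, no duplicate dates, rows sorted by date.

```python
from typing import Dict, List


def normalize_rows(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Lowercase and trim column names, reject duplicate dates, sort by date."""
    cleaned = [{key.strip().lower(): value for key, value in row.items()} for row in rows]
    dates = [row["date"] for row in cleaned]
    if len(dates) != len(set(dates)):
        raise ValueError("duplicate trading dates found")
    # ISO date strings sort chronologically when compared as plain strings.
    return sorted(cleaned, key=lambda row: row["date"])


def build_pit(
    market_rows: List[Dict[str, object]],
    sidecars: List[List[Dict[str, object]]],
) -> List[Dict[str, object]]:
    """Left-join each sidecar onto the market rows by date (assumed semantics)."""
    pit = normalize_rows(market_rows)
    for sidecar_rows in sidecars:
        sidecar = normalize_rows(sidecar_rows)
        by_date = {row["date"]: row for row in sidecar}
        columns = sorted({key for row in sidecar for key in row if key != "date"})
        for row in pit:
            match = by_date.get(row["date"], {})
            for column in columns:
                # Dates missing from the sidecar become explicit nulls.
                row[column] = match.get(column)
    return pit


if __name__ == "__main__":
    market = [
        {" Date ": "2024-01-03", "Close": 2.0},
        {" Date ": "2024-01-02", "Close": 1.0},
    ]
    benchmark = [{"date": "2024-01-02", "HS300_Close": 10.0}]
    for pit_row in build_pit(market, [benchmark]):
        print(pit_row)
```

Keeping unmatched dates as explicit nulls, rather than dropping rows, is what lets a later coverage check see how incomplete each sidecar column is.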
### Command

```bash
python pipelines/build_pit_dataset.py \
  --market-csv path/to/chinext50_market.csv \
  --sidecar-csv path/to/chinext50_benchmark_sidecar.csv \
  --sidecar-csv path/to/chinext50_breadth_sidecar.csv \
  --output-path outputs/pit/chinext50_pit.csv
```

Optional quality controls:

- `--strict-data`: block PIT output when quality breaches occur
- `--min-coverage 0.98`: override minimum non-null coverage threshold
- `--config path/to/regime.yaml`: load custom quality defaults

### Output semantics

- Always writes `pit_quality_summary.json` in the same output directory.
- On success, writes PIT data to `--output-path` (`.csv` or `.parquet`).
- In strict failure mode, PIT file is not written, but `pit_quality_summary.json` is still written for diagnostics.
- Quality summary includes source metadata:
  - `sources.market_path`
  - `sources.sidecar_paths`
  - `sources.sidecar_count`
  - `sources.merged_row_count`
  - `pit_columns`

## Real Data Ingestion

Use `pipelines/ingest_real_data.py` to fetch/load source data, publish `raw` + `staging` layers, and output final PIT in one run.
### CSV provider (local source files)

```bash
python pipelines/ingest_real_data.py \
  --provider csv \
  --market-csv path/to/chinext50_market.csv \
  --hs300-csv path/to/hs300.csv \
  --star50-csv path/to/star50.csv \
  --csi1000-csv path/to/csi1000.csv \
  --breadth-csv path/to/chinext50_breadth.csv \
  --output-dir outputs/ingestion
```

### Akshare provider (online fetch + local breadth)

```bash
python pipelines/ingest_real_data.py \
  --provider akshare \
  --market-symbol 159915 \
  --market-symbol-type etf \
  --hs300-symbol 000300 \
  --star50-symbol 000688 \
  --csi1000-symbol 000852 \
  --start-date 2018-01-01 \
  --end-date 2026-04-09 \
  --breadth-csv path/to/chinext50_breadth.csv \
  --output-dir outputs/ingestion
```

### Akshare + Mairui fallback (recommended when Akshare is missing fields or unavailable)

```bash
python pipelines/ingest_real_data.py \
  --provider akshare \
  --market-symbol 159915 \
  --market-symbol-type etf \
  --breadth-csv path/to/chinext50_breadth.csv \
  --mairui-licence YOUR_MAIRUI_LICENCE \
  --mairui-market-code 399673.SZ \
  --mairui-hs300-code 000300.SH \
  --mairui-star50-code 000688.SH \
  --mairui-csi1000-code 000852.SH \
  --start-date 2018-01-01 \
  --end-date 2026-04-09 \
  --output-dir outputs/ingestion
```

### Mairui provider (online fetch as primary)

```bash
python pipelines/ingest_real_data.py \
  --provider mairui \
  --mairui-licence YOUR_MAIRUI_LICENCE \
  --mairui-market-code 399673.SZ \
  --mairui-market-kind index \
  --mairui-hs300-code 000300.SH \
  --mairui-star50-code 000688.SH \
  --mairui-csi1000-code 000852.SH \
  --breadth-csv path/to/chinext50_breadth.csv \
  --start-date 2018-01-01 \
  --end-date 2026-04-09 \
  --output-dir outputs/ingestion
```

If breadth fields are also served by a Mairui endpoint, you can replace `--breadth-csv` with:

- `--mairui-breadth-url https://api.mairuiapi.com/xxx/{licence}`
- optional `--mairui-breadth-map-json path/to/rename_map.json`

If you do not trust an external breadth panel (or do not have one), you can derive breadth from constituent histories:

```bash
python pipelines/ingest_real_data.py \
  --provider mairui \
  --mairui-licence YOUR_MAIRUI_LICENCE \
  --mairui-market-code 399673.SZ \
  --mairui-market-kind index \
  --mairui-hs300-code 000300.SH \
  --mairui-star50-code 000688.SH \
  --mairui-csi1000-code 000852.SH \
  --derive-breadth \
  --breadth-index-symbol 399673 \
  --breadth-min-active-constituents 20 \
  --breadth-max-constituents 50 \
  --breadth-cache-dir outputs/ingestion/raw/constituent_history \
  --output-dir outputs/ingestion
```

Strict mode now includes a breadth-source integrity gate. Placeholder-like breadth inputs (for example, constant `weighted_ret_5 - eq_weight_ret_5`) are blocked before PIT publish.

Output structure includes:

- `outputs/ingestion/raw/*.csv`
- `outputs/ingestion/raw/breadth_integrity_summary.json`
- `outputs/ingestion/raw/breadth_derivation_summary.json` (when `--derive-breadth` is used)
- `outputs/ingestion/staging/*.csv`
- `outputs/ingestion/pit/chinext50_pit.csv`
- `outputs/ingestion/pit/pit_quality_summary.json`
- `outputs/ingestion/ingestion_manifest.json`

## Frozen Walk-Forward (Train-Select / Test-Freeze)

`pipelines/frozen_hypothesis_validation.py` now runs a strict frozen-hypothesis process:

1. Evaluate predefined candidates only on each training window.
2. Select one winner by training utility (deterministic tie-break by candidate order).
3. Freeze that winner and evaluate the paired test window without re-selection.

### Candidate configuration

Candidates can come from:

- `config/regime.yaml` -> `frozen_validation.candidates`
- optional CLI override file: `--candidates-json path/to/candidates.json`

Window row requirements:

- `frozen_validation.min_train_rows` (or `--min-train-rows`)
- `frozen_validation.min_test_rows` (or `--min-test-rows`)

If a window is too short, it is marked as skipped with an explicit status.
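The train-select / test-freeze loop described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's code: windows are plain lists of daily returns, the `utility` callable and the candidate dict shape are hypothetical, and the status string `skipped_short_window` is invented for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Candidate = Dict[str, object]
Window = Tuple[Sequence[float], Sequence[float]]


def frozen_walk_forward(
    windows: List[Window],
    candidates: List[Candidate],
    utility: Callable[[Candidate, Sequence[float]], float],
    min_train_rows: int,
    min_test_rows: int,
) -> List[dict]:
    """Select one candidate per training window, then score it on the test window."""
    board = []
    for train_rows, test_rows in windows:
        if len(train_rows) < min_train_rows or len(test_rows) < min_test_rows:
            board.append({"status": "skipped_short_window", "selected_candidate_id": None})
            continue
        train_scores = [utility(candidate, train_rows) for candidate in candidates]
        # max() returns the first maximal index, so ties resolve by candidate order.
        best_index = max(range(len(candidates)), key=lambda i: train_scores[i])
        winner = candidates[best_index]
        board.append(
            {
                "status": "processed",
                "selected_candidate_id": winner["id"],
                "train_utility": train_scores[best_index],
                # The winner is frozen: no re-selection happens on test data.
                "test_utility": utility(winner, test_rows),
            }
        )
    return board


if __name__ == "__main__":
    demo_candidates = [{"id": "low", "exposure": 0.3}, {"id": "high", "exposure": 0.8}]

    def toy_utility(candidate: Candidate, rows: Sequence[float]) -> float:
        # Toy utility: exposure-scaled sum of returns (illustrative only).
        return float(candidate["exposure"]) * sum(rows)

    demo_windows = [([1, 1, 1], [-1, -1]), ([1, -1, 0], [2, 2]), ([1], [1, 1])]
    for entry in frozen_walk_forward(demo_windows, demo_candidates, toy_utility, 3, 2):
        print(entry)
```

The point of the structure is that test data never influences selection: a candidate that wins in training is reported on the test window even when it performs poorly there.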
### Audit outputs

`frozen_validation_board.csv` now includes:

- window ranges (`train_*`, `test_*`)
- `status`
- `selected_candidate_id`
- `selected_candidate_overrides` (serialized JSON)
- prefixed train/test metrics such as `train_utility_total_score` and `test_utility_total_score`

`frozen_validation_summary.json` now includes:

- processed/skipped window counts
- positive test-utility ratio
- selected candidate distribution
- status distribution

### Example

```bash
python pipelines/frozen_hypothesis_validation.py \
  --pit-csv path/to/chinext50_pit.csv \
  --candidates-json path/to/frozen_candidates.json \
  --min-train-rows 180 \
  --min-test-rows 60 \
  --output-dir outputs/frozen_validation_real
```

## Real Walk-Forward Report

Use `pipelines/real_walkforward_report.py` to generate a review-ready bundle from full PIT input:

- `data_quality_summary.json`
- `frozen_validation_board.csv`
- `real_walkforward_summary.json`
- `real_walkforward_report.md`

```bash
python pipelines/real_walkforward_report.py \
  --pit-csv path/to/chinext50_pit.csv \
  --strict-data \
  --output-dir outputs/real_walkforward_report
```

## Event-Anchored Diagnostics

`run_demo` now outputs transition-anchor diagnostics with explicit event taxonomy:

- `crash_onset`
- `false_rebound`
- `true_repair`
- `crowded_unwind`
- `state_transition` (fallback class for other transitions)

### Event artifacts

- `event_log.csv`: per-transition anchor details (`event_date`, `from_state`, `to_state`, `event_type`, forward returns, exposure context)
- `event_summary.csv`: event-type grouped averages and counts

Classification logic is rule-based on state transitions plus forward-window confirmation signals for rebound quality.
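To show the general shape of such a rule-based classifier, here is a minimal sketch. The specific rules are invented for illustration and are not the project's actual mapping: which transitions count as `crash_onset` or `crowded_unwind`, the use of a single forward return as the confirmation signal, and the `confirm_threshold` parameter are all assumptions. Only the state names and event labels come from this README.

```python
def classify_transition(
    from_state: str,
    to_state: str,
    forward_return: float,
    confirm_threshold: float = 0.0,
) -> str:
    """Map a state transition plus a forward-window return to an event label."""
    # Hypothetical rule: leaving the late-euphoria state for a defensive state.
    if from_state == "euphoric_late" and to_state in {"risk_off", "chop"}:
        return "crowded_unwind"
    # Hypothetical rule: any other move into risk_off marks a crash onset.
    if to_state == "risk_off":
        return "crash_onset"
    # Hypothetical rule: a rebound out of risk_off is confirmed (or not)
    # by the return realized over the forward window.
    if from_state == "risk_off" and to_state == "repair":
        if forward_return > confirm_threshold:
            return "true_repair"
        return "false_rebound"
    # Fallback class for all other transitions.
    return "state_transition"


if __name__ == "__main__":
    examples = [
        ("trend", "risk_off", -0.05),
        ("euphoric_late", "chop", -0.02),
        ("risk_off", "repair", 0.04),
        ("risk_off", "repair", -0.03),
        ("chop", "trend", 0.01),
    ]
    for from_state, to_state, forward_return in examples:
        label = classify_transition(from_state, to_state, forward_return)
        print(from_state, "->", to_state, label)
```

Because the rebound label depends on a forward window, a classifier like this is a diagnostic tool only; it cannot be used as a trading signal at the transition date.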
## Execution Layer Constraints and Tracking Diagnostics

Backtest execution now includes configurable constraints for better ETF-style realism:

- `trading.extreme_day_move_threshold`: absolute executed return threshold that triggers cost amplification
- `trading.extreme_day_cost_multiplier`: multiplier applied to base trading cost on extreme days
- `trading.gap_slippage_factor`: additive gap shock cost factor using `abs(gap_open) * turnover`

New ledger diagnostics:

- `tracking_difference`: `strategy_return_net - strategy_return_gross`
- `tracking_error_20`: 20-day rolling std of `tracking_difference`

New summary metrics:

- `tracking_diff_mean`
- `tracking_diff_abs_mean`
- `tracking_error_20_p95`

### Execution Constraint Calibration

Use `pipelines/calibrate_execution_constraints.py` to sweep execution parameters and output a recommendation:

- `execution_calibration_grid.csv`
- `execution_calibration_recommendation.json`

```bash
python pipelines/calibrate_execution_constraints.py \
  --pit-csv path/to/chinext50_pit.csv \
  --cost-multipliers 1.0,1.25,1.5,1.75 \
  --gap-slippage-factors 0.0,0.01,0.02,0.03 \
  --output-dir outputs/execution_calibration
```

### Additional Optional Concentration Inputs

To improve crowding diagnostics, you can optionally provide:

- `top1_contribution_5`
- `top10_contribution_5`
- `sector_concentration_20`

## Regime Lite (Small-Team Runtime)

Use `pipelines/regime_lite_run.py` for a minimal operational workflow:

- 3 states only: `risk_off`, `chop`, `trend`
- fixed base exposures: `0.0`, `0.35`, `0.80`
- daily exposure step cap: `0.20`
- explicit execution profiles:
  - `baseline`: `lag1` timing, no overlay
  - `promoted_fast_entry_hold3`: prior promoted fixed-hold reference, based on `combo_fast_hold3`
  - `promoted_fast_entry_adaptive_extend`: current preferred profile after adaptive keep-vs-replace closure, based on `combo_fast_adaptive_extend`

```bash
python pipelines/regime_lite_run.py \
  --pit-csv path/to/chinext50_pit.csv \
  --profile promoted_fast_entry_adaptive_extend \
  --output-dir outputs/regime_lite
```

Current preferred lite runtime profile:

- `promoted_fast_entry_adaptive_extend`
- promotion decision artifact: `outputs/regime_lite_promotion_20260424/regime_lite_promotion_decision.json`
- rationale: the bounded adaptive closure concluded `adaptive-replace-candidate`, selecting `combo_fast_adaptive_extend` to replace the prior fixed-hold reference while keeping `baseline` as rollback-safe reference
- rollback/reference profile: `baseline`
- inspect `promotion_decision.active_adaptive_mode` plus `regime_lite_summary.json -> execution_profile.adaptive_hold_mode` / `adaptive_hold_context` to understand the active bounded hold semantics before operating it

Converged lite operational flow:

1. Run the preferred profile with `pipelines/regime_lite_run.py --profile promoted_fast_entry_adaptive_extend`.
2. Inspect `regime_lite_runtime_health.json` for bounded status `healthy` / `review` / `hold` / `rollback_recommended`.
3. Inspect `regime_lite_post_promotion_review.json` for bounded decision `keep_promoted` / `hold_and_review` / `recommend_rollback`.
4. In post-promotion review, treat `recent_window_evidence` as the primary decision basis; `full_history_reference` is reference context only, and `segmented_diagnostics` is for bounded diagnosis rather than override.
5. If health stays `healthy` and review stays `keep_promoted`, continue normal lite operation.
6. If health moves to `review` or review moves to `hold_and_review`, pause new tuning and inspect the bounded reasons before any change.
7. If health reaches `rollback_recommended` or review reaches `recommend_rollback`, switch back to `baseline` as the rollback-safe profile and keep the lite path scoped to that runtime handoff.
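Steps 5 to 7 of the flow above reduce to a small decision table, sketched here for operators who want to script the check. The function name `next_lite_action` and its return labels are hypothetical, and two details are assumptions rather than documented behavior: rollback signals take precedence over pause signals, and the `hold` health status is treated like `review`. How the two status values are read from the JSON artifacts is left out because their key names are not documented here.

```python
HEALTH_STATUSES = {"healthy", "review", "hold", "rollback_recommended"}
REVIEW_DECISIONS = {"keep_promoted", "hold_and_review", "recommend_rollback"}


def next_lite_action(health_status: str, review_decision: str) -> str:
    """Combine runtime health and post-promotion review into one operator action."""
    if health_status not in HEALTH_STATUSES:
        raise ValueError(f"unknown health status: {health_status}")
    if review_decision not in REVIEW_DECISIONS:
        raise ValueError(f"unknown review decision: {review_decision}")
    # Step 7: either rollback signal sends the runtime back to `baseline`.
    # Assumption: this check wins over any pause signal.
    if health_status == "rollback_recommended" or review_decision == "recommend_rollback":
        return "switch_to_baseline"
    # Step 6: pause tuning and inspect reasons.
    # Assumption: `hold` is handled the same way as `review`.
    if health_status in {"review", "hold"} or review_decision == "hold_and_review":
        return "pause_tuning_and_inspect"
    # Step 5: healthy plus keep_promoted means normal operation continues.
    return "continue_promoted"


if __name__ == "__main__":
    for health in sorted(HEALTH_STATUSES):
        for review in sorted(REVIEW_DECISIONS):
            print(f"{health:>22} + {review:<20} -> {next_lite_action(health, review)}")
```

Rejecting unknown status values outright keeps a renamed or misspelled field from silently falling through to normal operation.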
Artifacts:

- `regime_lite_daily_ledger.csv`
- `regime_lite_summary.json`
- `regime_lite_report.md`
- `regime_lite_runtime_health.json`
- `regime_lite_post_promotion_review.json`

### Execution Timing + Entry-Exit Experiments

Run controlled A/B experiments for:

- execution timing: `lag1` vs `fast_entry`
- entry-specific exit overlay: short trend-entry hold floor with stop guard

```bash
python pipelines/regime_lite_experiments.py \
  --pit-csv path/to/chinext50_pit.csv \
  --output-dir outputs/regime_lite_experiments
```

Artifacts:

- `regime_lite_experiment_results.csv`
- `regime_lite_experiment_summary.json`
- `regime_lite_experiment_report.html`
- `regime_lite_experiment_baseline_ledger.csv`
- `regime_lite_experiment_best_ledger.csv`
- `regime_lite_promotion_decision.json`

The experiment board now separates:

- recommendation candidate: best discovery-sample variant
- promotion status: final `promote` / `hold` / `reject` decision from deterministic holdout validation
- governance handoff target: the promoted runtime profile that must flow into bounded lite runtime health and post-promotion review

## Verification

Project health check:

```bash
py -m pytest -q
```

The repository now pins pytest collection to the main `tests/` directory, so historical deliverable bundles and backups do not pollute the default test run.