The current system has a real defensive effect, but the present end-to-end result is not primarily a threshold-tuning problem. It is first a system integrity problem:
- breadth_score and crowding_score are effectively broken on the supplied PIT dataset because one z-scored component is constant, so the weighted sum becomes all-NaN.
- down_hazard, repair_hazard, and rebound_hazard collapse to ~0.5 everywhere because the raw hazard inputs are NaN and get filled to zero inside the sigmoid.
- chop, trend, risk_off), so repair and euphoric_late logic is mostly dead.
- baseline and pro_risk produce identical exposure paths on the supplied run because coarse quantization collapses their differences.

Only after fixing these should you trust threshold tuning and objective redesign.
- chop, trend, and risk_off.
- breadth_score non-null ratio = 0.0
- crowding_score non-null ratio = 0.0
- down_hazard, repair_hazard, rebound_hazard = 0.5 nearly everywhere
- baseline and pro_risk exposure paths are identical

model/scores.py
The weighted score sums do not protect against NaN sub-components. If any sub-score is all-NaN, the whole composite score becomes all-NaN.
On the supplied PIT, concentration_spread_5 = weighted_ret_5 - eq_weight_ret_5 is constant at 0.002, so its rolling z-score has zero std and becomes all-NaN. This breaks both:
- breadth_score
- crowding_score

Hazards are built from raw formulas that reference broken scores. Then they are fed through:
- rolling_zscore(...)
- _sigmoid(series.fillna(0.0))

This turns missing hazard information into the neutral constant 0.5, which prevents the system from noticing it is effectively blind.
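The failure chain described above (constant component, NaN z-score, NaN composite, hazard pinned at 0.5) can be reproduced in a few lines. This is a minimal stdlib-only sketch, not the code in model/scores.py: every function name here is hypothetical, the 20-point window and the 0.4 sub-score are made-up inputs, and the NaN-safe variant renormalises weights over the live components, which is one possible fix rather than the one the project necessarily wants.

```python
import math

def zscore(window, eps=1e-12):
    """Z-score each value in a window; all-NaN when the window has (near-)zero std."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std < eps:  # constant input: the z-score is undefined
        return [math.nan] * len(window)
    return [(v - mean) / std for v in window]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def naive_weighted_sum(components, weights):
    """Plain weighted sum: a single NaN component makes the result NaN."""
    return sum(w * c for c, w in zip(components, weights))

def nan_safe_weighted_sum(components, weights):
    """Weighted mean over non-NaN components only; NaN only if all are missing."""
    live = [(c, w) for c, w in zip(components, weights) if not math.isnan(c)]
    total_weight = sum(w for _, w in live)
    if total_weight == 0.0:
        return math.nan
    return sum(w * c for c, w in live) / total_weight

def filled_hazard(raw):
    """Mimics _sigmoid(series.fillna(0.0)): a missing input becomes exactly 0.5."""
    return sigmoid(0.0 if math.isnan(raw) else raw)

def visible_hazard(raw):
    """Alternative: keep NaN so a blind hazard shows up in coverage checks."""
    return math.nan if math.isnan(raw) else sigmoid(raw)

# concentration_spread_5 is constant at 0.002 on the supplied PIT.
spread_z = zscore([0.002] * 20)
other_component = 0.4  # hypothetical healthy sub-score

naive_score = naive_weighted_sum([other_component, spread_z[-1]], [0.5, 0.5])
safe_score = nan_safe_weighted_sum([other_component, spread_z[-1]], [0.5, 0.5])
blind_hazard = filled_hazard(naive_score)
```

With the NaN-preserving `visible_hazard`, a non-null-ratio check on the hazard columns would flag the problem directly instead of reporting a plausible-looking 0.5.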
The policy layer uses coarse quantization:
{0.0, 0.25, 0.50, 0.75, 1.0}
As a result:
- trend = 0.95 and trend = 1.00 both quantize to 1.0
- repair and chop parameter tweaks collapse to the same discrete levels

That is why baseline and pro_risk can become identical even though their YAML values differ.
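The collapse is easy to see with a nearest-level quantizer. The sketch below assumes the policy layer snaps exposures to the nearest grid level (the actual rounding rule in the code base may differ), and the `quantize` helper and the 0.90 example value are illustrative, not taken from the bundle.

```python
def quantize(exposure, step):
    """Snap an exposure in [0, 1] to the nearest multiple of `step`."""
    levels = round(1.0 / step)
    return round(exposure * levels) / levels

# Case from the text: both trend settings land on the same level.
coarse_095 = quantize(0.95, 0.25)
coarse_100 = quantize(1.00, 0.25)

# Hypothetical pair of raw exposures on the coarse grid and on a 0.1 grid.
coarse_pair = (quantize(0.90, 0.25), quantize(1.00, 0.25))
fine_pair = (quantize(0.90, 0.1), quantize(1.00, 0.1))

print(coarse_095 == coarse_100)        # identical on the 0.25 grid
print(coarse_pair[0] == coarse_pair[1])
print(fine_pair[0] == fine_pair[1])    # distinguishable on the 0.1 grid
```

A finer grid only moves the collapse to a smaller scale, so comparing raw (pre-quantization) exposure paths between candidates is a useful sanity check either way.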
The frozen WF windows start in 2016 while the supplied PIT starts in 2020. So the first window is skipped, leaving only 2 processed windows. This is too thin for robust selection.
Current calibration score:
utility_total_score - 3*tracking_diff_abs_mean - 20*tracking_error_20_p95 - max_drawdown
On the supplied run:
- max_drawdown is ~0.32 and dominates the score

So the calibration is effectively “pick the lowest cost / smallest MDD” rather than meaningfully trading off return, utility, and tracking.
The overall direction is still valid:
But the current implementation is not yet a true regime system. In practice, it behaves like:
So the global direction is not wrong, but the current bundle is not measuring what it thinks it is measuring.
- model/scores.py, fill NaN at the component level or aggregate with NaN-safe sums.

Current risk-off is likely too eager once hazards start working.
Recommended first pass:
- down_hazard: 0.62 -> 0.70
- stress_score: 0.85 -> 0.95
- 0.72 -> 0.78

Expected impact:
Current repair condition is too easy once repaired hazards become live.
Recommended first pass:
- repair_hazard: 0.58 -> 0.62
- repair stress max: 0.85 -> 0.70
- d_stress <= 0, and add d_trend >= 0
- breadth_score >= 0.00

Expected impact:
Current trend gate is too strict on signal but too weak on persistence.
Recommended first pass:
- trend_score: 0.45 -> 0.30~0.35
- breadth_score: -0.05 -> 0.00 after bug fix
- stress_score: 0.45 -> 0.55

Expected impact:
Current euphoric_late should be delayed, not early.
Recommended first pass:
- crowding_score: 0.70 -> 0.82
- rebound_hazard: 0.68 -> 0.78

Expected impact:
Current symmetric min_state_duration = 3 is too blunt.
Recommended:
- 4
- risk_off: 2-day confirm

Expected impact:
This is one of the biggest practical blockers.
Recommended:
- {0,0.25,0.5,0.75,1.0} with {0,0.1,0.2,...,1.0}

Without this change, many policy experiments are fake because different raw exposures map to the same discrete level.
Current repair exposure is too timid if the goal is upside capture >= 0.60.
Recommended piecewise mapping:
- 0.30
- 0.45
- 0.60
- 0.75

Example:
- repair_hazard in [0.62, 0.70) and breadth_score >= 0.0: 0.45
- repair_hazard in [0.70, 0.80) and d_trend > 0: 0.60
- repair_hazard >= 0.80 and breadth_score > 0.25: 0.75

Trend should be close to full risk unless stress or crowding says otherwise.
Recommended:
- 0.90
- 1.00
- 0.75

Practical formula:
- trend_base = 0.90
- trend_boost = +0.10 if breadth_score > 0.25
- trend_cut = -0.15 if crowding_score > 0.75
- [0.75, 1.00]

This is the lever most directly tied to upside capture in the current broken topology.
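The practical trend formula listed above (base, boost, cut, then the [0.75, 1.00] range) can be written as a small function. This sketch assumes the range is applied as a final clip and that boost and cut can both fire on the same day; the function name is hypothetical.

```python
def trend_exposure(breadth_score, crowding_score):
    """Trend-state exposure: 0.90 base, breadth boost, crowding cut, clipped."""
    exposure = 0.90                 # trend_base
    if breadth_score > 0.25:
        exposure += 0.10            # trend_boost
    if crowding_score > 0.75:
        exposure -= 0.15            # trend_cut
    # Assumed: the [0.75, 1.00] range is enforced as a final clip.
    return min(1.00, max(0.75, exposure))

for breadth, crowding in [(0.0, 0.0), (0.3, 0.5), (0.0, 0.8), (0.3, 0.8)]:
    print(breadth, crowding, round(trend_exposure(breadth, crowding), 2))
```

Note that this only makes sense once breadth_score and crowding_score are non-null; with NaN inputs both comparisons are false and the function silently returns the base level.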
Observed on the supplied run:
- 0.25 produces upside capture around 0.37
- 0.50 lifts upside capture toward 0.54
- 0.75 pushes upside capture above 0.70, but drawdown rises sharply

Recommended target for next round:
- chop = 0.40~0.45 if using continuous exposure
- 0.50 only after fixing state logic

Expected impact:
- max_daily_exposure_change: 0.25 -> 0.35 after quantization removal
- <= 12

Use hard gates first, then a score.
Recommended hard constraints:
- strategy_max_drawdown <= 0.70 * baseline_max_drawdown
- upside_capture >= 0.50 for every valid OOS window
- upside_capture >= 0.55
- positive_window_ratio >= 0.67
- annual_turnover <= 12 unless annual return improves by at least +300 bps

Only candidates that pass hard constraints are ranked.
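A gate-then-rank filter along these lines might look as follows. This is a sketch under assumptions: the `CandidateMetrics` container and its field names are hypothetical, the bare `upside_capture >= 0.55` constraint is read as the aggregate figure across windows, and the +300 bps improvement is measured against the baseline annual return.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CandidateMetrics:
    strategy_max_drawdown: float
    baseline_max_drawdown: float
    window_upside_captures: List[float]  # one value per valid OOS window
    upside_capture: float                # assumed: aggregate across windows
    positive_window_ratio: float
    annual_turnover: float
    annual_return: float
    baseline_annual_return: float        # assumed reference for the +300 bps rule

def failed_gates(m: CandidateMetrics) -> List[str]:
    """Return the names of the hard constraints the candidate violates."""
    failures = []
    if m.strategy_max_drawdown > 0.70 * m.baseline_max_drawdown:
        failures.append("max_drawdown")
    if any(u < 0.50 for u in m.window_upside_captures):
        failures.append("window_upside_capture")
    if m.upside_capture < 0.55:
        failures.append("upside_capture")
    if m.positive_window_ratio < 0.67:
        failures.append("positive_window_ratio")
    return_gain = m.annual_return - m.baseline_annual_return
    if m.annual_turnover > 12 and return_gain < 0.03:  # 300 bps escape hatch
        failures.append("annual_turnover")
    return failures

good = CandidateMetrics(0.20, 0.32, [0.55, 0.62], 0.58, 1.0, 9.0, 0.11, 0.10)
bad = CandidateMetrics(0.30, 0.32, [0.45, 0.62], 0.53, 0.5, 14.0, 0.11, 0.10)
print(failed_gates(good))
print(failed_gates(bad))
```

Returning the list of failed gates, rather than a single boolean, makes it easier to see which constraint is eliminating most candidates in a sweep.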
Recommended selection score:
score = 0.35 * return_ratio + 0.30 * upside_score + 0.20 * dd_score + 0.10 * sharpe_delta_score + 0.05 * stability_score - turnover_penalty
Where:
- return_ratio = clip(strategy_ann / baseline_ann, 0, 1.2)
- upside_score = clip(upside_capture / 0.60, 0, 1.2)
- dd_score = clip((baseline_mdd - strategy_mdd) / baseline_mdd / 0.35, 0, 1.2)
- sharpe_delta_score = clip((strategy_sharpe - baseline_sharpe + 0.10) / 0.20, 0, 1.2)
- stability_score = positive_window_ratio
- turnover_penalty = max(0, annual_turnover - 10) * 0.02

This score is easier to interpret than the current utility-only selection.
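The selection score and its component definitions translate directly into code. The sketch below follows the formulas as stated; the function name and argument order are hypothetical, and it assumes baseline_ann and baseline_mdd are strictly positive (the ratios are undefined otherwise). The example inputs are illustrative, not results from the supplied run.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def selection_score(strategy_ann, baseline_ann, upside_capture,
                    strategy_mdd, baseline_mdd,
                    strategy_sharpe, baseline_sharpe,
                    positive_window_ratio, annual_turnover):
    """Weighted selection score; assumes baseline_ann > 0 and baseline_mdd > 0."""
    return_ratio = clip(strategy_ann / baseline_ann, 0, 1.2)
    upside_score = clip(upside_capture / 0.60, 0, 1.2)
    dd_score = clip((baseline_mdd - strategy_mdd) / baseline_mdd / 0.35, 0, 1.2)
    sharpe_delta_score = clip((strategy_sharpe - baseline_sharpe + 0.10) / 0.20, 0, 1.2)
    stability_score = positive_window_ratio
    turnover_penalty = max(0, annual_turnover - 10) * 0.02
    return (0.35 * return_ratio + 0.30 * upside_score + 0.20 * dd_score
            + 0.10 * sharpe_delta_score + 0.05 * stability_score
            - turnover_penalty)

# Illustrative candidate: matches baseline return and Sharpe, hits the 0.60
# upside target, halves the drawdown, and stays within the turnover budget.
example = selection_score(0.10, 0.10, 0.60, 0.16, 0.32, 0.8, 0.8, 1.0, 10.0)
print(round(example, 4))
```

Because every component is clipped to at most 1.2, no single metric can contribute more than 1.2 times its weight, which is what keeps drawdown from dominating the way it does in the current calibration score.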
The current formula is dominated by -max_drawdown, not by tracking penalties.
Use when execution assumptions are still approximate.
calib_A = utility_total_score + 0.40*annual_return + 0.20*upside_capture - 0.60*max_drawdown - 5*tracking_error_20_p95 - 1.5*tracking_diff_abs_mean
Use only when execution model is already close to production.
calib_B = utility_total_score + 0.30*sharpe - 0.40*max_drawdown - 2*max(0, tracking_error_20_p95 - 0.003) - 1*max(0, tracking_diff_abs_mean - 0.001)
This introduces tolerance bands so tiny tracking differences do not dominate selection.
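The difference between the two calibration variants is easiest to see side by side. This is a sketch that transcribes calib_A and calib_B as written; the function names and the example metric values are hypothetical and do not come from the supplied run.

```python
def calib_a(utility_total_score, annual_return, upside_capture,
            max_drawdown, tracking_error_20_p95, tracking_diff_abs_mean):
    """Variant A: linear tracking penalties, softer drawdown weight."""
    return (utility_total_score + 0.40 * annual_return + 0.20 * upside_capture
            - 0.60 * max_drawdown - 5 * tracking_error_20_p95
            - 1.5 * tracking_diff_abs_mean)

def calib_b(utility_total_score, sharpe, max_drawdown,
            tracking_error_20_p95, tracking_diff_abs_mean):
    """Variant B: tracking is penalised only beyond the tolerance bands."""
    te_excess = max(0.0, tracking_error_20_p95 - 0.003)
    td_excess = max(0.0, tracking_diff_abs_mean - 0.001)
    return (utility_total_score + 0.30 * sharpe - 0.40 * max_drawdown
            - 2 * te_excess - 1 * td_excess)

# Two hypothetical candidates that differ only in tracking error.
inside_band = calib_b(1.0, 1.0, 0.32, 0.002, 0.0005)
outside_band = calib_b(1.0, 1.0, 0.32, 0.005, 0.0005)
linear = calib_a(1.0, 0.10, 0.60, 0.32, 0.002, 0.001)
print(inside_band, outside_band, linear)
```

Inside the bands the tracking terms contribute exactly zero to calib_B, so two candidates with tracking error of 0.001 and 0.002 are ranked purely on utility, Sharpe, and drawdown.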
Given current data start in 2020, do not pretend you have 2016 windows.
Recommended:
Example:
For every candidate, record:
- breadth_score and crowding_score non-null ratio > 95%; hazards not stuck at 0.5
- down_hazard 0.70, stress 0.95, stronger crash override
- repair_hazard 0.62, breadth >= 0, d_trend >= 0, lower repair stress ceiling

The core direction is valid, but the current bundle is still partly a false negative on offense because two of the most important score channels are effectively broken and the policy search space is partly collapsed.
Fix the integrity issues first. After that, the most likely path to materially better upside capture is: