
April 15, 2026 | 5 min read | By Rob

Your Backtest Is Lying to You

APEX · Trading · Quantitative Finance

In 2020, Hou, Xue, and Zhang ran a systematic replication of 452 stock market anomalies from the academic literature. Sixty-five percent of them failed to replicate. It is one of the most important results in quantitative finance of the last decade, and it almost never appears in trading software marketing.

Harvey, Liu, and Zhu made a related point in 2016: the bar for statistical significance in academic finance is far too low. When researchers test hundreds of factors, some will look significant by chance alone. Most published factors do not survive multiple-testing correction.
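To make the multiple-testing problem concrete, here is a minimal sketch, with entirely synthetic data and numbers that are not from either paper, of what happens when hundreds of meaningless signals are each held to the usual single-test significance bar:

```python
# Sketch: how many "significant" signals appear by chance alone.
# All data here is synthetic noise; no real edge exists anywhere.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_days, n_signals = 2500, 300          # ~10 years of daily data, 300 candidate signals
returns = rng.normal(0, 0.01, n_days)  # daily returns: pure noise

false_positives = 0
for _ in range(n_signals):
    signal = rng.normal(0, 1, n_days)            # a meaningless candidate signal
    _, p_value = stats.pearsonr(signal, returns)
    if p_value < 0.05:                           # the conventional single-test threshold
        false_positives += 1

print(f"{false_positives} of {n_signals} random signals look 'significant' at p < 0.05")
# Roughly 5% clear the bar despite having no edge at all -- which is exactly
# why a factor that merely passes p < 0.05 proves very little.
```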

The implications for any AI-augmented trading system are direct. The null hypothesis for a new signal should be: this does not work live. The burden of proof sits on the signal, not on skepticism. A strong backtest is, by default, evidence of a well-documented failure mode — survivorship bias, look-ahead bias, overfitting, or chance — before it is evidence of real edge.

The failure modes compound on each other. Survivorship bias is pervasive: we see the strategies that worked, not the ones that were tried and abandoned. Look-ahead bias is endemic to ML feature engineering, where the construction process quietly leaks data that would not have been available at trade time. Transaction cost neglect is nearly universal in academic backtests, and real-world slippage consistently drags live performance below the backtest curve.
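As one illustration of how quietly look-ahead bias creeps into feature engineering, here is a hedged sketch comparing a feature normalized with full-sample statistics against a point-in-time version. The data and variable names are hypothetical; the point is only the difference between the two constructions:

```python
# Sketch of a common look-ahead leak: scaling a feature with full-sample statistics.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 1000))))
feature = prices.pct_change()

# Leaky: mean and std are computed over the WHOLE history, including future bars.
leaky = (feature - feature.mean()) / feature.std()

# Point-in-time: each bar is scaled using only data available up to that bar.
expanding_mean = feature.expanding(min_periods=50).mean()
expanding_std = feature.expanding(min_periods=50).std()
point_in_time = (feature - expanding_mean) / expanding_std

# The two versions diverge most early in the sample -- exactly where the leaky
# feature quietly "knows" statistics from years it has not yet seen.
print((leaky - point_in_time).abs().describe())
```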

Alpha decay adds a time dimension. Even for signals that are real, AQR research finds excess returns typically decay within 6 to 18 months of a strategy becoming known or widely adopted. A signal that worked from 2015 to 2020 may already have been arbitraged away.

Machine learning makes all of this worse. The feature-to-observation ratio in financial time series is brutally high — ten years of daily data is at most 2,500 observations, and adding 50 features gives the optimizer enormous room to fit noise. The backtest looks clean. The live P&L does not.
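A toy demonstration of that feature-to-observation problem, with entirely synthetic data and round numbers chosen to match the paragraph above, not drawn from any real strategy:

```python
# Sketch: 50 noise features fit to ~10 years of synthetic daily returns.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_obs, n_features = 2500, 50
X = rng.normal(size=(n_obs, n_features))   # candidate features: pure noise
y = rng.normal(0, 0.01, n_obs)             # daily returns: also pure noise

split = 2000                               # first 8 years in-sample, last 2 held out
model = LinearRegression().fit(X[:split], y[:split])

print("in-sample R^2:     ", round(model.score(X[:split], y[:split]), 4))
print("out-of-sample R^2: ", round(model.score(X[split:], y[split:]), 4))
# In-sample R^2 comes out positive because the optimizer fits noise; out-of-sample
# R^2 hovers around zero or goes negative. The clean backtest, the ugly live P&L.
```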

None of this means quantitative trading is impossible or that ML has no role in financial systems. A handful of factors — low volatility, value, momentum — have survived decades of scrutiny. The point is that the default posture should be skepticism, the validation standard should be strict out-of-sample testing, and any system that leads with backtested returns deserves a careful second read.
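Mechanically, "strict out-of-sample testing" usually means some form of walk-forward evaluation. The sketch below is a generic illustration of the splitting logic, not a description of any particular system:

```python
# Sketch of a walk-forward split: every parameter choice is made on the train
# window, scored on the untouched test window, and only out-of-sample scores count.
import numpy as np

def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train_idx, test_idx) windows that never look past the training end."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += test_size                  # roll forward by one test window

for train_idx, test_idx in walk_forward_splits(n_obs=2500, train_size=1000, test_size=250):
    print(f"train {train_idx[0]}-{train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}")
```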

For APEX and NOVA, the discipline is straightforward: validation requires out-of-sample performance, simulated trading runs before live capital touches the system, and live results are tracked explicitly against model expectations. When the two diverge, we treat the gap as signal, not noise, as sketched below.
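To be clear, the snippet below is not the APEX or NOVA implementation; it is only a hypothetical sketch of what tracking live results against model expectations can mean in practice, with made-up numbers throughout:

```python
# Hypothetical sketch: flag when realised live performance drifts outside
# what the backtest implied about mean return and volatility.
import numpy as np

def divergence_check(live_returns, expected_mean, expected_vol, z_threshold=2.0):
    """Return a z-score for live mean return vs. the backtest's expectation."""
    live_returns = np.asarray(live_returns)
    n = len(live_returns)
    realised_mean = live_returns.mean()
    # Standard error of the mean under the backtest's own volatility assumption.
    z = (realised_mean - expected_mean) / (expected_vol / np.sqrt(n))
    return z, abs(z) > z_threshold

# Made-up numbers: backtest promised 4 bps/day at 1% daily vol; live P&L has no edge.
rng = np.random.default_rng(3)
live = rng.normal(0.0, 0.01, 120)          # six months of simulated live returns
z, diverged = divergence_check(live, expected_mean=0.0004, expected_vol=0.01)
print(f"z = {z:.2f}, treat divergence as signal: {diverged}")
```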

Questions about this? Want to discuss your project?

Book a free scoping call →