AI Trading Platform Update: Building a Bulletproof Evaluation Framework
What Changed Since December 9th?
In our last post, we celebrated improving our win rate from 16.6% to 20.1%. But we had a nagging question: Are these results real, or are we fooling ourselves?
The past two days were spent answering that question. Spoiler: We were partially fooling ourselves.
The Problem We Discovered
Our previous backtesting approach had subtle issues:
- No out-of-sample testing - We tested on the same data we trained on
- Missing costs - We ignored trading fees, spread, and slippage
- Small samples - We made decisions based on 10-20 trades
- No reproducibility - Results varied between runs
What We Built (Dec 9-11)
1. Walk-Forward Evaluation System
We built a complete time-series evaluation framework that guarantees no data leakage:
```
13 independent evaluation windows
168 true out-of-sample trades
Zero overlap between training and testing data
```
The system rolls through time, training only on past data and testing on future data—just like real trading.
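For intuition, here is a minimal sketch of how such rolling windows can be generated. The window lengths and names are illustrative only, not the actual walk_forward.py API:

```python
from dataclasses import dataclass

@dataclass
class Window:
    train_start: int  # first training bar (inclusive)
    train_end: int    # exclusive; the test window starts here
    test_end: int     # exclusive

def walk_forward_windows(n_bars: int, train_len: int = 500, test_len: int = 50):
    """Yield rolling windows where every test bar lies strictly after its training data."""
    start = 0
    while start + train_len + test_len <= n_bars:
        yield Window(start, start + train_len, start + train_len + test_len)
        start += test_len  # step by the test length so no OOS bar is ever reused

# Training slice = past only, test slice = future only, for every window.
for w in walk_forward_windows(n_bars=1200):
    train = slice(w.train_start, w.train_end)
    test = slice(w.train_end, w.test_end)
```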
2. Realistic Cost Model
Every trade now accounts for real trading costs:
| Asset | Fees | Spread | Slippage | Total |
|---|---|---|---|---|
| Stock | 0.10% | 0.05% | 0.05% | ~0.20% |
| Crypto | 0.20% | 0.10% | 0.15% | ~0.45% |
| Forex | 0.02% | 0.08% | 0.03% | ~0.13% |
The backtester now tracks both gross and net PnL separately.
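To make the impact concrete, here's a toy version of the cost adjustment, treating the Total column above as the all-in cost per trade. The names here are illustrative, not the backtester's actual API:

```python
# All-in cost per trade, in percent, taken from the table above.
TOTAL_COST_PCT = {"stock": 0.20, "crypto": 0.45, "forex": 0.13}

def net_pnl_pct(gross_pnl_pct: float, asset: str) -> float:
    """Net return = gross return minus the estimated all-in trading cost."""
    return gross_pnl_pct - TOTAL_COST_PCT[asset]

# A crypto trade that gains 0.40% gross ends up slightly negative after ~0.45% of costs.
print(net_pnl_pct(0.40, "crypto"))  # ≈ -0.05
```

That last line is the whole story of this update in miniature: a trade that looks like a winner gross can be a loser net.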
3. Minimum Trade Guards
The system now refuses to report statistics without enough data:
```python
MIN_TRADES_FOR_STATS = 30       # Basic statistics
MIN_TRADES_FOR_CONFIDENCE = 50  # Statistical significance
```
No more optimizing based on 15 trades.
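In practice the guard is just an early exit before any statistics get computed. A minimal sketch (the function and result fields are illustrative, not the real module's interface):

```python
MIN_TRADES_FOR_STATS = 30       # repeated from the constants above
MIN_TRADES_FOR_CONFIDENCE = 50

def summarize(trade_returns: list[float]) -> dict:
    """Refuse to report statistics when the sample is too small to mean anything."""
    n = len(trade_returns)
    if n < MIN_TRADES_FOR_STATS:
        return {"status": "insufficient_data", "n_trades": n}
    wins = sum(1 for r in trade_returns if r > 0)
    return {
        "status": "ok",
        "n_trades": n,
        "win_rate": wins / n,
        "statistically_meaningful": n >= MIN_TRADES_FOR_CONFIDENCE,
    }
```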
4. Comprehensive Metrics Suite
New metrics module with:
- Wilson confidence intervals for win rate (sketched below)
- Expectancy calculation
- Profit Factor
- Sharpe and Sortino ratios
- Maximum drawdown tracking
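The Wilson interval is the one worth calling out: it behaves sensibly even at small sample sizes, which is exactly where naive win-rate estimates mislead. Here's a minimal sketch of the calculation (not necessarily line-for-line what metrics.py does; the 35-win example is just the implied win count at a 20.8% rate over 168 trades):

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a win rate (z = 1.96)."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# 35 wins out of 168 OOS trades (≈ 20.8%) -> roughly a 15%–28% interval.
print(wilson_interval(35, 168))
```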
5. Reproducibility Framework
Every experiment now logs:
- Git commit hash and branch
- Random seeds (Python, NumPy, PyTorch)
- Full configuration
- Results with timestamps
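Conceptually, that boils down to seeding every RNG we rely on and writing one self-describing record per run. A rough sketch of the idea (not the actual reproducibility.py interface; the experiments.jsonl file name is made up):

```python
import json, random, subprocess, time
import numpy as np
import torch

def set_seeds(seed: int = 42) -> None:
    """Seed every RNG in play so a run can be repeated exactly."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def log_experiment(config: dict, results: dict, path: str = "experiments.jsonl") -> None:
    """Append one self-describing record per experiment run."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
        "git_branch": subprocess.check_output(["git", "rev-parse", "--abbrev-ref", "HEAD"]).decode().strip(),
        "config": config,
        "results": results,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```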
6. Parameter Optimization Tools
- Stop-Loss Sweep - Test ATR multipliers systematically (see the sketch after this list)
- Regime Detection - Classify market conditions
- Ablation Harness - Compare model configurations fairly
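For the stop-loss sweep, the shape of the tool is roughly this. Here run_backtest is a stand-in for whatever actually produces net per-trade returns for a given ATR multiplier; it is not the real stop_loss_sweep.py API:

```python
def sweep_atr_multipliers(run_backtest, multipliers=(1.0, 1.5, 2.0, 2.5, 3.0)):
    """Re-run the same backtest for each ATR stop multiplier and rank by profit factor."""
    results = {}
    for m in multipliers:
        returns = run_backtest(atr_multiplier=m)   # net % return per trade
        wins = sum(r for r in returns if r > 0)
        losses = -sum(r for r in returns if r < 0)
        results[m] = wins / losses if losses else float("inf")
    # Best multiplier first; the winner still has to be validated out-of-sample.
    return dict(sorted(results.items(), key=lambda kv: kv[1], reverse=True))
```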
7. Database Migration
Added new fields to track cost data:
```
result_pct_gross  – PnL before costs
costs_json        – Cost breakdown
exit_reason       – stopped / target / timeout
bars_held         – Trade duration
```
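For illustration, a migration like this can be applied with plain ALTER TABLE statements. The sketch below assumes SQLite and a table named trades; both are guesses about the actual setup, so treat it as a template rather than the real migration script:

```python
import sqlite3

# Assumed table name ("trades") and SQLite backend; the real schema may differ.
MIGRATION_STATEMENTS = [
    "ALTER TABLE trades ADD COLUMN result_pct_gross REAL",  # PnL before costs
    "ALTER TABLE trades ADD COLUMN costs_json TEXT",        # cost breakdown as JSON
    "ALTER TABLE trades ADD COLUMN exit_reason TEXT",       # stopped / target / timeout
    "ALTER TABLE trades ADD COLUMN bars_held INTEGER",      # trade duration in bars
]

def migrate(db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        for stmt in MIGRATION_STATEMENTS:
            conn.execute(stmt)
```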
The Honest Results
With proper out-of-sample evaluation:
| Metric | Previous Claim | Actual OOS |
|---|---|---|
| Win Rate | 20.1% | 20.8% |
| Profit Factor | ~1.2 (estimated) | 0.92 |
| Sharpe Ratio | Not measured | -0.46 |
| Total OOS Trades | - | 168 |
The truth: While our win rate held up, we’re actually losing money when costs are included.
New Modules Created
```
src/
├── reproducibility.py   # Seed management, experiment logging
├── metrics.py           # Trading metrics with confidence intervals
├── walk_forward.py      # Time-series evaluation pipeline
├── ablation.py          # Model comparison harness
├── regime.py            # Market regime classification
└── stop_loss_sweep.py   # ATR parameter optimization

tests/
└── test_core.py         # 20 unit tests (all passing)
```
Tests: All Passing
```bash
$ pytest tests/test_core.py -v
======================== 20 passed in 1.60s ========================
```
Tests cover:
- Cost model calculations
- Wilson confidence intervals
- Walk-forward window generation
- OOS data isolation
- Trade result conversion
What This Means
Bad news: We’re not profitable yet (Profit Factor 0.92).
Good news: We now have the tools to find real edge:
- Every future improvement will be validated out-of-sample
- Costs are included from the start
- Results are reproducible
- Statistics are meaningful
Next Steps
- Improve signal generation (current 20.8% win rate needs to reach ~33% for 2:1 R/R profitability; see the quick math after this list)
- Test different indicator combinations
- Filter by market regime
- Explore longer time horizons
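The ~33% figure in the first item is just the breakeven point of the expectancy formula: with a 2:1 reward-to-risk ratio a win pays 2 units and a loss costs 1, so expectancy hits zero when p × 2 = (1 − p) × 1:

```python
def breakeven_win_rate(reward_to_risk: float) -> float:
    """Win rate p where expectancy p*R - (1 - p) is exactly zero."""
    return 1 / (1 + reward_to_risk)

print(breakeven_win_rate(2.0))  # -> 0.333..., i.e. ~33% needed before costs
```

Trading costs push the real hurdle a little above 33%, which is exactly why the net numbers are the ones we track.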
Learn more about the platform: AI Trading Platform
Building a profitable trading system is hard. Building an honest evaluation framework is the first step to knowing if you’ve actually succeeded.