by ABXK.AI AI Trading

AI Trading Platform Update: Building a Bulletproof Evaluation Framework

Tags: machine-learning, trading, backtesting, python, evaluation

What Changed Since December 9th?

In our last post, we celebrated improving our win rate from 16.6% to 20.1%. But we had a nagging question: Are these results real, or are we fooling ourselves?

The past two days were spent answering that question. Spoiler: We were partially fooling ourselves.

The Problem We Discovered

Our previous backtesting approach had subtle issues:

  1. No out-of-sample testing - We tested on the same data we trained on
  2. Missing costs - We ignored trading fees, spread, and slippage
  3. Small samples - We made decisions based on 10-20 trades
  4. No reproducibility - Results varied between runs

What We Built (Dec 9-11)

1. Walk-Forward Evaluation System

We built a complete time-series evaluation framework that guarantees no data leakage:

```
13 independent evaluation windows
168 true out-of-sample trades
Zero overlap between training and testing data
```

The system rolls through time, training only on past data and testing on future data—just like real trading.
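As a rough illustration of the rolling scheme (not the platform's actual `walk_forward.py`), the sketch below generates index windows in which each test slice starts exactly where its training slice ends. The names `Window` and `walk_forward_windows` and the bar counts are hypothetical; the sizes were picked only so that the example produces 13 windows.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Window:
    train_start: int  # inclusive bar index
    train_end: int    # exclusive; also the first test bar
    test_end: int     # exclusive


def walk_forward_windows(n_bars: int, train_size: int, test_size: int) -> Iterator[Window]:
    """Yield rolling windows; within each window the test slice lies strictly
    after the training slice, and test slices never overlap each other."""
    start = 0
    while start + train_size + test_size <= n_bars:
        train_end = start + train_size
        yield Window(start, train_end, train_end + test_size)
        start += test_size  # step by the test length so test slices do not overlap


# Illustrative sizes only; the real window lengths are not stated in this post.
windows = list(walk_forward_windows(n_bars=1000, train_size=350, test_size=50))
print(len(windows), windows[0], windows[-1])
```

Stepping by the test length means every bar after the first training span is tested exactly once, and always by a model that has only seen earlier bars.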

2. Realistic Cost Model

Every trade now accounts for real trading costs:

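| Asset  | Fees  | Spread | Slippage | Total  |
|--------|-------|--------|----------|--------|
| Stock  | 0.10% | 0.05%  | 0.05%    | ~0.20% |
| Crypto | 0.20% | 0.10%  | 0.15%    | ~0.45% |
| Forex  | 0.02% | 0.08%  | 0.03%    | ~0.13% |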

The backtester now tracks both gross and net PnL separately.
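A minimal sketch of how gross and net PnL could be separated, using the percentages from the cost table. The function name `net_result_pct` is hypothetical, and the sketch assumes each total is charged once per trade; the post does not say whether the figures are per side or round trip.

```python
# Cost assumptions per asset class, in percent of trade value (from the table).
COSTS_PCT = {
    "stock": {"fees": 0.10, "spread": 0.05, "slippage": 0.05},
    "crypto": {"fees": 0.20, "spread": 0.10, "slippage": 0.15},
    "forex": {"fees": 0.02, "spread": 0.08, "slippage": 0.03},
}


def net_result_pct(gross_pct: float, asset: str) -> tuple[float, dict[str, float]]:
    """Return (net PnL in percent, cost breakdown) for a single trade.

    Assumption: the summed cost is subtracted once per trade.
    """
    breakdown = COSTS_PCT[asset]
    total_cost = sum(breakdown.values())
    return gross_pct - total_cost, breakdown


net, costs = net_result_pct(gross_pct=1.00, asset="crypto")
print(f"gross=1.00% net={net:.2f}% costs={costs}")
```

Keeping the breakdown next to the net figure makes it easy to see which cost component eats the most edge for a given asset class.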

3. Minimum Trade Guards

The system now refuses to report statistics without enough data:

```python
MIN_TRADES_FOR_STATS = 30       # Basic statistics
MIN_TRADES_FOR_CONFIDENCE = 50  # Statistical significance
```

No more optimizing based on 15 trades.
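One way such a guard might look in practice is sketched below. The helper `guarded_win_rate` is hypothetical, and returning `None` rather than raising an exception is a design choice made for this example, not something the post specifies.

```python
from typing import Optional, Sequence

MIN_TRADES_FOR_STATS = 30       # Basic statistics
MIN_TRADES_FOR_CONFIDENCE = 50  # Statistical significance


def guarded_win_rate(results_pct: Sequence[float]) -> Optional[float]:
    """Return the win rate, or None when there are too few trades to report it."""
    if len(results_pct) < MIN_TRADES_FOR_STATS:
        return None  # refuse to report a statistic on a tiny sample
    wins = sum(1 for r in results_pct if r > 0)
    return wins / len(results_pct)


print(guarded_win_rate([1.0] * 15))                # too few trades
print(guarded_win_rate([1.0] * 10 + [-1.0] * 30))  # 40 trades, 10 winners
```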

4. Comprehensive Metrics Suite

New metrics module with:

  • Wilson confidence intervals for win rate
  • Expectancy calculation
  • Profit Factor
  • Sharpe and Sortino ratios
  • Maximum drawdown tracking
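For readers unfamiliar with the first three metrics, here is a small standard-library sketch using their textbook definitions. It is not the platform's `metrics.py`; the function names and the 95% default (`z=1.96`) are assumptions made for this example.

```python
import math
from typing import Sequence


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("need at least one trade")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def expectancy(results: Sequence[float]) -> float:
    """Average result per trade, in the same unit as the inputs."""
    return sum(results) / len(results)


def profit_factor(results: Sequence[float]) -> float:
    """Sum of winning results divided by the absolute sum of losing results."""
    gross_profit = sum(r for r in results if r > 0)
    gross_loss = -sum(r for r in results if r < 0)
    return math.inf if gross_loss == 0 else gross_profit / gross_loss


# Assumption: 35 winners out of 168 trades, which rounds to a 20.8% win rate.
low, high = wilson_interval(wins=35, n=168)
print(f"win rate 95% CI: {low:.1%} to {high:.1%}")
```

Under that 35-of-168 assumption the interval spans roughly 15% to 28%, so a point estimate of about 21% is still far from precise.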

5. Reproducibility Framework

Every experiment now logs:

  • Git commit hash and branch
  • Random seeds (Python, NumPy, PyTorch)
  • Full configuration
  • Results with timestamps
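The logged items could be captured along the following lines. This is a sketch under assumptions: `set_seeds`, `git_info`, and `experiment_record` are hypothetical names, NumPy and PyTorch are seeded only if they are installed, and the JSON layout is invented for illustration.

```python
import json
import random
import subprocess
from datetime import datetime, timezone


def set_seeds(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def git_info() -> dict:
    """Return the current commit hash and branch, or 'unknown' outside a repo."""
    def run(*args: str) -> str:
        try:
            out = subprocess.check_output(
                ["git", *args], text=True, stderr=subprocess.DEVNULL
            )
            return out.strip()
        except (subprocess.CalledProcessError, FileNotFoundError):
            return "unknown"

    return {
        "commit": run("rev-parse", "HEAD"),
        "branch": run("rev-parse", "--abbrev-ref", "HEAD"),
    }


def experiment_record(config: dict, results: dict, seed: int) -> str:
    """Bundle everything needed to rerun an experiment into one JSON string."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git": git_info(),
        "seed": seed,
        "config": config,
        "results": results,
    }
    return json.dumps(record, indent=2, sort_keys=True)


set_seeds(42)
print(experiment_record({"atr_multiplier": 2.0}, {"profit_factor": 0.92}, seed=42))
```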

6. Parameter Optimization Tools

  • Stop-Loss Sweep - Test ATR multipliers systematically
  • Regime Detection - Classify market conditions
  • Ablation Harness - Compare model configurations fairly
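To make the stop-loss sweep concrete, the sketch below computes candidate stop prices for a long entry at several ATR multipliers. It is only an illustration: the helper names are hypothetical, the ATR here is a plain mean of recent true ranges rather than Wilder's smoothed version, and the post does not say which variant or which multipliers the platform uses.

```python
from typing import Sequence


def true_ranges(high: Sequence[float], low: Sequence[float],
                close: Sequence[float]) -> list[float]:
    """True range per bar, starting at the second bar (needs the previous close)."""
    return [
        max(high[i] - low[i],
            abs(high[i] - close[i - 1]),
            abs(low[i] - close[i - 1]))
        for i in range(1, len(close))
    ]


def simple_atr(high: Sequence[float], low: Sequence[float],
               close: Sequence[float], period: int = 14) -> float:
    """Plain mean of the last `period` true ranges (not Wilder's smoothing)."""
    tr = true_ranges(high, low, close)
    if len(tr) < period:
        raise ValueError("not enough bars for the requested ATR period")
    return sum(tr[-period:]) / period


def long_stop_levels(entry: float, atr: float,
                     multipliers: Sequence[float]) -> dict[float, float]:
    """Stop price for a long position at each ATR multiplier."""
    return {m: entry - m * atr for m in multipliers}


atr = simple_atr(high=[10, 11, 12], low=[9, 10, 11], close=[9.5, 10.5, 11.5], period=2)
print(long_stop_levels(entry=100.0, atr=atr, multipliers=[1.0, 1.5, 2.0]))
```

A sweep would then rerun the backtest once per multiplier and compare the out-of-sample metrics for each.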

7. Database Migration

Added new fields to track cost data:

```sql
result_pct_gross  -- PnL before costs
costs_json        -- Cost breakdown
exit_reason       -- stopped/target/timeout
bars_held         -- Trade duration
```
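A migration of this kind could be sketched with Python's built-in `sqlite3` module as shown below. The database engine, the `trades` table name, the pre-existing `result_pct` column, and the column types are all assumptions for illustration; only the four new field names come from the listing above.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only

# Hypothetical pre-migration table.
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, result_pct REAL)")

# Add the four new fields; the SQL types are assumed.
new_columns = [
    ("result_pct_gross", "REAL"),   # PnL before costs
    ("costs_json", "TEXT"),         # cost breakdown, serialized as JSON
    ("exit_reason", "TEXT"),        # stopped/target/timeout
    ("bars_held", "INTEGER"),       # trade duration in bars
]
for name, sql_type in new_columns:
    conn.execute(f"ALTER TABLE trades ADD COLUMN {name} {sql_type}")

costs = {"fees": 0.10, "spread": 0.05, "slippage": 0.05}
conn.execute(
    "INSERT INTO trades (result_pct, result_pct_gross, costs_json, exit_reason, bars_held) "
    "VALUES (?, ?, ?, ?, ?)",
    (0.80, 1.00, json.dumps(costs), "target", 12),
)
row = conn.execute(
    "SELECT result_pct_gross, exit_reason, bars_held FROM trades"
).fetchone()
print(row)
```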

The Honest Results

With proper out-of-sample evaluation:

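| Metric           | Previous Claim   | Actual OOS |
|------------------|------------------|------------|
| Win Rate         | 20.1%            | 20.8%      |
| Profit Factor    | ~1.2 (estimated) | 0.92       |
| Sharpe Ratio     | Not measured     | -0.46      |
| Total OOS Trades | -                | 168        |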

The truth: Our win rate held up, but once costs are included we’re actually losing money. A Profit Factor of 0.92 means the strategy makes about $0.92 for every $1.00 it loses.

New Modules Created

```
src/
├── reproducibility.py   # Seed management, experiment logging
├── metrics.py           # Trading metrics with confidence intervals
├── walk_forward.py      # Time-series evaluation pipeline
├── ablation.py          # Model comparison harness
├── regime.py            # Market regime classification
└── stop_loss_sweep.py   # ATR parameter optimization

tests/
└── test_core.py         # 20 unit tests (all passing)
```

Tests: All Passing

```bash
$ pytest tests/test_core.py -v
======================== 20 passed in 1.60s ========================
```

Tests cover:

  • Cost model calculations
  • Wilson confidence intervals
  • Walk-forward window generation
  • OOS data isolation
  • Trade result conversion

What This Means

Bad news: We’re not profitable yet (Profit Factor 0.92).

Good news: We now have the tools to find real edge:

  1. Every future improvement will be validated out-of-sample
  2. Costs are included from the start
  3. Results are reproducible
  4. Statistics are meaningful

Next Steps

  1. Improve signal generation (the current 20.8% win rate needs to exceed ~33%, the break-even point at a 2:1 R/R before costs)
  2. Test different indicator combinations
  3. Filter by market regime
  4. Explore longer time horizons
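The ~33% figure in the first step follows from setting expectancy to zero: with win rate p and reward-to-risk ratio R, p·R = (1 − p), so p = 1 / (1 + R). The sketch below also adds an optional per-trade cost expressed in units of risk; the function name and the cost parameter are illustrative assumptions, not part of the platform.

```python
def breakeven_win_rate(reward_risk: float, cost_in_r: float = 0.0) -> float:
    """Win rate at which expectancy is exactly zero.

    Each win pays (reward_risk - cost_in_r) units of risk and each loss
    costs (1 + cost_in_r) units, so p = (1 + cost_in_r) / (1 + reward_risk).
    """
    if reward_risk <= 0:
        raise ValueError("reward_risk must be positive")
    return (1.0 + cost_in_r) / (1.0 + reward_risk)


print(f"{breakeven_win_rate(2.0):.1%}")                 # 2:1 R/R, no costs
print(f"{breakeven_win_rate(2.0, cost_in_r=0.1):.1%}")  # same, costs = 0.1R per trade
```

Any cost pushes the break-even win rate above one third, so ~33% is a floor rather than a target.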

Learn more about the platform: AI Trading Platform


Building a profitable trading system is hard. Building an honest evaluation framework is the first step to knowing if you’ve actually succeeded.