A production-grade machine learning system that ingests 9+ years of football data, engineers predictive features, and generates match outcome predictions across Europe's top 5 leagues.
Complete system walkthrough: data pipeline, ML architecture, and results
What started as a data pipeline evolved into a complete prediction system. The question shifted from "can historical football data reveal patterns?" to "can we actually beat the market?"
The answer required three things: clean data, rigorous backtesting, and honest out-of-sample validation. I built all three. The system now covers the complete workflow: data ingestion from multiple sources, feature engineering with 42 metrics per match, ensemble machine learning models, walk-forward backtesting, and live prediction generation.
Pull match data from FBref, Understat, and Football-Data.co.uk with rate limiting and error handling.
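A minimal sketch of what rate limiting with error handling can look like, using only the Python standard library. The names `RateLimiter`, `http_get`, and `fetch_with_retry` are hypothetical, and the interval, retry count, and backoff values are illustrative assumptions rather than the settings used in this project.

```python
import time
import urllib.request
from urllib.error import URLError


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        # Sleep only for the part of the interval that has not yet elapsed.
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()


def http_get(url, timeout=30):
    """Default fetcher: plain GET returning the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(url, limiter, fetch=http_get, retries=3, backoff=2.0,
                     sleep=time.sleep):
    """Fetch a URL, retrying network errors with exponential backoff."""
    for attempt in range(retries + 1):
        limiter.wait()
        try:
            return fetch(url)
        except (URLError, TimeoutError):
            if attempt == retries:
                raise  # Out of retries: surface the error to the caller.
            sleep(backoff * (2 ** attempt))
```

Passing `fetch`, `clock`, and `sleep` as parameters keeps the retry logic testable without touching the network or actually waiting.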
Calculate 42 features per match: rolling form, head-to-head, xG metrics, league position, and market-derived signals.
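As an illustration of one such feature, the sketch below computes rolling form as mean points per game over a team's previous matches. The key detail is that form is read before the current result is recorded, so a match never contributes to its own features. The function names, the dictionary keys, and the five-match window are assumptions for the example, not the project's actual schema.

```python
from collections import defaultdict, deque


def points_for(goals_for, goals_against):
    """League points earned from a single result (3 win, 1 draw, 0 loss)."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0


def _mean(values):
    return sum(values) / len(values) if values else None


def add_rolling_form(matches, window=5):
    """Add home_form and away_form to each match.

    `matches` must be sorted chronologically. Form is the mean points per
    game over each team's previous `window` matches, or None with no history.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    enriched = []
    for match in matches:
        home, away = match["home"], match["away"]
        row = dict(match)
        # Read form BEFORE recording this result to avoid look-ahead bias.
        row["home_form"] = _mean(history[home])
        row["away_form"] = _mean(history[away])
        enriched.append(row)
        history[home].append(points_for(match["home_goals"], match["away_goals"]))
        history[away].append(points_for(match["away_goals"], match["home_goals"]))
    return enriched
```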
An XGBoost + Random Forest ensemble with calibrated probabilities. The blend weights are optimised automatically to find the best-performing combination.
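One simple way to optimise a two-model blend is a grid search over the weight that minimises log loss on held-out data. The sketch below assumes a binary target (for example, home win or not) and takes the two models' probabilities as plain lists. The optimisation method actually used in this system is not described here, so treat the grid search and the names `log_loss`, `blend`, and `best_blend_weight` as illustrative.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def blend(p_a, p_b, w):
    """Weighted average of two probability lists: w * p_a + (1 - w) * p_b."""
    return [w * a + (1 - w) * b for a, b in zip(p_a, p_b)]


def best_blend_weight(y_true, p_a, p_b, steps=20):
    """Return the weight on model A (0 to 1) that minimises log loss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: log_loss(y_true, blend(p_a, p_b, w)))
```

The weight should be chosen on a validation slice that precedes the test period, otherwise the blend itself leaks information from the test set.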
True out-of-sample testing: train on pre-2023 data, test on 2023-2026. No feature leakage, no look-ahead bias.
Match probabilities, confidence scores, and accumulator candidates with selection criteria filtering.
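For accumulator candidates, the arithmetic is a product over the legs: combined decimal odds are the product of the individual odds, and the combined win probability is the product of the individual probabilities if the legs are treated as independent. The sketch below makes that independence assumption explicit; the function name and the `(probability, odds)` tuple layout are hypothetical.

```python
from math import prod


def accumulator(legs):
    """Combine legs given as (model_probability, decimal_odds) tuples.

    Returns (combined_odds, combined_probability, expected_value_per_unit).
    Assumes the legs are independent, which is a simplification.
    """
    combined_odds = prod(odds for _, odds in legs)
    combined_prob = prod(prob for prob, _ in legs)
    expected_value = combined_prob * combined_odds - 1.0
    return combined_odds, combined_prob, expected_value
```

Because every leg must win, the combined probability falls quickly as legs are added, even when each leg is a strong favourite.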
100 backtest iterations testing different strategies and parameters. All results from true out-of-sample testing.
| Strategy | Accuracy | ROI |
|---|---|---|
| Walk-Forward Baseline | 72.0% | +6.5% |
| Exclude Bundesliga | 75.6% | +10.2% |
| PL + La Liga Only (Best) | 94.1% | +13.9% |
| Home Favorites Only | 78.3% | +9.1% |
Key finding: the Premier League + La Liga only strategy reached 94.1% accuracy on qualifying high-confidence selections in the 1.10-1.50 odds range.
The model excels at identifying heavy favorites. Filtering for high-confidence selections (55%+) in the 1.10-1.50 odds range with market agreement produced consistent positive ROI across 100 backtest iterations.
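The selection filter and ROI calculation described above can be sketched as follows. Two assumptions are made for illustration: "market agreement" is interpreted as the model's pick matching the bookmaker's favourite, and ROI is computed with flat stakes as profit divided by total amount staked. The dictionary keys and function names are hypothetical.

```python
def qualifies(pick, min_prob=0.55, min_odds=1.10, max_odds=1.50):
    """True if a pick passes the confidence, odds-range, and agreement filters."""
    return (
        pick["model_prob"] >= min_prob
        and min_odds <= pick["odds"] <= max_odds
        # Assumed meaning of "market agreement": model and market favour the same side.
        and pick["model_pick"] == pick["market_favourite"]
    )


def flat_stake_roi(picks, stake=1.0):
    """Flat-stake ROI: total profit divided by total amount staked."""
    if not picks:
        return 0.0
    profit = sum(
        stake * (pick["odds"] - 1.0) if pick["won"] else -stake
        for pick in picks
    )
    return profit / (stake * len(picks))
```

A typical use is `flat_stake_roi([p for p in picks if qualifies(p)])`, evaluated only on matches from the out-of-sample period.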
FBref: comprehensive match statistics, team performance data, and historical results.
Understat: advanced metrics including expected goals (xG), shot maps, and match events.
Football-Data.co.uk: historical odds from multiple bookmakers for backtesting betting strategies.
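Historical odds become market-derived signals once they are converted to implied probabilities. A minimal sketch, assuming decimal odds for a three-way market and proportional normalisation to remove the bookmaker margin (one common method among several; the function name is hypothetical):

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert three-way decimal odds to margin-free probabilities.

    Returns ([p_home, p_draw, p_away], margin), where margin is the
    bookmaker's overround (sum of raw implied probabilities minus 1).
    """
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    total = sum(raw)
    # Proportional normalisation: scale so the probabilities sum to 1.
    return [p / total for p in raw], total - 1.0
```

For context, the raw implied probability of a single selection is `1 / odds`, so decimal odds of 1.10 and 1.50 correspond to roughly 90.9% and 66.7% before the margin is removed.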
True Out-of-Sample Testing
The most important thing I learned: backtesting without temporal discipline is worthless. Every result in this system comes from true out-of-sample testing. The models train on data from before 2023 and are tested on 2023-2026 matches, which they never saw during training. No feature leakage, no look-ahead bias.
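The temporal discipline described here comes down to splitting strictly by date. The sketch below shows a single cutoff split and an expanding-window walk-forward variant. It assumes each match is a dictionary with a `date` field and uses calendar-year boundaries for simplicity; real seasons span two calendar years, so the actual boundaries in this system may differ. The function names are hypothetical.

```python
from datetime import date


def temporal_split(matches, cutoff):
    """Train on matches strictly before `cutoff`, test on the rest."""
    train = [m for m in matches if m["date"] < cutoff]
    test = [m for m in matches if m["date"] >= cutoff]
    return train, test


def walk_forward_folds(matches, test_years):
    """Yield (year, train, test) with an expanding training window.

    Each fold trains on everything before 1 January of `year` and tests
    on matches played during that calendar year.
    """
    for year in test_years:
        start, end = date(year, 1, 1), date(year + 1, 1, 1)
        train = [m for m in matches if m["date"] < start]
        test = [m for m in matches if start <= m["date"] < end]
        yield year, train, test
```

For the split described above, the call would be `temporal_split(matches, date(2023, 1, 1))`. Any rolling features should be computed from past matches only before the split is applied, so that test rows carry no information from their own or later results.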