Machine LearningActive

Quantitative Sports Prediction Engine

A production-grade machine learning system that ingests 9+ years of football data, engineers predictive features, and generates match outcome predictions across Europe's top 5 leagues.

Complete system walkthrough: data pipeline, ML architecture, and results

14,000+
Matches Processed
42
Features Engineered
72-83%
Walk-Forward Accuracy
5
European Leagues

Overview

What started as a data pipeline evolved into a complete prediction system. The question shifted from "can historical football data reveal patterns?" to "can we actually beat the market?"

The answer required three things: clean data, rigorous backtesting, and honest out-of-sample validation. I built all three. The system now covers the complete workflow: data ingestion from multiple sources, feature engineering with 42 metrics per match, ensemble machine learning models, walk-forward backtesting, and live prediction generation.

System Architecture

01

Data Ingestion

Pull match data from FBref, Understat, and Football-Data.co.uk with rate limiting and error handling.

02

Feature Engineering

Calculate 42 features per match: rolling form, head-to-head, xG metrics, league position, and market-derived signals.

03

Ensemble Model

XGBoost + Random Forest with calibrated probabilities. Automatic weight optimisation for best combination.

04

Walk-Forward Validation

True out-of-sample testing: train on pre-2023 data, test on 2023-2026. No feature leakage, no look-ahead bias.

05

Prediction Output

Match probabilities, confidence scores, and accumulator candidates with selection criteria filtering.

Backtest Results

100 backtest iterations testing different strategies and parameters. All results from true out-of-sample testing.

StrategyAccuracyROI
Walk-Forward Baseline72.0%+6.5%
Exclude Bundesliga75.6%+10.2%
PL + La Liga OnlyBest94.1%+13.9%
Home Favorites Only78.3%+9.1%

Key finding: Premier League + La Liga only strategy showed remarkable 94.1% accuracy on qualifying high-confidence selections in the 1.10-1.50 odds range.

Accumulator Strategy

~60%
2-fold Win Rate
~45%
3-fold Win Rate
+102%
Bankroll Simulation ROI

The model excels at identifying heavy favorites. Filtering for high-confidence selections (55%+) in the 1.10-1.50 odds range with market agreement produced consistent positive ROI across 100 backtest iterations.

Multi-League Coverage

Premier League
England
La Liga
Spain
Bundesliga
Germany
Serie A
Italy
Ligue 1
France

Data Sources

FB

FBref

Comprehensive match statistics, team performance data, and historical results.

US

Understat

Advanced metrics including expected goals (xG), shot maps, and match events.

FD

Football-Data

Historical odds from multiple bookmakers for backtesting betting strategies.

Walk-Forward Validation

True Out-of-Sample Testing

The most important thing I learned: backtesting without temporal discipline is worthless. Every result in this system comes from true out-of-sample testing. Train on data before 2023, test on 2023-2026, data the model never saw during training. No feature leakage, no look-ahead bias.

Tech Stack

Python 3.12+
Core language
Pandas & NumPy
Data manipulation
scikit-learn
ML framework
XGBoost
Gradient boosting
PostgreSQL
Data storage (Supabase)
Next.js
Dashboard frontend
pytest
Test framework (TDD)
GitHub Actions
CI/CD

Key Achievements

  • Walk-forward validation protocol ensuring all results are truly out-of-sample
  • 94.1% accuracy on filtered heavy-favorite selections (Premier League + La Liga)
  • Multi-league expansion from EPL-only to 5 leagues with 614 fixtures added in a single session
  • Incremental refresh system reducing API calls by 95%
  • 100 backtest iterations testing different strategies and parameters
  • Live prediction system generating picks for 48 matches across 5 leagues
  • Bankroll simulations showing +102% ROI potential with combined betting strategy
  • Full test coverage with TDD methodology throughout