CageMetrics

The CageMetrics fight predictor

A gradient-boosted classifier predicts the winner of every upcoming UFC bout from 40 features, all computed before the fight starts. This page describes the data, the model choice, training, and current performance on fights the model was never shown.

Last retrained Jun 17, 2026

GBT accuracy: 67.2% on 354 holdout fights since Jun 14, 2025
Δ vs Elo alone: +10.2pp (Elo by itself gets 57.1% on the same fights)
Trained on: 8,057 labelled UFC fights before Jun 14, 2025
Features: 40 per fight, all leak-free

What we're predicting

For every completed UFC bout in our database, the label is binary: 1 if fighter A won, 0 if fighter B did. The model learns to predict that label as a probability between 0 and 1. Draws and no-contests are excluded from training because they carry no directional signal.
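A minimal sketch of that labelling rule. The outcome codes ("A", "B", "draw", "nc") are hypothetical stand-ins; the page does not say how results are stored.

```python
from typing import Optional

def make_label(winner: Optional[str]) -> Optional[int]:
    """Return 1 if fighter A won, 0 if fighter B won, None for anything else."""
    if winner == "A":
        return 1
    if winner == "B":
        return 0
    return None  # draw or no-contest: no directional signal, so excluded

# Hypothetical raw outcomes for five completed bouts.
results = ["A", "B", "draw", "A", "nc"]

# Keep only the bouts that produced a usable binary label.
labels = [y for y in map(make_label, results) if y is not None]
```

Here `labels` keeps three of the five bouts; the draw and the no-contest never reach training.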

The probability is the model's raw output. The homepage and event pages turn it into the "CageMetrics pick X%" badge on upcoming fights.

Why prediction in MMA is difficult

Several things make this a hard problem.

A fight can turn on a single punch in the last ten seconds. No amount of feature engineering removes that variance from the data. We can model the bulk of the signal; the tail is irreducibly random.

The dataset is also small by modern machine-learning standards. About 8,500 UFC fights have full per-fighter stats. That is enough to train a careful tabular model. It is well short of what deep learning typically uses.

Roster turnover is heavy. The lightweight division in 2008 has almost no fighters in common with the same division in 2026. The model has to extrapolate across eras. Every fighter in the data also passed the UFC's signing bar at some point, so the sample is filtered. The model cannot see the alternate-reality version of an unsigned regional fighter who would have actually won.

The betting market is the practical ceiling. It reaches around 69% accuracy on UFC outcomes using every piece of information humanly available. Anything in the low 60s is doing real work.

How we chose the model family

The dataset is tabular, mixed-scale, and small. Several families of model were on the table.

Logistic regression

Simple, fast, and easy to read. It assumes linear relationships in feature space, which several of our features break. Age has a sweet spot in the late 20s, not a monotonic effect. Days since last fight is fine at 90 and fine at 365, then becomes a concern at 1,000. Capturing these effects in a logistic model would require manually engineered polynomial or bucketed features, which gets fragile.

Random forests

Random forests handle nonlinearity, do not need feature scaling, and cope with missing values. Each tree is trained independently against the full label and the forest averages their votes, which reduces variance but not bias. The averaged probabilities also tend to be less well calibrated than gradient boosting's. We care about that. A fighter we list as a 70% favourite should win roughly 70 of every 100 such matchups for the page to be honest.

Neural networks

The standard advice for tabular data at this size is to skip them. With 8,500 rows and 40 features, a neural network would over-fit before it found anything a tree-based model could not. The interpretability story is also worse: there is no clean equivalent of "feature importance" to point at when explaining a pick.

Gradient-boosted trees

Each tree corrects the residual errors of the ensemble so far. The result is well-calibrated probabilities and automatic learning of interactions. Tree splits are scale-invariant, so the model is indifferent to Elo living near 1,500 and win rate sitting between 0 and 1. sklearn's GradientBoostingClassifier trains the whole model in under a second on our data, which lets us retrain from scratch every day instead of persisting weights.

Why not XGBoost / LightGBM / CatBoost

The fancier boosting libraries would probably gain a few basis points of accuracy. The added complexity (native libraries, separate model artifacts, version pinning) is not worth it when the model retrains in under a second and the accuracy gap is small. sklearn ships with the existing container and the production code path is one fit() call.

The features

Each fight becomes one vector of 40 numeric features. Every value is computed from data that exists before the bout starts.

Skill (3)

Pre-fight Elo for each fighter and the gap between them. This group dominates the feature importances.

Rolling form, last 5 fights (30)

Mirrored A and B blocks. Each block has the fighter's win rate, finish rate, decision rate, average chaos score, per-minute significant strikes (landed, attempted, accuracy), per-minute takedowns (landed, attempted, accuracy), control-time fraction, knockdowns per fight, submission attempts per fight, prior fight count, and days since their last bout.

Bio differentials (3)

Reach, height, and age difference, all computed as A minus B.

Raw ages (2)

Each fighter's absolute age in addition to the diff. Together these let the tree learn that a 10-year gap between a 22-year-old and a 32-year-old plays out differently from the same gap between a 32-year-old and a 42-year-old, even though the diff is identical.

Context (2)

Title-fight flag and the scheduled rounds (3 for a regular bout, 5 for a championship or main event).

Leak prevention

The easiest way to mislead yourself on this kind of backtest is to leak the outcome into the features. Our rolling-window stats use a fighter's last five UFC bouts. If one of those five is the fight we are trying to predict, the model has been told the answer.

The features module walks fights in date order. For every fight, the feature vector is built first. Only after that is the fight added to each fighter's rolling history. The last-five-fights window for any given prediction therefore cannot contain the target. The same rule applies to pre-fight Elo: we use the rating from the moment before the K-factor update for that bout, not the rating after.
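The ordering rule can be sketched as follows. This is an illustration, not the actual features module: the record keys (`date`, `a`, `b`, `a_won`), the single win-rate feature, and the 0.0 default for fighters with no history are all assumptions.

```python
from collections import defaultdict, deque

WINDOW = 5  # last five UFC bouts, as described above

def win_rate(results):
    """Share of wins in the window; 0.0 for a debutant (an assumed default)."""
    return sum(results) / len(results) if results else 0.0

def build_rows(fights):
    """Walk fights in date order and emit one leak-free row per bout."""
    history = defaultdict(lambda: deque(maxlen=WINDOW))
    rows = []
    for fight in sorted(fights, key=lambda f: f["date"]):
        a, b = fight["a"], fight["b"]
        # Step 1: build features from history only. The current bout
        # has not been recorded yet, so it cannot appear in the window.
        rows.append({
            "a_win_rate": win_rate(history[a]),
            "b_win_rate": win_rate(history[b]),
            "label": int(fight["a_won"]),
        })
        # Step 2: only now add the outcome to each fighter's history.
        history[a].append(fight["a_won"])
        history[b].append(not fight["a_won"])
    return rows

# Hypothetical fights, deliberately out of date order (ISO dates sort as text).
fights = [
    {"date": "2024-01-01", "a": "X", "b": "Y", "a_won": True},
    {"date": "2024-06-01", "a": "X", "b": "Z", "a_won": False},
    {"date": "2024-03-01", "a": "Y", "b": "X", "a_won": False},
]
rows = build_rows(fights)
```

In the last bout X loses, yet X's pre-fight win rate in that row is still 1.0, because the loss is appended only after the row is built.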

Training methodology

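The numbers at the top of this page come from a one-year holdout split. The model trains on the 8,057 labelled fights before Jun 14, 2025 and is scored on the 354 fights since then.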

The holdout is recomputed at the end of every daily training run. The numbers on this page slide forward in time with the cron schedule rather than freezing at a fixed cutoff.
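A date-based split of this kind might look like the sketch below. How the cutoff is chosen on each daily run is not described here, so it is passed in explicitly; the record layout is hypothetical.

```python
from datetime import date

def split_by_date(fights, cutoff):
    """Train on fights strictly before the cutoff, hold out the rest."""
    train = [f for f in fights if f["date"] < cutoff]
    holdout = [f for f in fights if f["date"] >= cutoff]
    return train, holdout

# Cutoff taken from the summary at the top of the page.
CUTOFF = date(2025, 6, 14)

# Hypothetical labelled fights on either side of the cutoff.
fights = [
    {"date": date(2024, 11, 2), "label": 1},
    {"date": date(2025, 6, 13), "label": 0},
    {"date": date(2025, 6, 14), "label": 1},
    {"date": date(2026, 1, 24), "label": 0},
]
train, holdout = split_by_date(fights, CUTOFF)
```

Splitting on time rather than at random keeps every holdout fight in the model's future, which is the situation the live picks face.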

Hyperparameters

The model uses 400 boosting rounds with a tree depth of 2 and a learning rate of 0.05. Each tree sees a 0.8 subsample of the training data. Every leaf contains at least 20 fights. The random seed is fixed so results are reproducible across runs.
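In scikit-learn terms, those settings would map onto GradientBoostingClassifier arguments roughly as below. The seed value is not stated on this page, so 0 is a placeholder, and `build_model` is a hypothetical helper name.

```python
# Hyperparameters as stated above, keyed by scikit-learn argument name.
GBT_PARAMS = {
    "n_estimators": 400,     # boosting rounds
    "max_depth": 2,          # tree depth
    "learning_rate": 0.05,
    "subsample": 0.8,        # fraction of training rows seen by each tree
    "min_samples_leaf": 20,  # every leaf contains at least 20 fights
    "random_state": 0,       # fixed seed; the actual value is an assumption
}

def build_model():
    """Return an unfitted classifier (scikit-learn is imported lazily)."""
    from sklearn.ensemble import GradientBoostingClassifier
    return GradientBoostingClassifier(**GBT_PARAMS)
```

A daily run would then be `model = build_model()` followed by `model.fit(X_train, y_train)`, with `model.predict_proba(X)[:, 1]` giving the probability that fighter A wins.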

The tree depth was 3 until we added the raw-age features. After that, depth-3 started over-fitting on the larger feature set, and we cut to depth-2.

Results: head to head with Elo

On the 354 held-out fights, the model beats raw Elo on every metric we track.

Metric GBT Elo alone
Accuracy 67.23% 57.06%
Log-loss 0.6365 0.6691
Brier score 0.2223 0.2379
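For reference, the three metrics can be computed from labels and predicted probabilities as in this standard-library sketch. It uses the textbook definitions; the function names are illustrative and not taken from the CageMetrics code.

```python
import math

def accuracy(y_true, p):
    """Share of fights where the side given at least 50% actually won."""
    hits = sum((prob >= 0.5) == (y == 1) for y, prob in zip(y_true, p))
    return hits / len(y_true)

def log_loss(y_true, p, eps=1e-15):
    """Mean negative log-likelihood of the observed outcomes."""
    total = 0.0
    for y, prob in zip(y_true, p):
        prob = min(max(prob, eps), 1 - eps)  # guard against log(0)
        total += -math.log(prob) if y == 1 else -math.log(1 - prob)
    return total / len(y_true)

def brier_score(y_true, p):
    """Mean squared error between predicted probability and outcome."""
    return sum((prob - y) ** 2 for y, prob in zip(y_true, p)) / len(y_true)

# A constant 50/50 forecast is the natural reference point.
y = [1, 0, 1, 0]
coin_flip = [0.5] * len(y)
baseline_log_loss = log_loss(y, coin_flip)  # ln 2
baseline_brier = brier_score(y, coin_flip)  # 0.25
```

Lower is better for both log-loss and Brier score. A coin-flip forecast scores ln 2 ≈ 0.693 and 0.25, so both columns in the table sit below that baseline.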

Results: vs the betting market

A fight predictor is most usefully judged against the betting market. The market sees what we see and a lot we do not. It sets the practical ceiling. The comparison below is restricted to the 295 holdout fights where Polymarket priced both fighters.

Metric Market GBT Elo alone
Accuracy 70.17% 67.46% 56.61%
Log-loss 0.5959 0.6366 0.6704
Brier score 0.2046

The market beats the model by about 2.7 percentage points on this slice (70.2% vs 67.5%). That gap is roughly the information advantage the market has. Raw Elo alone trails the market by 13.6 points. Layering the rolling-form, bio, and age features on top of Elo closes about 80% of that spread.
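The share of the Elo-to-market spread that the model closes follows directly from the three accuracy figures in the table. A short worked check (the function name is illustrative):

```python
def share_of_gap_closed(baseline, model, ceiling):
    """Fraction of the baseline-to-ceiling spread recovered by the model."""
    return (model - baseline) / (ceiling - baseline)

# Accuracy (%) on the 295 holdout fights priced by Polymarket.
elo, gbt, market = 56.61, 67.46, 70.17

closed = share_of_gap_closed(elo, gbt, market)  # 10.85 / 13.56
remaining_pp = market - gbt                     # points still owed to the market
```

This works out to roughly 0.80 of the spread closed, with about 2.7 points remaining.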

What the market sees that we do not

The day-of weigh-in. Open-workout footage. Movement on the line itself, which encodes professional bettor consensus. Camp interviews. Sometimes late injuries that never reach public stats. None of that is in our feature pipeline. The goal is not to match the market. It is to close enough of the spread that the prediction is informative on its own.

Calibration

Accuracy alone does not tell you whether the probabilities are honest. A model that picks every favourite at 99% confidence and gets 65% right is useless when asked how sure to be. The table below bins predictions by the model's confidence in the favourite and shows how often that favourite actually won.

Confidence bin n Model said Favourite won Market avg (n) Market won
0.50–0.60 167 55.0% 61.7% 55.0% (137) 62.8%
0.60–0.70 117 64.5% 71.8% 65.0% (96) 70.8%
0.70–0.80 59 73.4% 72.9% 74.1% (52) 71.2%
0.80–0.90 11 81.6% 72.7% 75.9% (10) 80.0%

The Market avg and Market won columns are restricted to fights in each bin that Polymarket priced on both sides. Sample size is in parentheses. The market column reports the implied probability of the same fighter the model picked, so a row's Model said vs Market avg shows whether the model agrees with the market on that fighter. A Market avg under 50% means the market sided with the underdog on those fights.

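A row is well calibrated when Model said and Favourite won match within a couple of points. In the table above only the 0.70–0.80 row meets that bar. The two lower bins are under-confident by about 7 points: favourites the model put near 55% and 65% won 61.7% and 71.8% of the time. The top bin, 0.80 and above, is usually a little optimistic: the model treats an 85% favourite as closer to a sure thing than the data actually warrants, and here it holds only 11 fights. This is a known pathology of any model trained on a small sample. Extreme probabilities get extrapolated rather than measured.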
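A table like the one above can be produced by binning each prediction on the model's confidence in its own favourite. The sketch below assumes 0.1-wide bins and drops empty ones; the function and variable names are hypothetical.

```python
from bisect import bisect_right

LOWER_EDGES = [0.5, 0.6, 0.7, 0.8, 0.9]  # assumed bin boundaries

def calibration_table(y_true, p_a_wins):
    """Rows of (bin lower edge, n, mean confidence, favourite win rate)."""
    bins = [{"n": 0, "conf_sum": 0.0, "fav_wins": 0} for _ in LOWER_EDGES]
    for y, p in zip(y_true, p_a_wins):
        conf = max(p, 1.0 - p)            # confidence in the favourite
        favourite_is_a = p >= 0.5
        fav_won = (y == 1) if favourite_is_a else (y == 0)
        i = bisect_right(LOWER_EDGES, conf) - 1
        bins[i]["n"] += 1
        bins[i]["conf_sum"] += conf
        bins[i]["fav_wins"] += int(fav_won)
    return [
        (lo, b["n"], b["conf_sum"] / b["n"], b["fav_wins"] / b["n"])
        for lo, b in zip(LOWER_EDGES, bins)
        if b["n"] > 0                     # empty bins are omitted
    ]

# Toy inputs, not CageMetrics data: label 1 means fighter A won.
y = [1, 0, 1, 0]
p = [0.55, 0.45, 0.75, 0.72]
table = calibration_table(y, p)
```

In this toy case the 0.5 bin holds two fights whose favourites both won, and the 0.7 bin holds two fights with one favourite winning.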

What the model actually uses

Feature importance is each feature's share of the model's total splitting power, so the values sum to 1 across all features. The top 15:

Elo rating gap (A − B): 0.240
Age difference (years): 0.157
Fighter A, days since last fight: 0.064
Fighter B, days since last fight: 0.053
Fighter A, age: 0.046
Fighter B, age: 0.041
Fighter A, avg chaos: 0.029
Fighter B, sig strikes thrown / min: 0.029
Fighter A, control time %: 0.028
Fighter B, avg chaos: 0.026
Fighter A, sig-strike accuracy: 0.026
Fighter A, sig strikes landed / min: 0.025
Fighter B, sig-strike accuracy: 0.023
Fighter B, control time %: 0.021
Fighter B, Elo rating: 0.019

Two patterns worth noting. The Elo gap is by far the biggest single feature, which is why we treat Elo as the foundation and the model as a correction layer rather than the other way around. Age also matters in more than one way: the age difference between fighters is high in the importances, but each fighter's individual age is also high. A 10-year gap from 22 to 32 plays differently from the same gap at 32 to 42, even though the difference is identical. We feed the model both the raw ages and the difference so it can learn the absolute-age effect.
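A ranking like this can be produced by sorting feature names against their importances. With a fitted scikit-learn GradientBoostingClassifier the importances would come from `model.feature_importances_`; the sketch below takes them as a plain list so it runs on its own. The helper name and the short feature list are illustrative.

```python
def top_features(names, importances, k=15):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Four of the values listed above, deliberately out of order.
names = [
    "Fighter A age",
    "Elo rating gap (A - B)",
    "Fighter B Elo rating",
    "Age difference (years)",
]
importances = [0.046, 0.240, 0.019, 0.157]

top_two = top_features(names, importances, k=2)
```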

Things we tried that didn't work

Not every idea that looked good on paper survived a seed-averaged ablation.

We tried using rolling decision rounds-won as a model feature. The same metric drives the continuous decision-MOV in our Elo. As a separate predictor input it looked promising at one random seed (about +0.8pp). Averaged across seven seeds it came out as a wash or a small drag. The signal it carries, decision dominance, is already captured by the existing decision-rate, average-chaos, and win-rate features. It still does real work inside the Elo, so we kept it there and dropped it from the predictor.
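The seed-averaging check can be sketched as below. `evaluate` stands for a hypothetical function that retrains with a given seed, with or without the candidate feature, and returns holdout accuracy. The toy numbers are invented to show the pattern and are not CageMetrics results.

```python
from statistics import mean

def seed_averaged_lift(evaluate, seeds):
    """Mean accuracy change from adding a feature, averaged over seeds."""
    deltas = [evaluate(seed, True) - evaluate(seed, False) for seed in seeds]
    return mean(deltas), deltas

# Invented accuracies (%) with the candidate feature, keyed by seed.
TOY_WITH_FEATURE = {0: 67.8, 1: 66.9, 2: 66.7, 3: 67.1, 4: 66.8, 5: 67.0, 6: 66.6}
TOY_WITHOUT_FEATURE = 67.0

def toy_evaluate(seed, with_feature):
    """Stand-in for a real retrain-and-score run."""
    return TOY_WITH_FEATURE[seed] if with_feature else TOY_WITHOUT_FEATURE

mean_delta, deltas = seed_averaged_lift(toy_evaluate, range(7))
```

In the toy data seed 0 alone shows a gain of about 0.8 points, while the seven-seed mean is slightly negative, which is how a single-seed result can mislead.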

An earlier version of the Elo applied an inactivity decay, bleeding a fighter's rating toward 1,500 after a year of layoff. It backtested at +1.6pp accuracy. It also predicted Jon Jones would walk into the Gane fight with 427 fewer Elo points than he actually had, and would have systematically punished every comeback the sport has produced. We reverted it.

Tree depth was 3 when the feature set was smaller. After adding raw ages, depth-3 began over-fitting. We dropped to depth-2 and the lift came back.

What the model can't see

Some things a useful fight predictor would need are not in our data.

Style matchups. The model knows a fighter's takedown rate and sig-strike rate. It does not know that this wrestler's takedowns will fail against that defensive specialist. Head-to-head style data is sparse and difficult to model directly.

Camp and injury context. Recent sparring footage, weight-cut updates, undisclosed injuries: none of it reaches public stats. The betting market picks up some of this through line movement.

Live conditions. Altitude, late opponent changes, rehydration after weigh-in. None of these are in the feature set.

None of these are easy to add without compromising the data pipeline. The gap to the betting market is the budget we have for missing context.