The CageMetrics Elo
A modified Elo rating updates every fighter's number after every UFC fight. The math underneath is the same system Arpad Elo wrote for chess in the 1960s. Our modifications make the rating respond to how a UFC fight was won, not just who won it: a 30-second knockout should not move the rating by the same amount as a five-round split decision.
What Elo measures
Elo assigns every player in a competitive pool a single number representing their skill relative to the rest of the pool. Two ratings, passed through a logistic curve, give you the predicted probability that one player beats the other. After each match both ratings update. Winners gain, losers lose. The size of the swing depends on how surprising the result was. Beating a much higher-rated opponent moves your rating more than beating someone you were already expected to beat.
The math:
expected_a = 1 / (1 + 10^((rating_b - rating_a) / 400))
rating_a += K × (score_a - expected_a)
Here score_a is 1 for a win, 0.5 for a draw, 0 for a loss. K is the step size that controls how aggressively the rating responds to new results.
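The two formulas translate directly into code. Below is a minimal Python sketch. The function names are ours, chosen for illustration. Both fighters' updates are computed from their pre-fight ratings, and each side can be given its own K.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A beats B (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating: float, opponent: float, score: float, k: float) -> float:
    """One side of the Elo update: score is 1 (win), 0.5 (draw) or 0 (loss)."""
    return rating + k * (score - expected_score(rating, opponent))


# Two fresh 1500s: each is expected to score 0.5, so a win at K = 64 is worth 32 points.
a, b = 1500.0, 1500.0
a_new = updated_rating(a, b, 1.0, 64)  # 1532.0
b_new = updated_rating(b, a, 0.0, 64)  # 1468.0 (uses a's pre-fight rating)
print(a_new, b_new)
```

The surprise term does the work. Beating an opponent rated 400 points higher has an expected score of 1/11, so at K = 64 the win is worth about 58 points rather than 32.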
Why classical Elo fits MMA poorly
Chess Elo treats every game as a binary outcome. MMA is not that. A 30-second KO and a razor-thin split decision both count as wins, but they tell you very different things about the gap between the two fighters. Treating them identically throws away information the bout actually produced.
Using the chess version on MMA runs into three problems.
First, margin of victory carries real information. A 30-27 / 30-27 / 30-27 win is clearly more decisive than a 29-28 / 29-28 / 28-29 split. A rating system that updates them by the same amount is leaving signal on the table.
Second, new fighters need to converge to their true rating quickly. A debutant whose actual skill is around 1,700 should not need 20 fights to climb out of the default 1,500.
Third, MMA is high variance. Even with margin-of-victory scaling, single-bout noise stays high. The rating system has to absorb that noise without overreacting to any single result.
What we changed
1. Provisional K-factor for newcomers
A fighter's first 5 UFC bouts use K = 96 instead of the standard 64. Their early rating is the least trustworthy number we have, so larger updates let it converge faster to whatever it should be. After 5 fights the K-factor drops to the standard value and the rating stabilises.
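Here is a sketch of the provisional rule. It assumes K is chosen per fighter from that fighter's own count of prior UFC bouts, so a debutant and a veteran in the same bout can use different step sizes. `base_k` and `prior_ufc_bouts` are hypothetical names.

```python
BASE_K = 64
PROVISIONAL_K = 96
PROVISIONAL_BOUTS = 5


def base_k(prior_ufc_bouts: int) -> int:
    """K before any margin-of-victory scaling.

    prior_ufc_bouts counts UFC bouts *before* this one, so a fighter's 1st
    through 5th bouts (prior counts 0-4) get the provisional step.
    """
    return PROVISIONAL_K if prior_ufc_bouts < PROVISIONAL_BOUTS else BASE_K


print([base_k(n) for n in range(7)])  # [96, 96, 96, 96, 96, 64, 64]
```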
2. Margin-of-victory scaling on K
Rather than a single K for every result, K is multiplied by a factor that depends on how the fight was won.
- Clean finish (KO/TKO, submission, doctor's stoppage): K × 1.80
- Decision: K × a continuous value between 0.50 (razor-thin split) and 1.80 (perfect sweep on every card). The interpolation is described below.
- DQ, overturned, or no-contest: K × 1.00 (the baseline)
The multipliers came from a parameter sweep over every UFC fight in our database, which scores each combination on accuracy, log-loss, and Brier score.
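The outcome bands can be expressed in code as follows. The outcome labels (`"finish"`, `"decision"`, anything else) are a hypothetical encoding. The decision multiplier is passed in rather than computed here. Take two evenly rated established fighters (expected score 0.5, base K 64). The winner gains 64 × multiplier × 0.5 points: about 58 for a finish, 32 for a DQ win, and as little as 16 for the closest decision.

```python
from typing import Optional

FINISH_MULT = 1.80             # KO/TKO, submission, doctor's stoppage
NEUTRAL_MULT = 1.00            # DQ, overturned, no-contest
DECISION_RANGE = (0.50, 1.80)  # razor-thin split .. perfect sweep


def outcome_multiplier(outcome: str, decision_mult: Optional[float] = None) -> float:
    """Factor applied to K for one bout.

    `outcome` is a hypothetical label: "finish", "decision", or anything else,
    which is treated as DQ / overturned / no-contest. Decisions supply their
    own multiplier, which must lie inside DECISION_RANGE.
    """
    if outcome == "finish":
        return FINISH_MULT
    if outcome == "decision":
        low, high = DECISION_RANGE
        if decision_mult is None or not low <= decision_mult <= high:
            raise ValueError("decision multiplier must lie in [0.50, 1.80]")
        return decision_mult
    return NEUTRAL_MULT


# Winner's gain between evenly rated established fighters: K * multiplier * (1 - 0.5).
for outcome, mult in [("finish", None), ("dq", None), ("decision", 0.50)]:
    print(outcome, 64 * outcome_multiplier(outcome, mult) * 0.5)
```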
3. Continuous decision-MOV via judge rounds-won
This is the recent change worth dwelling on. It also makes our rating behave differently from other public MMA Elos.
A decision is a vote. The three judges score the fight round by round under the 10-point must system, and whoever has the higher total on two of three cards wins. ufcstats publishes those totals as part of every decision result. On the Reyes vs Walker bout in March 2024, for example, the scorecards read: Sal D'Amato 28 - 29; Solimar Miranda 29 - 28; Junichiro Kamijo 28 - 29.
Under the 10-point must system, the winner of each round scores 10 and the loser scores 9 (or 8, or 7, when the round is dominant enough). When every round on a card is scored 10-9, the totals alone tell us how many rounds each fighter won on that judge's card:
rounds_won = max(0, score − 9 × scored_rounds)
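The formula follows from one line of algebra. Suppose a fighter wins r of the n scored rounds (scored_rounds) on one card, and every round is scored 10-9:

$$
\text{score} = 10r + 9(n - r) = 9n + r
\quad\Longrightarrow\quad
r = \text{score} - 9n
$$

For Reyes on D'Amato's card, 29 − 9 × 3 = 2. A 10-8 round takes one more point off the fighter who lost it, which can drive the difference below zero. That is what the max(0, ·) clamp catches.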
The Reyes split decoded (by ufcstats convention the loser's score is listed first, then the winner's):
- Walker (lost the bout): 1 + 2 + 1 = 4 rounds across the three cards
- Reyes (won the bout): 2 + 1 + 2 = 5 rounds
That is a fair read of a razor-thin split. Reyes won because two of three judges had him ahead on points, but the judge-round count was essentially a coin flip: 5 of 9 against 4 of 9. Across the 3,956 parsed decisions in our database, decisive wins land at 7 to 9 of 9 rounds, while the closest splits push down toward 5 of 9.
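The decoding step fits in a short Python sketch. The (loser_total, winner_total) tuple layout mirrors the ufcstats order described above. The function names are illustrative, not taken from CageMetrics code.

```python
from typing import List, Tuple


def rounds_won(score: int, scored_rounds: int) -> int:
    """Rounds a fighter won on one card, assuming every round is scored 10-9."""
    return max(0, score - 9 * scored_rounds)


def decode_cards(cards: List[Tuple[int, int]], scored_rounds: int) -> Tuple[int, int]:
    """Sum rounds won across all judges.

    Each card is (loser_total, winner_total), the order ufcstats uses.
    Returns (loser_rounds, winner_rounds).
    """
    loser = sum(rounds_won(lo, scored_rounds) for lo, _ in cards)
    winner = sum(rounds_won(hi, scored_rounds) for _, hi in cards)
    return loser, winner


# Reyes vs Walker: D'Amato 28-29, Miranda 29-28, Kamijo 28-29 (loser first).
reyes_walker = [(28, 29), (29, 28), (28, 29)]
loser_r, winner_r = decode_cards(reyes_walker, scored_rounds=3)
print(loser_r, winner_r)                   # 4 5
print(winner_r / (len(reyes_walker) * 3))  # winner's share of judge-rounds
```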
The fraction (the winner's rounds won divided by the total judge-rounds) drives the continuous decision multiplier. We interpolate linearly between 0.50 at fraction 0.5 (a hypothetical even split on the rounds) and 1.80 at fraction 1.0 (a 30-27 / 30-27 / 30-27 shutout).
A 30-27 sweep on all three cards (9 of 9 rounds) moves the rating by K × 1.80, the same as a clean finish. A 29-28 / 29-28 / 28-29 squeaker (5 of 9 rounds) moves it by about K × 0.64, near the bottom of the range. Even a narrow unanimous 29-28 / 29-28 / 29-28 (6 of 9 rounds) lands at about K × 0.93, well below the flat ×1.30 every unanimous decision used to get. The rating now reflects how dominant the win actually was instead of treating every decision in a band as identical.
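The interpolation looks like this in code. The text does not cover one case. In principle a split winner can hold fewer than half the judge-rounds: winner-side totals of 29-28, 29-28 and 27-30 give the winner 4 of 9. This sketch assumes the fraction is clamped to 0.5 in that case, so such wins get the 0.50 floor. The names are hypothetical.

```python
CLOSE_END = 0.50  # multiplier at fraction 0.5
SWEEP_END = 1.80  # multiplier at fraction 1.0


def decision_multiplier(winner_rounds: int, total_judge_rounds: int) -> float:
    """Linear interpolation of the K multiplier on the winner's share of judge-rounds."""
    fraction = winner_rounds / total_judge_rounds
    # Assumed clamp: shares below one half (possible in lopsided splits) get the floor.
    fraction = min(1.0, max(0.5, fraction))
    t = (fraction - 0.5) / 0.5
    return CLOSE_END + t * (SWEEP_END - CLOSE_END)


print(round(decision_multiplier(9, 9), 2))  # 1.8  (30-27 x3)
print(round(decision_multiplier(6, 9), 2))  # 0.93 (29-28 x3)
print(round(decision_multiplier(5, 9), 2))  # 0.64 (29-28, 29-28, 28-29)
```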
ufcstats lists the judge totals as "loser's score - winner's score" rather than "fighter A - fighter B". We figured this out by cross-checking against bouts where the eventual winner appeared on either side. The scorecards are oriented per fight using the recorded winner.
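Here is a sketch of the orientation step. It assumes two things. First, the raw judges text looks like `Sal D'Amato 28 - 29. Solimar Miranda 29 - 28. Junichiro Kamijo 28 - 29.` Second, the bout's winner and loser are known from the result record. The regex and function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# judge name, then "first - second"; an optional trailing period ends each card
CARD_RE = re.compile(r"(?P<judge>[^\d]+?)\s+(?P<first>\d+)\s*-\s*(?P<second>\d+)\.?")


def parse_cards(text: str) -> List[Tuple[str, int, int]]:
    """Split a judges string into (judge, first_total, second_total) triples."""
    return [(m["judge"].strip(), int(m["first"]), int(m["second"]))
            for m in CARD_RE.finditer(text)]


def orient(cards: List[Tuple[str, int, int]], winner: str, loser: str) -> Dict[str, List[int]]:
    """Attach totals to fighters using the recorded winner.

    Each card lists the loser's score first, so the second number always
    belongs to the recorded winner, whichever corner they fought from.
    """
    return {
        loser: [first for _, first, _ in cards],
        winner: [second for _, _, second in cards],
    }


raw = "Sal D'Amato 28 - 29. Solimar Miranda 29 - 28. Junichiro Kamijo 28 - 29."
print(orient(parse_cards(raw), winner="Reyes", loser="Walker"))
# {'Walker': [28, 29, 28], 'Reyes': [29, 28, 29]}
```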
How the multipliers were tuned
The parameter sweep runs the full Elo system over every completed UFC fight chronologically for each combination of multipliers. It reports four numbers per combination.
- Accuracy on fights where both fighters had at least two prior UFC bouts. The minimum-priors filter removes early newcomer fights, where any model is mostly guessing.
- Log-loss, which penalises confidently wrong predictions more than mildly wrong ones.
- Brier score, another calibration measure that captures how far the probabilities are from the actual outcomes.
- The difference vs the K-only baseline, so we can see whether each new lever helps.
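Below is a compressed sketch of the backtest loop and the three scoring metrics. It makes three simplifications of ours, none of them details from the sweep itself. It uses only a K rule passed in as a function, with no margin-of-victory scaling. It skips draws when scoring. It counts a 50/50 prediction as a pick for fighter B. Each prediction is made from pre-fight ratings before the update, so no result leaks into its own forecast.

```python
import math
from collections import defaultdict
from typing import Callable, Iterable, Tuple


def backtest(
    fights: Iterable[Tuple[str, str, float]],
    k_for: Callable[[int], float],
    min_priors: int = 2,
    start: float = 1500.0,
) -> dict:
    """Run Elo over date-sorted fights and score each pre-fight prediction.

    fights: (fighter_a, fighter_b, score_a) with score_a in {1, 0.5, 0}.
    k_for:  maps a fighter's prior bout count to K (MOV scaling omitted).
    Only bouts where both fighters have >= min_priors prior bouts are scored.
    """
    ratings = defaultdict(lambda: start)
    bouts = defaultdict(int)
    hits, log_loss, brier, n = 0, 0.0, 0.0, 0
    for a, b, score_a in fights:
        # Predict from pre-fight ratings, before this result is seen.
        p = 1.0 / (1.0 + 10.0 ** ((ratings[b] - ratings[a]) / 400.0))
        if score_a != 0.5 and min(bouts[a], bouts[b]) >= min_priors:
            hits += int((p > 0.5) == (score_a == 1.0))
            q = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            log_loss -= score_a * math.log(q) + (1.0 - score_a) * math.log(1.0 - q)
            brier += (p - score_a) ** 2
            n += 1
        # Update both sides from the same pre-fight probability.
        ratings[a] += k_for(bouts[a]) * (score_a - p)
        ratings[b] += k_for(bouts[b]) * (p - score_a)
        bouts[a] += 1
        bouts[b] += 1
    if n == 0:
        return {"n": 0}
    return {"n": n, "accuracy": hits / n, "log_loss": log_loss / n, "brier": brier / n}
```

On a date-sorted fight list, compare `backtest(fights, lambda n: 40)` against `backtest(fights, lambda n: 96 if n < 5 else 64)`. This mirrors the flat-K versus provisional-K part of the vanilla comparison, without the MOV scaling.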
The continuous decision-MOV beat the old flat unanimous-×1.30 / split-×0.65 bands by about 0.21 percentage points of accuracy. It was also better on log-loss and Brier on the most recent backtest pass (8,700+ completed fights, 4,800+ between established fighters). We picked the 0.50 / 1.80 endpoints because they were the calibrated peak in the search. Pushing further (close 0.30, sweep 2.50) gained another 0.04 percentage points of accuracy but hurt log-loss. The probabilities got worse even though the binary picks improved. We chose the row that improved every metric.
Vs vanilla Elo
The natural question is how much these modifications actually buy you. We ran the same backtest on both systems. One was the production CageMetrics weights. The other was a vanilla Elo with a flat K-factor of 40, no provisional bump, and no margin-of-victory scaling. Both ran over the same 8,700-plus UFC fights chronologically and were scored on the 4,877 bouts where both fighters had at least two prior UFC results.
| Metric | CageMetrics Elo | Vanilla Elo (K=40) |
|---|---|---|
| Accuracy on established fights | 58.38% | 56.04% |
| Log-loss | 0.6778 | 0.6822 |
| Brier score | 0.2419 | 0.2446 |
The modifications gain 2.34 percentage points of accuracy and tighten both calibration metrics. In practical terms the modified system gets about 58 of every 100 picks right where the vanilla version gets about 56. The provisional K-factor does most of its work in the first half-dozen fights of a newcomer's career. The MOV multipliers do theirs across the board, with the continuous decision scaling contributing most of the late-stage gain.
The full parameter set
| Parameter | Value | Role |
|---|---|---|
| Starting rating | 1500 | Every new fighter enters with this rating |
| Base K-factor | 64 | Step size for established fighters |
| Provisional K-factor | 96 | Larger step for a fighter's first few bouts |
| Provisional bouts | 5 | Fights of provisional K before downshifting |
| Title-fight multiplier | 1.00 | Extra weight for title fights (currently a no-op; tested, didn't help) |
| Finish multiplier | 1.80 | K scaling on clean finishes |
| Decision: close end | 0.50 | K scaling at fraction 0.5 (razor-thin) |
| Decision: sweep end | 1.80 | K scaling at fraction 1.0 (every judge-round) |
| Unanimous fallback | 1.30 | Used for decisions with no parsed scorecard |
| Split/majority fallback | 0.65 | Used for split/majority decisions with no scorecard |
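The table can be carried around as one frozen config object. How the multipliers compose is our assumption. This sketch multiplies the base or provisional K by the outcome factor and then the title factor. It clamps the decision fraction to [0.5, 1.0]. When no scorecard was parsed, it falls back to the unanimous or split/majority band. Field and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EloParams:
    """Values from the parameter table; the structure is illustrative."""
    start_rating: float = 1500.0
    base_k: float = 64.0
    provisional_k: float = 96.0
    provisional_bouts: int = 5
    title_mult: float = 1.00          # currently a no-op
    finish_mult: float = 1.80
    decision_close: float = 0.50      # at fraction 0.5
    decision_sweep: float = 1.80      # at fraction 1.0
    unanimous_fallback: float = 1.30  # decision with no parsed scorecard
    split_fallback: float = 0.65      # split/majority with no parsed scorecard

    def k(self, prior_bouts: int, outcome: str, *, title: bool = False,
          fraction: Optional[float] = None, unanimous: bool = True) -> float:
        """Effective K for one fighter in one bout (composition is assumed)."""
        step = self.provisional_k if prior_bouts < self.provisional_bouts else self.base_k
        if outcome == "finish":
            mult = self.finish_mult
        elif outcome == "decision" and fraction is not None:
            t = (min(1.0, max(0.5, fraction)) - 0.5) / 0.5
            mult = self.decision_close + t * (self.decision_sweep - self.decision_close)
        elif outcome == "decision":
            mult = self.unanimous_fallback if unanimous else self.split_fallback
        else:  # DQ, overturned, no-contest: the 1.00 baseline
            mult = 1.00
        return step * mult * (self.title_mult if title else 1.0)


params = EloParams()
print(params.k(0, "finish"))                      # debutant, clean finish
print(params.k(12, "decision", fraction=5 / 9))   # veteran, razor-thin split
print(params.k(12, "decision", unanimous=False))  # veteran, split with no scorecard
```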
The current top of the board
To anchor what these numbers look like on real fighters, here is the active pound-for-pound top:
| # | Fighter | Division | Elo | Bouts |
|---|---|---|---|---|
| 1 | Jon Jones | Heavyweight | 2269 | 24 |
| 2 | Islam Makhachev | Welterweight | 2241 | 18 |
| 3 | Charles Oliveira | Lightweight | 2123 | 37 |
| 4 | Ciryl Gane | Heavyweight | 2103 | 14 |
| 5 | Justin Gaethje | Lightweight | 2082 | 16 |
| 6 | Kamaru Usman | Welterweight | 2082 | 19 |
| 7 | Alexander Volkanovski | Featherweight | 2072 | 18 |
| 8 | Ilia Topuria | Lightweight | 2069 | 10 |
The full board, with division-by-division leaderboards alongside the pound-for-pound view, is at /rankings/.
What we tried and didn't keep
A title-fight bonus. We tested multiplying K by 1.10, 1.25, and 1.50 for championship fights. None of them helped accuracy or log-loss. The bonus is still in the parameter set at ×1.00 in case a future tune wants to revisit it.
An inactivity decay on Elo. Bleeding a fighter's rating toward 1,500 after a year off backtested at +1.66 percentage points of accuracy. It also penalised every successful comeback the sport has produced. Jon Jones would have walked into the Gane fight having shed 427 points of Elo and then won the bout from a 1,717 rating. We reverted the feature: the average lift was not worth the systematic bias against returning fighters.
An aggressive accuracy-peak MOV. A higher finish multiplier (×2.00) with a lower split multiplier (×0.50) earned 0.06 percentage points more accuracy than what we ship. It hurt log-loss and Brier, so the implied probabilities got worse even though the binary picks edged up. We chose the calibrated peak instead.
What it can't see
Strength of schedule beyond opponent Elo. Elo compresses everything about an opponent into one number. A grinder who beats other wrestlers but loses to strikers ends up with a single rating that blends both records, so the model cannot tell that he is a good stylistic match for one group and a bad one for the other. Head-to-head style data is not in the model.
Weight class transitions. A fighter moving up two weight classes carries their old Elo with them. There is no penalty for the size disadvantage. These cases are rare enough that we have not corrected for them.
Draws and no-contests. Both get scored 0.5 because the source data does not distinguish them. Rare results, small impact.
10-8 and 10-7 rounds. The rounds-won formula assumes every round is scored 10-9. A 10-8 round costs the fighter who lost it one extra point (a 10-7 costs two), so the formula under-counts that fighter's rounds won. On a lopsided card it can push the raw value below zero, which is why it is clamped to 0. Because the extra points come off the side that lost a round badly, the qualitative ordering is preserved and the practical impact is small. We accept the approximation rather than try to reverse-engineer round-by-round scores from the totals.
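A three-round card with one 10-8 round shows the size of the error. The round scores here are made up for illustration. The fighter who lost the bout took round 1 10-9 but dropped round 2 10-8. The decoded count loses that round entirely, while the winner's count is exact.

```python
def rounds_won(score: int, scored_rounds: int) -> int:
    """Rounds won on one card, assuming every round is scored 10-9."""
    return max(0, score - 9 * scored_rounds)


# True per-round scores as (loser, winner):
#   R1 loser wins 10-9, R2 winner wins 10-8, R3 winner wins 10-9.
rounds = [(10, 9), (8, 10), (9, 10)]
loser_total = sum(lo for lo, _ in rounds)   # 27
winner_total = sum(hi for _, hi in rounds)  # 29
print(rounds_won(loser_total, 3), rounds_won(winner_total, 3))  # 0 2  (true: 1 and 2)
```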
The next layer (Elo combined with rolling form, age, and a handful of other features in a gradient-boosted classifier) closes about two-thirds of the remaining gap to the betting market. That is described on the predictor methodology page.