Soccer Predictions with the Dixon-Coles Model

Soccer is the most popular sport in the world, so it has long attracted data enthusiasts and analysts. Predicting match outcomes matters to fans and is central to the sports betting industry. One widely used statistical approach is the Dixon-Coles model. In this post, we walk through the model and discuss our implementation, which produces probabilities for match results (home win, draw, away win) and for goal counts.

What is the Dixon-Coles Model?

The Dixon-Coles model, introduced by Mark Dixon and Stuart Coles in 1997, is a statistical method for modeling football (soccer) scores. It quantifies the attacking and defensive strengths of each team and adds a correction for low-scoring results, which a plain Poisson model handles poorly.

The Dixon-Coles model gained prominence because it addresses a limitation of the simpler independent Poisson model for soccer scores. That simpler model misjudges the four lowest scorelines: it tends to under-predict the 0-0 and 1-1 draws and over-predict 1-0 and 0-1 results. Dixon and Coles add a correction term that rescales exactly those four scorelines and leaves every other score untouched.
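For reference, the low-score adjustment in the 1997 paper takes the following form. Here x and y are the home and away goals, λ and μ are their expected values, and ρ is the dependence parameter; these symbols are introduced only for this formula and are not used elsewhere in the post.

```latex
P(X = x, Y = y) = \tau_{\lambda,\mu}(x, y)\,
  \frac{\lambda^{x} e^{-\lambda}}{x!}\,
  \frac{\mu^{y} e^{-\mu}}{y!},
\qquad
\tau_{\lambda,\mu}(x, y) =
\begin{cases}
1 - \lambda\mu\rho & \text{if } x = 0,\ y = 0,\\
1 + \lambda\rho    & \text{if } x = 0,\ y = 1,\\
1 + \mu\rho        & \text{if } x = 1,\ y = 0,\\
1 - \rho           & \text{if } x = 1,\ y = 1,\\
1                  & \text{otherwise.}
\end{cases}
```

Setting ρ = 0 recovers two independent Poisson distributions. Only the scorelines 0-0, 0-1, 1-0 and 1-1 are rescaled: a negative ρ raises the 0-0 and 1-1 probabilities and lowers those of 1-0 and 0-1.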

Core Principles

Team Strengths: At its heart, the model calculates an attack and defense strength for each team based on their past performance. A team that frequently scores will have a high attacking strength, while a team that rarely concedes will have a high defensive strength.
Match Outcome Prediction: Using these calculated strengths, the model then predicts the expected number of goals a team will score in a particular match. By comparing the expected goals of both teams, one can predict the outcome of the match.
Adjustment for Low Scoring Games: As mentioned, the model includes a correction for the scorelines that a plain Poisson model most often gets wrong (0-0, 1-0, 0-1, and 1-1). This matters in soccer, where such low-scoring results are common.
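As a rough illustration of how team strengths turn into expected goals, the sketch below uses a log-linear parameterisation. The names `attack`, `defence`, `home_advantage` and `expected_goals`, the sign convention (a higher `defence` value means fewer goals conceded), and all numeric values are assumptions made for this example; they are not taken from our fitted model.

```python
import math

# Hypothetical log-scale strengths; the values are illustrative only.
attack = {"Team A": 0.35, "Team B": -0.10}
defence = {"Team A": 0.20, "Team B": -0.05}  # higher = concedes fewer goals
home_advantage = 0.25  # assumed constant boost for the home side


def expected_goals(home, away):
    """Return (home_rate, away_rate): expected goals for each side."""
    # A strong attack raises a team's rate; a strong opposing defence lowers it.
    home_rate = math.exp(home_advantage + attack[home] - defence[away])
    away_rate = math.exp(attack[away] - defence[home])
    return home_rate, away_rate


home_rate, away_rate = expected_goals("Team A", "Team B")
print(f"Team A (home): {home_rate:.2f}, Team B (away): {away_rate:.2f}")
```

Working on the log scale keeps both rates positive whatever the strength values are, which is why this form is a common choice for Poisson-type models.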

Implementing the Dixon-Coles Model

Here’s a step-by-step breakdown of our implementation:

Gathering Data: We collected historical match data, including teams, goals scored, and other relevant metrics.
Calculating Team Strengths: Using this data, we calculated attack and defense strengths for each team. These are dynamic and change as more data becomes available.
Predicting Match Outcomes: For a given match, the model calculates the expected goals for both teams. Using the Poisson distribution and the expected goals, we can then determine the probability of various match outcomes.
Generating Probabilities: We then sum the scoreline probabilities implied by the expected goals to get the probability of each match result:
- Home team win
- Away team win
- Draw

Additionally, because the model builds on the Poisson distribution, we can compute the probability of specific scorelines, such as 2-1 or 3-0.
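The steps above can be sketched roughly as follows. This is a minimal illustration rather than our actual code: the function names, the expected-goal values (1.6 and 1.1), `rho = -0.1` and the 10-goal cap are all assumptions chosen for the example.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals for a Poisson-distributed count."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def tau(x, y, lam, mu, rho):
    """Dixon-Coles correction factor for the four lowest scorelines."""
    if x == 0 and y == 0:
        return 1 - lam * mu * rho
    if x == 0 and y == 1:
        return 1 + lam * rho
    if x == 1 and y == 0:
        return 1 + mu * rho
    if x == 1 and y == 1:
        return 1 - rho
    return 1.0


def score_matrix(lam, mu, rho, max_goals=10):
    """matrix[x][y] = P(home scores x, away scores y), capped at max_goals."""
    matrix = [
        [tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
         for y in range(max_goals + 1)]
        for x in range(max_goals + 1)
    ]
    total = sum(map(sum, matrix))  # renormalise after cutting off the tail
    return [[p / total for p in row] for row in matrix]


def outcome_probabilities(matrix):
    """Collapse the score matrix into home win / draw / away win."""
    n = len(matrix)
    home = sum(matrix[x][y] for x in range(n) for y in range(n) if x > y)
    draw = sum(matrix[x][x] for x in range(n))
    away = sum(matrix[x][y] for x in range(n) for y in range(n) if x < y)
    return {"home": home, "draw": draw, "away": away}


# Illustrative inputs only
matrix = score_matrix(lam=1.6, mu=1.1, rho=-0.1)
probs = outcome_probabilities(matrix)
print(probs)
print(f"P(2-1) = {matrix[2][1]:.3f}")
```

Building the full score matrix first means every derived quantity (match result, exact score, total goals) comes from the same set of probabilities, so they stay mutually consistent.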

Results & Insights

By implementing the Dixon-Coles model, we achieved a deeper understanding of team strengths and match dynamics. The model’s probabilistic nature also allowed us to account for soccer’s inherent unpredictability.

Measuring the 2022-2023 English Premier League season

# Calculate the model win rate: the share of matches where the prediction was correct
df['Model Result'].sum() / df.shape[0]
0.5914285714285714

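The model win rate, meaning the share of matches in which the model's predicted result was correct, comes out at about 59.1%.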

# Count the occurrences of each winning team
# (note: no filter on 'Model Result' is applied here, so this counts every row in df)
winning_team_counts = df['WinningTeam'].value_counts()

# Convert the Series to a DataFrame for better display
results_table = winning_team_counts.reset_index()
results_table.columns = ['Team', 'Wins with ModelResult=1']
results_table

	Team	Wins with ModelResult=1
0	Manchester City	26
1	Arsenal	23
2	Manchester United	22
3	Liverpool	19
4	Newcastle	18
5	Aston Villa	17
6	Brighton	16
7	Tottenham	16
8	Brentford	14
9	Fulham	14

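import pandas as pd

def win_rate_for_team(team_name, dataframe):
    """Share of correct predictions among matches involving team_name (home or away)."""
    team_games = dataframe[(dataframe['HomeTeam'] == team_name) | (dataframe['AwayTeam'] == team_name)]
    return team_games['Model Result'].sum() / team_games.shape[0]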

# Get a list of unique teams
unique_teams = df['HomeTeam'].unique()

# Calculate win rates for all teams
win_rates = {team: win_rate_for_team(team, df) for team in unique_teams}

# Convert to dataframe
win_rate_df = pd.DataFrame(list(win_rates.items()), columns=['Team', 'Win Rate'])
win_rate_df

	Team	Win Rate
0	Southampton	0.628571
1	Brentford	0.400000
2	Brighton	0.542857
3	Chelsea	0.600000
4	Liverpool	0.571429
5	Manchester City	0.742857
6	Arsenal	0.685714
7	Aston Villa	0.485714
8	Wolves	0.514286
9	Nottingham Forest	0.485714

# Calculate average implied probability (in percent) for each prediction type
avg_implied_prob = df.groupby('Prediction')['ImpProbability'].mean()
avg_implied_prob

Prediction	ImpProbability
away	56.707266
draw	44.728571
home	60.610735

# Group by 'Prediction' and calculate win rate
win_rate_by_side = df.groupby('Prediction')['Model Result'].agg(lambda s: s.sum() / len(s)).reset_index()
win_rate_by_side.columns = ['Prediction', 'WinRate']
win_rate_by_side

	Prediction	WinRate
0	away	0.517986
1	draw	0.285714
2	home	0.651961

import statsmodels.stats.proportion as proportion

def analyze_predictions(df, prediction_type):
    # Filter data based on the provided prediction_type ('home' or 'away')
    predicted_games = df[df['Prediction'] == prediction_type]

    # Separate correct and incorrect predictions
    correct_preds = predicted_games[predicted_games['Model Result'] == 1.0]
    incorrect_preds = predicted_games[predicted_games['Model Result'] == 0.0]

    # Calculate the win rate
    win_rate = len(correct_preds) / len(predicted_games)

    # Confidence interval
    conf_interval = proportion.proportion_confint(len(correct_preds), len(predicted_games), alpha=0.05, method='normal')

    # Average implied probability for correct and incorrect predictions
    avg_correct_imp = correct_preds[f'{prediction_type.capitalize()}Imp'].mean() / 100
    avg_incorrect_imp = incorrect_preds[f'{prediction_type.capitalize()}Imp'].mean() / 100

    return {
        "Win Rate": f"{win_rate:.2%}",
        f"Average {prediction_type.capitalize()} Implied Probability for Correct Predictions": f"{avg_correct_imp:.2%}",
        f"Average {prediction_type.capitalize()} Implied Probability for Incorrect Predictions": f"{avg_incorrect_imp:.2%}",
        "95% Confidence Interval for Win Rate": (f"{conf_interval[0]:.2%}", f"{conf_interval[1]:.2%}")
    }

# For home predictions
home_analysis = analyze_predictions(df, 'home')
for key, value in home_analysis.items():
    print(f"{key}: {value}")

# For away predictions
print("\n")
away_analysis = analyze_predictions(df, 'away')
for key, value in away_analysis.items():
    print(f"{key}: {value}")

Win Rate: 65.20%
Average Home Implied Probability for Correct Predictions: 63.62%
Average Home Implied Probability for Incorrect Predictions: 54.98%
95% Confidence Interval for Win Rate: ('58.66%', '71.73%')

Win Rate: 51.80%
Average Away Implied Probability for Correct Predictions: 58.55%
Average Away Implied Probability for Incorrect Predictions: 54.72%
95% Confidence Interval for Win Rate: ('43.49%', '60.11%')
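
The intervals above come from the normal (Wald) approximation, which is what `method='normal'` selects in `proportion_confint`. A hand-rolled version is sketched below as a sanity check. The helper name `normal_approx_ci` is made up for this example, and the counts (133 correct out of 204 home predictions) are inferred from the reported 65.20% win rate and its interval; they are not printed anywhere in the output, so treat them as an assumption.

```python
import math


def normal_approx_ci(successes, n, z=1.959964):
    """Wald interval: p +/- z * sqrt(p * (1 - p) / n), with z for 95% coverage."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Assumed counts (inferred, not given in the post): 133 correct of 204 home picks
low, high = normal_approx_ci(133, 204)
print(f"{low:.2%} to {high:.2%}")
```

With these counts the result agrees with the reported home interval to two decimal places. The width of the interval is a reminder that a few hundred matches leave a lot of uncertainty around the measured win rate.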