Predicting English Premier League Games Using Transfermarkt Market Values

Intro

For good or bad, the world of modern Football (Soccer) is driven by money.

If I can quote Soccer Politics:

If Football is a religion, then to many, money is god. Each week, millions of dollars change hands between clubs and players, fans and ticket offices, and sponsors and clubs. Professional Football does indeed live up to the capitalist ideal; that if there is money to be made, someone will find a way to make it.

In this article, I’ll demonstrate that the multi-million transfers and wages are not just part of a marketing strategy for selling T-Shirts in South America.

The market value of a football team has a genuine correlation with the actual team performance.

To support that claim, I’ll train a straightforward Ordinal Logistic Regression model based only on the Transfermarkt market values of the home and away teams in the English Premier League.

Then, I’ll use the model to predict the outcomes of another set of games.

In the end, I’ll compare the results with the bookies’ predictions, namely the odds provided by Bet365.

To me, it was pretty fascinating to find out how close we can get to the bookies’ predictions by solely using the Transfermarkt data.

This serves as evidence of the impact money has in the world of Football.

Note that in this article, I will mainly present the research outcomes with the corresponding visualizations. If you’re interested in the inner details and the code itself, you can run and inspect the Jupyter Notebook in Google Colab here.

Without further ado, let’s start with the analysis.

This article is based on the materials in the Coursera course – Prediction Models with Sports Data.

Preparing the Data

I will use data for the 2019/2020 season.

The dataset contains two main parts:

  1. The games data. This consists of the home and away team, the final result, the game date, and the Bet365 Home/Draw/Away odds. I originally took this data from football-data.co.uk, but in the Colab notebook, it’s directly accessible via a public link to my Google Drive.
  2. The Transfermarkt market values for the teams for the 2019/2020 season. I took that from Transfermarkt and also uploaded it to Google Drive.

Let’s have a look at the games dataset:


The original dataset has some more fields, but I’ve left only the ones we’re interested in.

Next, let’s look at the Transfermarkt data:

The TM Value column contains the market value in millions of euros. So, for example, Man City has a total market value of €1.04bn.

Next, let’s merge the two data sets, and for every game, attach the Transfermarkt values for the home and away team:
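
The merge boils down to looking up each team's market value twice per game, once for the home side and once for the away side. Here is a minimal sketch in plain Python, assuming the games are rows with hypothetical `home` and `away` keys; `tm_h` and `tm_a` are the column names used later in this article. Only Man City's value comes from the data above, while "Team B" and its value are made up for illustration.

```python
# Market values in millions of euros. Only Man City's 1040 comes from the
# article; "Team B" and its value are hypothetical.
tm_values = {"Man City": 1040.0, "Team B": 200.0}

# Hypothetical game rows; "home" and "away" are assumed key names.
games = [
    {"home": "Man City", "away": "Team B", "ftr": "H"},
    {"home": "Team B", "away": "Man City", "ftr": "A"},
]


def attach_market_values(games, tm_values):
    """Return new game rows with tm_h and tm_a looked up by team name."""
    merged = []
    for game in games:
        row = dict(game)  # copy so the input rows stay untouched
        row["tm_h"] = tm_values[game["home"]]
        row["tm_a"] = tm_values[game["away"]]
        merged.append(row)
    return merged


merged = attach_market_values(games, tm_values)
print(merged[0])
```

With pandas, the same result would typically come from two merges on the team name, one per side.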

Note that I have also transformed the Bet365 odds into probabilities. To do so, I haven’t just used the formula (1/decimal_odds) because that would lead to the home/draw/away probabilities summing up to more than 1. That’s due to the bookies’ overround, also known as the vig or the profit margin. So I had to do some further normalization, which is also not too complex; you can check the notebook for the details.
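
One common way to strip the overround is proportional normalization: invert each decimal odd, then divide by the sum of the three inverses. The sketch below assumes that method, which may differ from what the notebook does. The function name and the sample odds are made up for illustration.

```python
def odds_to_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to probabilities that sum to 1.

    Assumes simple proportional normalization of the inverse odds.
    """
    inverse = [1 / home_odds, 1 / draw_odds, 1 / away_odds]
    overround = sum(inverse)  # greater than 1 when a margin is built in
    return [p / overround for p in inverse]


# Illustrative odds only, not taken from the dataset.
probs = odds_to_probabilities(2.0, 3.5, 4.0)
print([round(p, 3) for p in probs], round(sum(probs), 3))
```

For these sample odds the raw inverses add up to about 1.036, so roughly 3.6% is the bookmaker's margin that the normalization removes.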

The actual variable that we’ll use to predict the outcomes of the games is the ratio between tm_h and tm_a. More specifically, the logarithm of that ratio.

The reason for taking the log pertains to the distribution of the TM home/away ratio, which is right-skewed. Applying the log function yields a more symmetric, roughly bell-shaped distribution, which improves the model’s performance.
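
A small sketch of why the log helps, assuming the natural logarithm (the base is my assumption here). A home team worth four times the away team and the reverse case land at the same distance from zero, with opposite signs, whereas the raw ratios would be 4 and 0.25. Only the 1040 figure comes from the data above; the 260 is made up.

```python
import math


def log_tm_ratio(tm_h, tm_a):
    """Natural log of the home/away market value ratio."""
    return math.log(tm_h / tm_a)


# Values in millions of euros; 260 is a made-up figure for illustration.
print(log_tm_ratio(1040, 260))  # home team worth 4x the away team
print(log_tm_ratio(260, 1040))  # same gap reversed: same size, opposite sign
print(log_tm_ratio(500, 500))   # evenly matched teams map to 0
```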

Here’s the resulting column:

So, based on the log_tm_ratio, we’ll predict the outcomes of the games. In other words, this is our predictor (independent) variable.

Let’s encode the ftr column into a classification label with three values – 0 (for a home win), 1 (for a draw), and 2 (for an away win).
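
The encoding is a plain lookup. This sketch assumes ftr holds the football-data.co.uk result codes H, D, and A; the helper name is made up.

```python
# Assumes ftr uses the codes H (home win), D (draw), A (away win).
FTR_TO_WIN_VALUE = {"H": 0, "D": 1, "A": 2}


def encode_ftr(ftr):
    """Map a full-time result code to the ordinal label 0, 1, or 2."""
    return FTR_TO_WIN_VALUE[ftr]


print([encode_ftr(r) for r in ["H", "D", "A", "H"]])  # [0, 1, 2, 0]
```

The order matters for an ordinal model: the draw sits between the home win and the away win.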

Here’s the resulting win_value column:

Training the Model

Here’s the piece of code that extracts the train data (first 200 games), trains the Ordinal Logistic Regression model, and prints out the learned coefficients:

If you’re not familiar with Ordinal Logistic Regression (also called Ordered Logistic Regression) and you’d like to dig in, one option is to review Week 3 of the Prediction Models with Sports Data course.

Now, based on the regression coefficients, I will calculate the predicted home/draw/away probabilities for every game.
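
To show how coefficients turn into probabilities, here is a minimal sketch of the proportional-odds form of the model. It assumes the parameterization P(y <= k | x) = sigmoid(cut_k - beta * x), which is one common convention; libraries differ in sign conventions, so check the one you use. The coefficient values below are made up and are not the ones learned in the notebook.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def ordered_logit_probs(x, beta, cut_home_draw, cut_draw_away):
    """Return (P(home), P(draw), P(away)) for outcomes coded 0, 1, 2.

    Assumes P(y <= k | x) = sigmoid(cut_k - beta * x)
    with cut_home_draw < cut_draw_away.
    """
    p_le_home = sigmoid(cut_home_draw - beta * x)  # P(y <= 0)
    p_le_draw = sigmoid(cut_draw_away - beta * x)  # P(y <= 1)
    return p_le_home, p_le_draw - p_le_home, 1 - p_le_draw


# Made-up coefficients for illustration only.
beta, c1, c2 = -0.8, -0.2, 0.9
for x in (-1.0, 0.0, 1.0):
    print(x, [round(p, 3) for p in ordered_logit_probs(x, beta, c1, c2)])
```

Under this convention a negative beta means that a larger log_tm_ratio (a relatively richer home team) shifts probability toward the home win.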

Let’s add a couple of columns signifying whether our prediction and the Bet365 prediction are correct. Each is calculated by taking the outcome with the highest probability and comparing it to ftr.
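
A minimal sketch of that check, assuming ftr uses the codes H, D, and A. The helper names, probabilities, and results are made up for illustration.

```python
OUTCOMES = ("H", "D", "A")


def predicted_outcome(p_home, p_draw, p_away):
    """Return the outcome code with the highest probability."""
    probs = (p_home, p_draw, p_away)
    return OUTCOMES[probs.index(max(probs))]


def is_correct(probs, ftr):
    """True when the most likely outcome matches the actual result."""
    return predicted_outcome(*probs) == ftr


# Illustrative probabilities and results only.
rows = [((0.55, 0.25, 0.20), "H"), ((0.30, 0.28, 0.42), "D")]
hits = [is_correct(p, ftr) for p, ftr in rows]
print(hits, sum(hits) / len(hits))  # [True, False] 0.5
```

Averaging these True/False flags gives the share of games predicted correctly.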

Evaluate the Model and Compare with the Bookies

Remember that we’ve trained the model on the first 200 games. To evaluate the model and see how it performs on some new data, let’s use the rest of the games as a test set:

Have a closer look at the highlighted cells. These are the mean values of tm_true and b365_true. They show the proportion of games that our model and Bet365 got right.

You can see how close they are. Both values are around 0.51, which suggests that our model’s predictive power on this test set is very close to that of the bookies.

This is quite fascinating!

Note: The training data set I’ve used is too small (only 200 records) and may not be super stable statistically. I plan to create a much bigger experiment with much more data using various more advanced ML models. Still, I find the results from this research quite intriguing.

Let’s also have a quick reality check. I can hear you saying that a 0.51 success rate in predicting football games is quite poor. Do you think you can do better? Remember, the whole business of the bookmakers depends on getting these probabilities as accurate as possible; this is what keeps them in business. The truth is that football is quite unpredictable. After all, this is what makes it so exciting and special! There’s an army of domain experts and statisticians squeezing out every little detail that may affect a game’s outcome. If you want to try to beat them at their own game, go ahead. They will welcome you with open arms!

Now, out of curiosity, let’s cross-check the number and types of games our model and Bet365 got right or wrong:

You can see the numbers are very similar. For example, our model correctly predicted 28 away wins, while Bet365 got one more right: 29. On the other hand, we correctly predicted 65 home wins, while Bet365 did so in 63 cases.

One interesting observation is that neither our model nor Bet365 predicted a draw for any of the games. In other words, there wasn’t a single game with the draw being the outcome with the highest probability. For everyone familiar with the betting markets, this makes a lot of sense. I can’t remember the draw having the lowest decimal odds except in extremely rare (shady) cases.

Brier Score

The way we’ve evaluated the model so far is not ideal. We only take the outcome with the highest probability and check whether it was the actual outcome. This way, a probability of 0.6 is treated the same as a probability of 0.9, for example.

A better evaluation metric would be something that measures how far we were from the correct prediction. So for a successfully predicted outcome, our model would be penalized much more if the probability was 0.6 rather than 0.9.

This is where we reach out for the Brier Score. It incorporates the difference between the actual value and the predicted probability.

Concretely, for each game it sums the squared differences between the predicted probabilities and the actual outcome (1 for the outcome that happened, 0 for the other two), and then averages that sum over all games.

I encourage you to look it up online or inspect the code snippet below:

A lower Brier Score means a better model. In that sense, the Bet365 predictions are slightly better than ours as they have a lower Brier Score: 0.570 vs. 0.586.
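
For reference, here is a minimal sketch of the multi-class Brier Score, assuming the per-game sums are averaged over all games (consistent with scores in the 0.57 to 0.59 range). The function name and the sample probabilities are made up; outcomes use the 0/1/2 encoding from earlier.

```python
def brier_score(prob_rows, outcomes):
    """Multi-class Brier score: mean over games of the summed squared error.

    prob_rows: list of (p_home, p_draw, p_away)
    outcomes:  list of labels, 0 (home win), 1 (draw), 2 (away win)
    """
    total = 0.0
    for probs, outcome in zip(prob_rows, outcomes):
        for k, p in enumerate(probs):
            actual = 1.0 if k == outcome else 0.0
            total += (p - actual) ** 2
    return total / len(outcomes)


# A confident correct forecast scores lower (better) than a hesitant one.
print(brier_score([(0.9, 0.05, 0.05)], [0]))
print(brier_score([(0.6, 0.2, 0.2)], [0]))
```

Under this definition, always predicting 1/3 for each outcome scores about 0.667, so that is the baseline any useful model should beat.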

Summary

This article was a concrete demonstration of the role money plays in modern Football.

You’ve seen how a simple ML model, based only on the market values of the teams, can be a pretty good estimator for future game outcomes.

I hope you found this helpful.

See you next time!

Resources

  1. Prediction Models with Sports Data
