Training an accurate xG model requires defining features that faithfully represent the relevant characteristics of the shot’s context. These include properties of
- the shot itself, such as the location from which the shot was taken and the body part used,
- the actions that precede the shot such as whether the shot follows from a corner or fast counter-attack, and
- the space given to the shot taker.
Many blog posts have discussed how to model the first two contexts.1 This blog post, in contrast, explores how to exploit the extra information available in StatsBomb’s freeze frame data to address the following question: How should you encode the position of the defenders and the goalkeeper as features in an xG model?
What is the freeze frame?
Modelling the proximity of defenders has received less attention due to the limited availability of tracking data and the fact that event stream data only includes the location of the player who is executing the shot. Nowadays, this type of information is becoming more prevalent in soccer. Tracking data is slowly becoming more widely available and the event data providers are constantly adding more contextual data to their event streams. StatsBomb, for example, is providing an incredible level of detail with the freeze frame data, which incorporates the position of every player around the ball and the goalkeeper for shot events.
The figure below illustrates what this freeze frame looks like for the goal of Messi against Athletic Bilbao on 27-04-2013. The blue dots represent players from Barcelona and red dots are players from Athletic. Messi is shown in orange. Despite being surrounded by three players, he manages to get a shot off that beats the keeper to the far post.
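To make the data concrete, here is a minimal sketch of how a freeze frame could be split into shooter, defenders and goalkeeper. The field names (`location`, `shot.freeze_frame`, `teammate`, `position.name`) follow our understanding of StatsBomb's open-data event layout and should be checked against the data specification; `sample_shot` is a hand-written, hypothetical event with made-up coordinates, and `split_freeze_frame` is a helper name introduced only for this illustration.

```python
# Hypothetical shot event that mimics the assumed layout of a StatsBomb
# shot event. The coordinates are invented for illustration.
sample_shot = {
    "location": [108.0, 36.0],
    "shot": {
        "freeze_frame": [
            {"location": [110.0, 38.0], "teammate": False,
             "position": {"name": "Center Back"}},
            {"location": [119.0, 40.5], "teammate": False,
             "position": {"name": "Goalkeeper"}},
            {"location": [105.0, 30.0], "teammate": True,
             "position": {"name": "Center Forward"}},
        ]
    },
}


def split_freeze_frame(event):
    """Return (shooter, defenders, goalkeeper) as (x, y) tuples.

    Teammates of the shooter are skipped; the goalkeeper is None when no
    opposing goalkeeper appears in the frame.
    """
    shooter = tuple(event["location"])
    defenders = []
    goalkeeper = None
    for player in event["shot"].get("freeze_frame", []):
        if player["teammate"]:
            continue  # only opponents can block the shot
        xy = tuple(player["location"])
        if player["position"]["name"] == "Goalkeeper":
            goalkeeper = xy
        else:
            defenders.append(xy)
    return shooter, defenders, goalkeeper


shooter, defenders, goalkeeper = split_freeze_frame(sample_shot)
```

Keeping the goalkeeper separate from the outfield defenders is a deliberate choice here, since the features discussed in the next section treat the two differently.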
How should you translate the freeze frame into a set of features?
xG models in the literature have tried to model the effect of defenders using features like the distance to the closest defender, or the number of defenders between the shooter and the goal.2 In this blog post, we introduce a free projection feature. This new feature is motivated by the fact that features like the number of defenders between the shooter and the goal are really a proxy for the following important variable: How does the positioning of the defenders affect the proportion of the goal that is available to the shooter?
For example, in the figure, the shooter takes a shot from position A, there is one defender at position B and the goalkeeper is at position C. To capture the fact that players are not static and will react to the shot, we assume that the defending player has an effective span of one arm length (80 cm) and the goalkeeper has an effective span of two arm lengths (160 cm). When their spans are projected onto the goal line, the part of the goal that remains uncovered, denoted by the white bracket, is the free projection of this shot. It gives an indication of how much of the goal is left open by the defenders.
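The projection idea can be sketched in a few lines of Python. This is our own simplified reading, not the implementation from the thesis, and it rests on several assumptions: pitch coordinates are in yards on a 120 x 80 grid with the attacked goal on the line x = 120 and posts at y = 36 and y = 44; each player's effective span is a total width centred on the player and parallel to the goal line; and the feature is reported as the uncovered fraction of the goal width. The names `blocked_interval` and `free_projection` are hypothetical.

```python
METRES_PER_YARD = 0.9144
# Assumption: coordinates in yards, goal on x = 120, posts at y = 36 and 44.
GOAL_X, POST_LO, POST_HI = 120.0, 36.0, 44.0
# Assumption: the effective span is a total width centred on the player.
DEFENDER_SPAN = 0.8 / METRES_PER_YARD  # one arm length (80 cm)
KEEPER_SPAN = 1.6 / METRES_PER_YARD    # two arm lengths (160 cm)


def blocked_interval(shooter, player, span):
    """Project a player's span from the shooter onto the goal line.

    Returns the covered (low, high) y-interval clipped to the goal mouth,
    or None if the player covers no part of the goal.
    """
    ax, ay = shooter
    px, py = player
    if not ax < px <= GOAL_X:
        return None  # player is not between the shooter and the goal line
    scale = (GOAL_X - ax) / (px - ax)  # similar triangles
    lo = ay + (py - span / 2 - ay) * scale
    hi = ay + (py + span / 2 - ay) * scale
    lo, hi = max(lo, POST_LO), min(hi, POST_HI)
    return (lo, hi) if lo < hi else None


def free_projection(shooter, defenders, goalkeeper,
                    defender_span=DEFENDER_SPAN, keeper_span=KEEPER_SPAN):
    """Fraction of the goal width not covered by any projected span."""
    intervals = [blocked_interval(shooter, d, defender_span) for d in defenders]
    if goalkeeper is not None:
        intervals.append(blocked_interval(shooter, goalkeeper, keeper_span))
    covered, reach = 0.0, POST_LO
    # Sweep the sorted intervals so that overlaps are counted only once.
    for lo, hi in sorted(i for i in intervals if i is not None):
        lo = max(lo, reach)
        if hi > lo:
            covered += hi - lo
            reach = hi
    return 1.0 - covered / (POST_HI - POST_LO)
```

As a quick check with round numbers, `free_projection((100, 40), [], (110, 40), keeper_span=2.0)` returns 0.5: a goalkeeper halfway between the shooter and the goal with a span of 2 units shadows 4 of the 8 units of goal width.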
Apart from this free projection feature, we generate several other features that describe the positions of defenders and the goalkeeper:
- Closest defender: the distance between the shot-taker and the closest defender in the area between the shot and the goal
- Number of defenders: the number of defenders in the area between the shot and the goal
- Goalkeeper x: the goalkeeper’s location on the x (horizontal) axis of the pitch
- Goalkeeper y: the goalkeeper’s location on the y (vertical) axis of the pitch
- Distance to goalkeeper: the distance between the shot taker and the goalkeeper
- Under pressure: whether the shot is made under pressure (using StatsBomb’s pressure events)
- One on one: whether the shot is made one on one with the goalkeeper
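The geometric features in this list can be sketched as follows. We assume that "the area between the shot and the goal" is the triangle spanned by the shooter and the two goalposts, and that the posts sit at (120, 36) and (120, 44) in the pitch coordinate system; both are assumptions on our part. The function names are hypothetical, and the under pressure and one on one flags are left out because they depend on definitions not spelled out here.

```python
import math

# Assumption: posts at (120, 36) and (120, 44) on a 120 x 80 pitch.
POST_A, POST_B = (120.0, 36.0), (120.0, 44.0)


def _cross(o, a, b):
    """z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_shot_triangle(p, shooter, post_a=POST_A, post_b=POST_B):
    """True if p lies inside or on the triangle shooter-post_a-post_b."""
    d1 = _cross(shooter, post_a, p)
    d2 = _cross(post_a, post_b, p)
    d3 = _cross(post_b, shooter, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)  # all signs agree -> inside


def defender_features(shooter, defenders, goalkeeper):
    """Positional features for one shot; missing values are None."""
    blockers = [d for d in defenders if in_shot_triangle(d, shooter)]
    closest = min((math.dist(shooter, d) for d in blockers), default=None)
    has_gk = goalkeeper is not None
    return {
        "closest_defender": closest,
        "n_defenders": len(blockers),
        "goalkeeper_x": goalkeeper[0] if has_gk else None,
        "goalkeeper_y": goalkeeper[1] if has_gk else None,
        "dist_to_goalkeeper": math.dist(shooter, goalkeeper) if has_gk else None,
    }
```

Returning `None` for missing values leaves the choice of how to encode them to the model pipeline.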
How does including these features affect the xG model’s performance?
To illustrate the value of these features, we trained two XGBoost models with 10-fold cross-validation on the Messi Data Biography dataset: one including the freeze frame features and one excluding them. We excluded goals from penalties and free kicks. The results are shown in the table below. Not surprisingly, adding the freeze frame features improves the xG model: the Brier score drops and the AUROC rises.
Model | Brier | AUROC |
---|---|---|
XGBoost without freeze frame features | 0.0876 ± 0.0139 | 0.7824 ± 0.0150 |
XGBoost with freeze frame features | 0.0818 ± 0.0132 | 0.8054 ± 0.0178 |
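For readers unfamiliar with the two metrics in the table, the sketch below computes them from scratch in plain Python. It is meant to clarify what the numbers measure, not to reproduce the evaluation code behind the table. The Brier score is the mean squared difference between the predicted xG and the outcome (lower is better); the AUROC is the probability that a randomly chosen goal received a higher xG than a randomly chosen non-goal (higher is better).

```python
from statistics import mean


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for y, p in zip(y_true, y_prob))


def auroc(y_true, y_prob):
    """Area under the ROC curve via pairwise comparison (ties count half).

    Quadratic in the number of shots, which is fine for an illustration
    but slow for large data sets.
    """
    goals = [p for y, p in zip(y_true, y_prob) if y == 1]
    misses = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not goals or not misses:
        raise ValueError("AUROC needs at least one goal and one non-goal")
    wins = sum(1.0 if g > m else 0.5 if g == m else 0.0
               for g in goals for m in misses)
    return wins / (len(goals) * len(misses))


# Toy example with four shots: 1 = goal, 0 = no goal.
outcomes = [1, 0, 0, 1]
xg = [0.8, 0.2, 0.6, 0.4]
print(brier_score(outcomes, xg))  # approximately 0.2
print(auroc(outcomes, xg))        # 0.75
```

In a cross-validation setting, these scores would be computed on each held-out fold and then summarised, for instance as a mean with a spread across folds.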
XGBoost can also provide us with feature importance scores. Below we see that the most important features are a combination of features that describe the characteristics of the shot itself (i.e., distance to goal, header and angle to goal) and freeze frame features (i.e., distance to goalkeeper, goalkeeper x location, free projection), which underlines the importance of defensive positioning when assessing the goal expectancy of a shot.
Conclusion
The position of defenders and the goalkeeper has long been ignored in xG models due to the lack of data. In this blog post, we showed that this extra information improves xG models and that StatsBomb enabled a huge leap by putting great effort into accurately providing player locations for shots. However, encoding this new data into features is not trivial. Adding to the existing research, we proposed a free projection feature, which measures how the positioning of the defenders affects the proportion of the goal that is available to the shooter.
This blog post is the first part of a summary of Anıl Cem Arslan’s master’s thesis. Data provided by StatsBomb.
Footnotes
1. We briefly discussed this in an earlier blog post. Others who discussed this are Michael Caley, Sander Ijtsma, American Soccer Analysis,…
2. This is how Lucey et al. did it in their 2015 MIT Sloan paper “Quality vs Quantity: Improved Shot Prediction in Soccer using Strategic Features from Spatiotemporal Data” using tracking data. Jernej Flisar used the number of players between the shot and goal to build a model with StatsBomb’s freeze frame.