Illustrating the interplay between features and models in xG

Jesse Davis, Pieter Robberechts
May 14th, 2020 · 5 min read

Learning an accurate expected goals (xG) model requires coming up with a good set of features to describe the shot attempt. Among the most important and predictive features is the shot’s location. The more practically relevant question, however, is: how should one encode this location?

This seems like a simple question, but in reality there is a huge range of ways to represent a location. Moreover, the chosen representation can have a substantial impact on the quality of the learned model. Below, we illustrate eight of the many possible options for encoding a shot’s location:

1. Raw (x, y) location
2. Raw (x, y) location + Distance to the goal
3. Raw (x, y) location + Angle to the goal
4. Distance to the goal + Angle to the goal
5. Distance to the goal + Angle to the goal + Distance*Angle
6. Raw (x, y) location + Distance to the goal + Angle to the goal
7. Polar coordinates
8. Hand-crafted partitioning of the field
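
To make these encodings concrete, here is a minimal sketch of how the derived features could be computed from a raw (x, y) location. The coordinate conventions, the helper name, and the formulas are illustrative assumptions rather than the exact code behind this post; the angle formula is the commonly used angle subtended by the goal mouth.

```python
import numpy as np

GOAL_WIDTH = 7.32  # width of the goal in metres, per the Laws of the Game

def location_features(x, y):
    """Derive the location encodings above from a raw (x, y) shot location.

    Assumes coordinates in metres with the origin at the centre of the
    goal: x is the distance to the goal line, y the lateral offset.
    """
    dist = np.sqrt(x ** 2 + y ** 2)
    # Angle subtended by the goal mouth as seen from the shot location;
    # larger for central and close positions.
    angle = np.arctan2(GOAL_WIDTH * x, x ** 2 + y ** 2 - (GOAL_WIDTH / 2) ** 2)
    return {
        "x": x,
        "y": y,
        "dist": dist,
        "angle": angle,
        "dist_x_angle": dist * angle,   # interaction term (encoding 5)
        "polar_r": dist,                # polar coordinates (encoding 7)
        "polar_theta": np.arctan2(y, x),
    }
```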

We will compare the effects of these eight encodings of the location on the performance of a logistic regression xG model. We use logistic regression for two reasons. First, it is easier to tease out the effect of each feature on performance. Second, our prior xG post found that logistic regression with a simple feature set performed only slightly worse than a much more expansive feature set combined with the more expressive gradient boosted tree ensemble.

We train eight logistic regression models, one for each way of encoding a shot’s location. We also augment each feature set with the body part used for the shot, encoded as a dummy variable. We use two seasons of data from each of the top-5 European leagues and the Dutch league to train the models and report results on the 2018/2019 season of these leagues. As we have argued previously in several places, calibration is the key criterion for evaluating probabilistic predictions, so we report the Brier score. To help place the results in context, we also include the area under the ROC curve (AUROC), even though it is a ranking metric.
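
A minimal sketch of how each encoding could be evaluated. The DataFrame and column names ("bodypart" as the categorical body part, "goal" as the binary label, plus the location features from the earlier sketch) are assumptions, not the exact code behind this post.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score

def evaluate_encoding(train, test, location_cols):
    """Fit a logistic regression xG model on one location encoding plus
    a body-part dummy, and report the Brier score and AUROC."""
    X_train = pd.get_dummies(train[location_cols + ["bodypart"]])
    X_test = pd.get_dummies(test[location_cols + ["bodypart"]])
    # Align the test columns with the training columns.
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, train["goal"])
    proba = model.predict_proba(X_test)[:, 1]
    return brier_score_loss(test["goal"], proba), roc_auc_score(test["goal"], proba)

# e.g., encoding 4 (distance + angle):
# brier, auroc = evaluate_encoding(train, test, ["dist", "angle"])
```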

#  Features                                                         Brier   AUROC
1  Distance to the goal + Angle to the goal + Distance*Angle        0.0784  0.7676
2  Raw (x, y) location + Distance to the goal + Angle to the goal   0.0784  0.7678
3  Distance to the goal + Angle to the goal                         0.0785  0.7669
4  Raw (x, y) location + Angle to the goal                          0.0789  0.7645
5  Raw (x, y) location + Distance to the goal                       0.0791  0.7648
6  Polar coordinates                                                0.0793  0.7635
7  Hand-crafted partitioning of the field                           0.0811  0.7258
8  Raw (x, y) location                                              0.0837  0.7235

The different encodings of the location lead to considerable variation in performance. Simply using the raw X and Y coordinates (that is, the data as given) yields poor performance; the next section explains why this is the case. While the hand-crafted partition¹ offers some benefit over using the raw data, it is still too coarse-grained. Adding the distance to the goal offers a jump in performance by (1) capturing the interaction between the X and Y coordinates and (2) exploiting our domain knowledge to pick the center of the goal as the reference point. Including the angle to the goal on top of the distance offers another slight improvement, as it helps differentiate between equidistant positions. Finally, adding a Distance*Angle interaction term improves the model slightly further by capturing that long-range shots quickly become harder from a wider angle. The same effect is captured by adding the raw X and Y coordinates.

Why does having the “right” feature affect performance?

Features are important because the model uses the features to make distinctions among different shots. Having the right features allows the model to make subtler distinctions.

At a conceptual level, a machine-learned model is going to carve up the physical pitch into different areas on the basis of which features the analyst has selected. Each of these areas will have a probability of scoring. Each class of model (e.g., neural network, tree, logistic regression, support vector machine, etc.) will use the features in vastly different ways. Moreover, the vast majority of model classes will impose restrictions on how the pitch can be carved up.

To illustrate this effect, we trained a logistic regression model and a single probability estimation tree using two features: the X and Y coordinates of the shot. Below we visualize the output of each model; you can hover over the heatmaps to compare the exact xG values at each location. Note that these plots show the xG values of shots by foot, whereas the reported Brier and AUROC scores take headers and other body parts into account as well.
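
A rough sketch of how this comparison could be set up, using scikit-learn’s DecisionTreeClassifier, whose frequency-based leaf estimates make it act as a probability estimation tree. The DataFrame, column names, grid extent, and hyperparameters are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def fit_and_grid(train):
    """Fit both models on the raw coordinates and evaluate them on a grid
    of pitch locations, returning one xG heatmap array per model.

    `train` is an assumed DataFrame with "x" and "y" in metres (goal
    centre at the origin) and a binary "goal" label.
    """
    features = ["x", "y"]
    lr = LogisticRegression(max_iter=1000).fit(train[features], train["goal"])
    tree = DecisionTreeClassifier(
        min_samples_leaf=200,  # illustrative: keep enough shots per leaf
    ).fit(train[features], train["goal"])

    # Grid over a 40 m x 60 m area in front of the goal.
    xs, ys = np.meshgrid(np.linspace(0, 40, 81), np.linspace(-30, 30, 121))
    grid = pd.DataFrame({"x": xs.ravel(), "y": ys.ravel()})
    xg_lr = lr.predict_proba(grid)[:, 1].reshape(xs.shape)
    xg_tree = tree.predict_proba(grid)[:, 1].reshape(xs.shape)
    return xg_lr, xg_tree
```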

[Interactive heatmaps of the xG estimates of both models]

Model                         Brier   AUROC
Logistic regression           0.0837  0.7235
Probability estimation tree   0.0787  0.7638

In this case, logistic regression struggles to learn an accurate model. From an application perspective, the interaction between the X and Y coordinates is what matters. However, logistic regression assumes that the features do not interact: it models the log-odds of scoring as a weighted sum of the individual features (logit(p) = β₀ + β₁x + β₂y), so it cannot capture this interplay. That is why the feature set that only includes the distance to goal, which captures this interaction, offers better performance.

In contrast, a probability estimation tree is able to capture the interplay between the two coordinates by setting multiple thresholds on each of the X and Y coordinates. The result is that the tree learns to partition the pitch into different zones and assign a probability of scoring to each one. The simplest approach to estimating the probability of scoring within a zone is to compute the empirical frequency of goals in that zone by dividing the number of goals scored in the zone by the total number of shots taken in the zone. Common variants of this idea are to perform smoothing using an m-estimate or kernel density estimator.
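
For instance, a minimal sketch of the empirical-frequency estimate and its m-estimate variant; the values of m and the prior below are illustrative assumptions.

```python
def leaf_probability(goals, shots):
    """Empirical frequency of scoring within one zone (leaf)."""
    return goals / shots

def leaf_probability_m(goals, shots, m=10, prior=0.10):
    """m-estimate: blend the zone's goal rate with a prior probability
    (e.g., the league-wide conversion rate); m controls the strength of
    the prior. Both defaults here are illustrative assumptions."""
    return (goals + m * prior) / (shots + m)

# With only 12 shots in a zone, the raw estimate is noisy; the m-estimate
# shrinks it towards the prior.
print(leaf_probability(3, 12))    # 0.25
print(leaf_probability_m(3, 12))  # (3 + 1.0) / 22 ≈ 0.18
```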

Interestingly, the probability estimation tree using just the X and Y coordinates of the shot achieves only a slightly worse Brier score than the logistic regression model using the advanced feature set from our prior blog post. When that model is also trained on 12 seasons of data, its Brier score is 0.0783. When given access to sufficient data, a single probability estimation tree yields accurate estimates for two reasons. First, it is able to come up with a much more fine-grained partition of the pitch. Second, more data reaches each leaf node, so the estimates are simply better due to larger sample sizes. However, one issue with probability estimation trees is that they assign unreliable xG values to zones where almost no shots are taken (e.g., near the corners or far away from the goal), since there is not enough data to make a split.

Another important contrast between the two models is that the logistic regression model’s probability estimates change in a smooth, continuous manner as the X and Y coordinates are varied slightly. In the probability estimation tree model, on the other hand, the probability estimates can change abruptly between zones. Hence, a slight change in the value of one coordinate can lead to a very different prediction. Moreover, there is no fixed relationship between the estimates of neighboring zones.

To further illustrate the interplay between features and learning algorithms, we will train both a logistic regression model and probability estimation tree using a single feature: distance to goal. Below we visualize the output of each model.

[Interactive heatmaps of the xG estimates of both models]

Model                         Brier   AUROC
Logistic regression           0.0785  0.7669
Probability estimation tree   0.0791  0.7570

The logistic regression model improves its performance over using the raw X and Y coordinates. Because the distance combines the X and Y coordinates, all equidistant locations receive the same probability of scoring. Hence, the model captures symmetries that the prior model missed, such as the fact that positions slightly to the left and slightly to the right of the penalty spot should have similar scoring probabilities. Conversely, this also leads to estimates that clearly do not correspond to reality (or our intuitions). For example, the model assigns equal probabilities to a spot directly in front of the center of the goal and to a spot on the end line if both spots are equally far from the center of the goal. This highlights why the angle to the goal is an important feature to include, particularly in a logistic regression xG model. Finally, the visualization again shows the smooth manner in which the probability estimates decrease as a player gets farther from the goal.
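
To make this concrete, the hypothetical location_features helper from the first sketch assigns these two equidistant spots the same distance but very different angles:

```python
# Two locations at (almost) the same distance from the goal centre:
central = location_features(11.0, 0.0)  # straight in front, penalty-spot range
wide = location_features(0.5, 11.0)     # close to the end line

print(round(central["dist"], 1), round(wide["dist"], 1))    # 11.0 11.0
print(round(central["angle"], 2), round(wide["angle"], 2))  # ~0.64 vs ~0.03 rad
```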

In contrast, by only considering a single feature, the probability estimation tree has to partition the pitch in a very different manner than it did when it had access to the raw coordinates. With only one variable, it loses some of its power to make complex, fine-grained distinctions.

Conclusion

Building accurate models requires understanding how the considered model class uses the features to make decisions. Clever feature construction is crucial as it can help alleviate assumptions made by the model (e.g., that features do not interact) and decrease the reliance on the availability of massive data sets.

While we believe that trees are awesome, in all honesty we were slightly surprised by how well a single probability estimation tree trained on just the X and Y coordinates performed. This highlights the power of using a model class that can automatically learn feature interactions from data. However, one needs sufficient data for this to work, as performance suffered when the probability estimation trees were trained on small data sets (e.g., two seasons’ worth of data).


  1. As a hand-crafted partitioning, we use the shot location matrix introduced by Michael Caley.
