Versatile Verification of Soccer Analytics Models

Jesse, Laurens, Wannes, Pieter
July 14th, 2021 · 6 min read

In soccer analytics, there has been an evolution towards using black-box models that are trained on large data sets. One of the most prominent classes of models is tree ensembles, such as gradient-boosted ensembles (e.g., XGBoost) and random forests. These models underpin advanced metrics for valuing the contributions of soccer players, such as VAEP and g+, as well as some implementations of expected goals (xG). They tend to offer superior performance to simpler models such as logistic regression, at the cost of being far less interpretable.

The lack of interpretability has several drawbacks. First, it makes it hard to gain insights from the learned models. For example, we may want to characterize situations where crosses or dribbles would yield high VAEP values. Second, it makes it hard to verify whether the model will always behave as we expect and want it to. In other words, how much trust should we place in the model's predictions?

In this post, we will describe novel AI research in the area of verification which provides the ability to reason about learned models and show how it can benefit soccer analytics. Specifically, we will:

  1. Briefly describe our approach for verification of tree ensembles
  2. Show how verification can be used to reason about what our VAEP model has learned about the game of soccer and give tactical advice based on these insights
  3. Show how verification can help debug the data used to train an xG model
  4. Show that action-value models based on boosted trees are susceptible to so-called adversarial examples; that is, they may value nearly identical event sequences quite differently

What is verification and how do we do it?

Verification is a relatively new and active area of research in machine learning. Its most popular application is the generation of so-called adversarial images, where the idea is to see whether an image classification model can be tricked by slightly changing an image. More generally, verification attempts to reason about a learned model in order to gain better insight into how it will behave in practice. Yet there are few practical examples in machine learning beyond image classification that illustrate the abilities of verification.

In this blog post, we will apply our work on verification of tree ensembles to models learned on soccer event stream data. Soccer is an ideal setting for demonstrating the concepts and the practical applicability of verification. In soccer analytics, verification could be used to answer a question such as: “Could a backward pass outside of the penalty box ever increase the probability of scoring?”

Notice how this question deviates from the standard use of a machine-learned model, which is:

Machine Learning
Given: A specific example as input
Do: Find its predicted label

Verification flips this task on its head.

Verification
Given:
(a) a desired label (in this case a high probability of scoring), and
(b) constraints on the example’s features (in this case the pass must move away from the goal, and the pass must occur outside of the penalty box)
Do: Find an example that satisfies these constraints and yields the desired label

Our approach is called Veritas. It considers only the model itself; it does not look at the data used to train the model at all. On a technical level, it uses search techniques from AI to try to construct examples that satisfy the constraints. For more details, please see our ICML paper.
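To make the flipped task concrete, below is a deliberately naive stand-in for a verifier: sample candidate examples inside box constraints on the features and keep the one the model scores highest. This only illustrates the problem statement; it is not the Veritas algorithm, which searches the structure of the tree ensemble itself and can also prove that no satisfying example exists. All names and the feature layout below are assumptions.

```python
import numpy as np

def naive_verify(model, lo, hi, threshold, n_samples=100_000, seed=0):
    """Search the box [lo, hi] for an example whose predicted probability
    of the positive class reaches `threshold`.

    A random-sampling stand-in for an exact verifier such as Veritas:
    it may find a witness, but it can never prove that none exists."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    proba = model.predict_proba(X)[:, 1]
    best = proba.argmax()
    return (X[best], proba[best]) if proba[best] >= threshold else None

# Example query, assuming a model over pass features (start_x, start_y,
# end_x, end_y) on a 105x68 pitch: "could a pass with both endpoints
# outside the penalty box yield a scoring probability above 0.1?"
# witness = naive_verify(pass_model, lo=[0, 0, 0, 0],
#                        hi=[88.5, 68, 88.5, 68], threshold=0.1)
# Non-box constraints (e.g., a *backward* pass, end_x < start_x) can be
# handled naively by rejection, i.e., discarding violating samples.
```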

Peeking under the hood of action-value models

In a panel at the Sloan Sports Analytics Conference in 2019, Daryl Morey asked Ted Knutson if it was ever useful to pass the ball backwards in soccer. Ted pointed to a then-recent paper by Javier Fernandez and Luke Bornn that contained examples of backward passes in matches that a black-box model had identified as increasing a team's chance of scoring. This highlights the current paradigm for answering questions about a black-box model: you look at its predictions on observed examples and check whether an outcome of interest occurs. What is lacking is the ability to generalize from these observations and characterize generic situations where passing the ball backwards would be useful. The ability to reason about a learned model can fill this gap. Here, we address the question of what increases the odds of scoring more: passing from the green zone indicated below to the red zone on the flank, or passing backwards.

[Interactive figure: hover over the green rectangle to view the probability of scoring a goal within the next 10 actions after a pass to one of the red zones.]

The answer depends on the location the pass is taken from and the targeted location within each zone, but the differences are small in any case. It seems that passing backwards can be a viable option when it is not possible to play the ball in the direction of the goal. Also, notice how passing the ball backward is much more advantageous when the pass is taken from the penalty box line. This is due to how people annotate the data: when confronted with an action near a line, their tendency is to record the event's location adjacent to, rather than on, the line. This is reinforced by how players unconsciously react to the actual lines themselves. We discuss how verification can help to identify these issues in the next section.
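In terms of the naive sketch from earlier, the comparison behind this figure boils down to two queries that differ only in the box constraining the pass's end location, taking the best example found in each. The coordinates and `pass_model` below are illustrative assumptions, not the zones from the figure:

```python
# Pass origin fixed to a central zone; compare the best forward pass to
# the flank against the best backward pass (coordinates are illustrative).
origin_lo, origin_hi = [70.0, 25.0], [85.0, 43.0]  # start_x, start_y
flank_lo, flank_hi = [88.5, 0.0], [105.0, 13.0]    # forward, to the flank
back_lo, back_hi = [50.0, 25.0], [70.0, 43.0]      # backwards

_, p_flank = naive_verify(pass_model, origin_lo + flank_lo,
                          origin_hi + flank_hi, threshold=0.0)
_, p_back = naive_verify(pass_model, origin_lo + back_lo,
                         origin_hi + back_hi, threshold=0.0)
print(f"best pass to the flank: {p_flank:.3f}, best backward pass: {p_back:.3f}")
```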

As a second use case, we look into how to optimally cross the ball. Since the emergence of statistical analysis, crossing as a tactic has slowly been superseded by possession-based football, with questions raised about its inefficiency and waning conversion rate. Nevertheless, the cross remains a viable tactical option and an essential part of many teams' game plans. Rather than disappearing from the game entirely, the way teams cross the ball has evolved from the classic "bomb it in from the far sideline and hope it reaches a team-mate" to low cut-back crosses and curlers behind the backline. We can use verification to check whether these new strategies are indeed more efficient.

The figure below shows the probability of scoring on a successful cross from the selected point in the green box to any point in the red box (or the other way around if you hover over the red box).

[Interactive figure: hover over the green or red rectangle to view the probability of scoring a goal within the next 10 actions after a cross from the green to the red zone.]

The model has learned that:

  • The probability of scoring is highest if the crosser manages to reach a teammate in the goal area. Of course, this is the keeper’s domain and the success ratio is low here.
  • Crosses to the second post are slightly more likely to result in a goal than crosses to the first post. When crossing from outside the penalty box, it is better to aim for the second post. When crossing from inside the penalty box, it is better to aim for the first post.
  • Players should dribble in a couple of meters from the sideline before crossing the ball.
  • Trying to reach the backline before crossing is not needed. A curler is the most efficient type of cross when aiming for the zone between the goal and the penalty spot.
  • Perhaps surprisingly, cut-backs do not have a higher scoring probability than other crosses in the zone between the goal and the penalty spot, but have a much higher probability of scoring compared to other crosses when the ball is played further away from the goal.

Debugging the data used to train an xG model

We trained a simple xG model with XGBoost, using the shot's coordinates, distance to goal, and angle to the center of the goal as features. We only considered shots kicked with the foot. Next, we used our verification approach to generate the 200 hypothetical shots from outside the penalty box that have the highest chance of resulting in a goal according to the model. That is, we constrain the shot's location to be outside of the penalty box and allow the algorithm to construct examples that satisfy this constraint.
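For concreteness, here is a minimal sketch of how such a model could be set up. The pitch dimensions, feature definitions, and the random placeholder data are assumptions; the real model was trained on annotated shot events.

```python
import numpy as np
import xgboost as xgb

GOAL = np.array([105.0, 34.0])  # assumed 105x68 pitch, goal centre at (105, 34)

def shot_features(xy):
    """The model's features: coordinates, distance to goal, and the angle
    to the centre of the goal."""
    d = GOAL - xy
    dist = np.hypot(d[:, 0], d[:, 1])
    angle = np.abs(np.arctan2(d[:, 1], d[:, 0]))  # 0 = straight at the goal
    return np.column_stack([xy, dist, angle])

# Placeholder data standing in for annotated foot-kicked shots.
rng = np.random.default_rng(0)
xy = rng.uniform([60.0, 0.0], [105.0, 68.0], size=(5000, 2))
is_goal = rng.integers(0, 2, size=5000)

xg_model = xgb.XGBClassifier(n_estimators=100, max_depth=4,
                             eval_metric="logloss")
xg_model.fit(shot_features(xy), is_goal)
```

The 200 generated shots are shown in the following heatmap: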

[Heatmap of the generated shot locations.]

One cluster of instances is found on the edge of the box, directly in front of the center of the goal. This makes sense and corresponds to areas where it may be advantageous for teams to shoot.

However, the instances generated near the corner spots are unexpected, so we investigated the data. In the 5 m square around the corner spots, there are 11 shots and 8 goals, which yields an extremely high 72% conversion rate. One possible explanation lies in how the data was recorded by human annotators. If a player kicks the ball from a location near the corner, annotators will typically label the action as a pass or cross, and will only label it a shot in the unlikely event that it results in a goal or a save. While this particular bias is well known in the soccer analytics community, it highlights how verification can surface unexpected patterns and biases hidden in the data, and hence provide insight into, e.g., how the data was collected and annotated.
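A check like this is easy to reproduce on the raw event data. A sketch, assuming a pandas DataFrame `shots` with columns `x`, `y`, and `is_goal` on a 105x68 pitch (all names are assumptions):

```python
import pandas as pd

# shots: DataFrame of annotated shot events (assumed to be loaded already)
near_corner = shots[(shots["x"] >= 100.0) &
                    ((shots["y"] <= 5.0) | (shots["y"] >= 63.0))]
print(len(near_corner), "shots near the corner spots,",
      f"conversion rate {near_corner['is_goal'].mean():.0%}")
```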

Adversarial Examples

After an action-value model like VAEP has been trained, it will be used to rate players' actions in many future games and seasons. Given an action sequence, a machine-learned model will always make a prediction, and it is desirable that these predictions conform to our expectations. Usually they do. Unfortunately, the model may behave unexpectedly in certain situations. We will focus on two cases:

  1. What happens if the new data contains an action sequence that is very dissimilar to any of the sequences observed during training? This could arise due to mistakes in how the event data was annotated, such as wrong coordinates or a wrong action type.
  2. Will slightly perturbing the data lead to a dramatically different action value? This is often referred to as an adversarial example in machine learning.

To illustrate the first case, consider the sequence of a cross followed by a shot shown below:

[Figure: a cross followed by a shot, both from implausible locations.]

We would not expect an action to be labeled as a cross at this location, and a player is highly unlikely to shoot from this location. Yet if such a sequence appeared in the data, the shot would be valued at 0.25, which is extremely high. This would only be a sensible value in extremely rare situations (e.g., the goalkeeper being nowhere near the goal).

To illustrate the second case, we hold everything about these two actions constant except for the time remaining in the half when the action was performed. The following plot shows how the probability of scoring in the next 10 actions after the shot evolves as a function of time:
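Such a sweep is straightforward to produce: make copies of the shot's feature vector, vary only the time feature, and score every copy. A sketch, where the trained model `vaep_model`, the shot's feature vector `shot_vec`, and the time feature's index `TIME_IDX` are assumed to exist:

```python
import numpy as np
import matplotlib.pyplot as plt

seconds = np.linspace(0.0, 45.0 * 60.0, 200)  # seconds played in the half
X = np.tile(shot_vec, (len(seconds), 1))      # identical copies of the shot
X[:, TIME_IDX] = seconds                      # vary only the time feature

p = vaep_model.predict_proba(X)[:, 1]  # P(score within the next 10 actions)
plt.plot(seconds / 60.0, p)
plt.xlabel("minutes into the first half")
plt.ylabel("P(score within 10 actions)")
plt.show()
```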

The probability gradually increases until about 27 minutes into the first half. Then the probability of scoring spikes dramatically and more than doubles. Clearly, this behavior is undesirable, as we would not expect these probabilities to vary so much. It also suggests that we should handle time differently in the model.


This post is based on the following publication: Laurens Devos, Wannes Meert, and Jesse Davis (2021). Versatile Verification of Tree Ensembles. In Proceedings of the 38th International Conference on Machine Learning, PMLR 139:2654-2664.
Source code is available on GitHub.
