Our Publications

Below you can find an overview of our scientific publications, technical reports and invited talks.



2024

  • Jesse Davis, Pieter Robberechts

    Expected Goals (xG) has emerged as a popular tool for evaluating finishing skill in soccer analytics. It involves comparing a player's cumulative xG with their actual goal output, where consistent overperformance indicates strong finishing ability. However, the assessment of finishing skill in soccer using xG remains contentious due to players' difficulty in consistently outperforming their cumulative xG. In this paper, we aim to address the limitations and nuances surrounding the evaluation of finishing skill using xG statistics. Specifically, we explore three hypotheses: (1) the deviation between actual and expected goals is an inadequate metric due to the high variance of shot outcomes and limited sample sizes, (2) the inclusion of all shots in cumulative xG calculation may be inappropriate, and (3) xG models contain biases arising from interdependencies in the data that affect skill measurement. We found that sustained overperformance of cumulative xG requires both high shot volumes and exceptional finishing, that including all shot types can obscure the finishing ability of proficient strikers, and that there is a persistent bias that makes actual and expected goals appear closer for excellent finishers than they really are. Overall, our analysis indicates that we need more nuanced quantitative approaches for investigating a player's finishing ability, which we achieved using a technique from AI fairness to learn an xG model that is calibrated for multiple subgroups of players. As a concrete use case, we show that (1) the standard biased xG model underestimates Messi's GAX by 17% and (2) Messi's GAX is 27% higher than that of the typical elite high-shot-volume attacker, indicating that Messi is an even more exceptional finisher than commonly believed.


2023

  • StatsBomb

    Deniz Can Oruç, Lorenzo Cascioli, Luca Stradiotti, Maaike Van Roy, Pieter Robberechts, Jesse Davis

    No abstract available


  • MLSA

    Maaike Van Roy, Lorenzo Cascioli, Jesse Davis

    Event data, which records high-level semantic events (e.g., passes), and tracking data, which records positional information for all players, are the two main types of advanced data streams used for analyses in soccer. While both streams can yield relevant insights when analyzed separately, combining them allows us to capture the entirety of the game. However, doing so is complicated by the fact that the two data streams are often not synchronized with each other. That is, the timestamp associated with an event in the event data does not correspond to the analogous frame in the tracking data. Thus, a key problem is to align these sources. However, few papers explicitly describe approaches for doing so. In this paper, we propose a rule-based approach, ETSY, for synchronizing event and tracking data, evaluate it, and compare it experimentally and conceptually with the few state-of-the-art approaches available.

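
The synchronization problem described in this abstract can be illustrated with a minimal sketch. This is not the ETSY algorithm itself, whose rules are not given here; it is a simplified stand-in that assumes a single rule (pick the frame, within a time window around the event timestamp, where the ball is closest to the acting player). The function name `sync_event`, the dictionary layout, and the `window` parameter are all hypothetical.

```python
import math

def sync_event(event, frames, window=2.0):
    """Return the index of the tracking frame that best matches an event.

    Assumed rule (illustrative only): among frames within `window` seconds
    of the event timestamp, choose the one where the ball is closest to
    the player who performed the event. Returns None if no frame qualifies.
    """
    best_idx, best_dist = None, math.inf
    for i, frame in enumerate(frames):
        if abs(frame["t"] - event["t"]) > window:
            continue  # frame lies outside the search window
        px, py = frame["players"][event["player"]]
        bx, by = frame["ball"]
        dist = math.hypot(px - bx, py - by)
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx

# Hypothetical data: the event clock says 10.0 s, but the ball only
# reaches the player in the frame recorded at 10.4 s.
frames = [
    {"t": 9.8, "ball": (50.0, 30.0), "players": {"p7": (40.0, 30.0)}},
    {"t": 10.4, "ball": (41.0, 30.0), "players": {"p7": (40.5, 30.0)}},
    {"t": 20.0, "ball": (10.0, 10.0), "players": {"p7": (10.0, 10.0)}},
]
event = {"t": 10.0, "player": "p7"}
matched = sync_event(event, frames)
```

The third frame has the ball exactly at the player's position but is excluded because it falls outside the window, which is why a time constraint is needed alongside the distance rule.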

  • MLSA

    Jesse Davis, Pieter Robberechts

    Expected Goals (xG) has emerged as a popular tool for evaluating finishing skill in soccer analytics. It involves comparing a player’s cumulative xG with their actual goal output, where consistent overperformance indicates strong finishing ability. However, the assessment of finishing skill in soccer using xG remains contentious due to players’ difficulty in consistently outperforming their cumulative xG. In this paper, we aim to address the limitations and nuances surrounding the evaluation of finishing skill using xG statistics. Specifically, we explore three hypotheses: (1) the deviation between actual and expected goals is an inadequate metric due to the high variance of shot outcomes and limited sample sizes, (2) the inclusion of all shots in cumulative xG calculation may be inappropriate, and (3) xG models contain biases arising from interdependencies in the data that affect skill measurement. Our findings indicate that the natural variability in performance throughout a season is significant, to the extent that even players with average finishing skills have a reasonable chance of surpassing their xG values. Moreover, including all shot types can obscure the finishing ability of proficient strikers. Additionally, our simulation experiments demonstrate that the presence of a realistic bias in the data, stemming from proficient finishers taking a higher proportion of shots, reduces the effectiveness of utilizing xG as a skill measurement tool. Our analysis indicates that we need more nuanced quantitative approaches for investigating a player’s finishing ability.

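
The first hypothesis (shot-outcome variance with limited sample sizes) can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's experimental setup: the season of 80 shots at 0.125 xG each, the function name `prob_outperform_xg`, and the number of simulated seasons are all hypothetical choices.

```python
import random

def prob_outperform_xg(shot_xg, n_seasons=10_000, seed=0):
    """Estimate how often a perfectly average finisher beats cumulative xG.

    Each shot is converted with probability equal to its xG, so the
    simulated player has no finishing skill beyond what xG already predicts.
    """
    rng = random.Random(seed)
    total_xg = sum(shot_xg)
    over = 0
    for _ in range(n_seasons):
        goals = sum(1 for p in shot_xg if rng.random() < p)
        if goals > total_xg:
            over += 1
    return over / n_seasons

# Hypothetical season: 80 shots worth 0.125 xG each (cumulative xG = 10.0).
season = [0.125] * 80
p_over = prob_outperform_xg(season)
```

Under these assumptions a sizeable fraction of simulated seasons ends above cumulative xG purely by chance, which is the sense in which a single season's overperformance is weak evidence of finishing skill.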

  • KDD

    Pieter Robberechts, Maaike Van Roy, Jesse Davis

    Creativity is highly valued in soccer players. It contributes to exciting and unpredictable play, which can help teams to overcome defensive strategies and create scoring opportunities. Consequently, evaluating the creative abilities of players is an important aspect of the player recruitment process. However, there is currently no clear way to measure creativity in soccer. It is not captured by the typical result-based performance indicators, as being creative entails going beyond just doing something useful, to accomplishing something useful but in a unique or atypical way. Therefore in this paper, we define a novel metric to quantify the level of creativity involved in a player's passes. Our Creative Decision Rating (CDR) utilizes machine learning techniques to assess two important factors: the originality of a pass, and its value in terms of increasing the team's chances of scoring a goal. We validated our metric on StatsBomb 360 contextual event stream data of the 2021/22 English Premier League season and show through a number of use cases that it provides another angle on a player's skill, complementing existing player evaluation metrics. Overall, our metric provides a concise method for capturing and quantifying the creativity of soccer players and could have important implications for player recruitment and talent development in the sport.


  • JAIR

    Maaike Van Roy, Pieter Robberechts, Wen-Chi Yang, Luc De Raedt, Jesse Davis

    Strategy-optimization is a fundamental element of dynamic and complex team sports such as soccer, American football, and basketball. As the amount of data that is collected from matches in these sports has increased, so has the demand for data-driven decision-making support. If alternative strategies need to be balanced, a data-driven approach can uncover insights that are not available from qualitative analysis. This could tremendously aid teams in their match preparations. In this work, we propose a novel Markov model-based framework for soccer that allows reasoning about the specific strategies teams use in order to gain insights into the efficiency of each strategy. The framework consists of two components: (1) a learning component, which entails modeling a team’s offensive behavior by learning a Markov decision process (MDP) from event data that is collected from the team’s matches, and (2) a reasoning component, which involves a novel application of probabilistic model checking to reason about the efficacy of the learned strategies of each team. In this paper, we provide an overview of this framework and illustrate it on several use cases using real-world event data from three leagues. Our results show that the framework can be used to reason about the shot decision-making of teams and to optimise the defensive strategies used when playing against a particular team. The general ideas presented in this framework can easily be extended to other sports.


  • WCSF

    Lotte Bransen, Jesse Davis

    No abstract available


2022

  • IACSS

    Maaike Van Roy, Jesse Davis

    In soccer analytics, tree ensembles are often used to predict the expected value of different actions (e.g., shots, passes, dribbles). However, a relevant question is how much trust can we place in the model’s predictions? While in general, these models perform well, there are two scenarios where their predictions should be treated with caution. First, if the data contains annotation errors, then the model’s prediction is inherently wrong. Second, and more subtly, are actions that are highly dissimilar to what the model has seen during training. Machine-learned models struggle with extrapolation, and hence the model’s value for such actions may be unreliable. This work aims to automatically flag the above two scenarios, to help contextualize such models’ predictions.


  • PhD thesis

    Arne De Brabandere

    Time series, i.e., data collected from processes that change over time, are collected more often than we think. For example, activity trackers continuously record our heart rate, stock traders observe daily stock prices, and weather forecasters carefully analyse meteorological data. Time series typically contain a large number of observations, which makes analysing these observations by hand a complex and time-consuming task. Therefore, we need tools that support practitioners by automating time series analysis. In this dissertation, we focus on two time series analysis tasks: feature construction and change point detection. We aim to address three challenges that are not solved by existing automated time series analysis methods. First, current feature construction methods only analyse each time series individually. By not exploiting relations between multiple series, these methods may miss important features. Second, automated feature construction methods are completely data-driven. However, domain experts can often make suggestions about which kind of features are potentially relevant. Unfortunately, existing methods are not able to incorporate these suggestions. Third, change point detection is currently tackled in either a fully supervised or fully unsupervised setting. On the one hand, supervised methods can find an accurate segmentation by exploiting labels. However, these methods require annotating the data, which is a time-consuming process. On the other hand, unsupervised methods require no labels, but have to make assumptions about how the underlying statistics of the time series correlate to the series' state. Unfortunately, making incorrect assumptions may lead to different change points than expected. This dissertation makes five main contributions. The first two contributions present two automated time series analysis approaches that address the challenges described above. 
    First, we propose an automated feature construction method that exploits relations between multiple time series by fusing multiple series. Our approach can incorporate domain knowledge in the form of metadata and compatibility constraints. Second, we propose a semi-supervised change point detection method that uses active learning to obtain labels. In the remaining three contributions, we evaluate the performance of our automated feature construction approach on real-world time series data in health applications. First, we develop an activity recognition model and propose techniques to improve the model's performance on real-world data collected from patients. Second, we develop a joint loading estimation model based on data collected by a mobile phone. Third, we compare the performance of hand-crafted and automatically constructed features for epileptic seizure detection.


  • Front Bioeng Biotechnol

    Loren Nuyts, Arne De Brabandere, Sam Van Rossom, Jesse Davis, Benedicte Vanwanseele

    Although running has many benefits for both physical and mental health, it also involves the risk of injuries, which result in negative physical, psychological and economic consequences. Those injuries are often linked to specific running biomechanical parameters such as the pressure pattern of the foot while running, and these parameters could potentially be indicative of future injuries. Previous studies focus solely on some specific type of running injury and are often only applicable to a gender- or running-experience-specific population. The purpose of this study is, for both male and female first-year students, (i) to predict the development of a lower extremity overuse injury in the next six months based on foot pressure measurements from a pressure plate and (ii) to identify the predictive loading features. For the first objective, we developed a machine learning pipeline that analyzes foot pressure measurements and predicts whether a lower extremity overuse injury is likely to occur with an AUC of 0.639 and a Brier score of 0.201. For the second objective, we found that the higher pressures exerted on the forefoot are the most predictive for lower extremity overuse injuries and that foot areas from both the lateral and the medial side are needed. Furthermore, there are two kinds of predictive features: the angle of the FFT coefficients and the coefficients of the autoregressive (AR) process. However, these features are not interpretable in terms of the running biomechanics, limiting their practical use for injury prevention.

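
The two evaluation metrics reported in this abstract, AUC and Brier score, can be computed as in the minimal sketch below. The labels and predicted probabilities are made-up illustrative values, not data from the study, and the helper names `brier_score` and `auc` are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def auc(y_true, y_prob):
    """Probability that a random positive is ranked above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Illustrative values only: 1 = injured within follow-up, 0 = not injured.
y_true = [1, 0, 1, 0, 0]
y_prob = [0.8, 0.3, 0.4, 0.5, 0.1]

example_brier = brier_score(y_true, y_prob)
example_auc = auc(y_true, y_prob)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, while a lower Brier score indicates better-calibrated, more accurate probabilities.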

  • StatsBomb

    Pieter Robberechts, Maaike Van Roy, Jesse Davis

    Kevin De Bruyne is an unparalleled genius when it comes to bringing creativity to the pitch. Time after time, he sees the options that other players don’t see and his sparks of creativity have frequently turned a closed game around. Clubs and analysts therefore look for this trait when scouting for new players. However, it is generally unknown how creativity can be concisely captured and quantified. To aid clubs and analysts in their scouting process, our research will propose a novel performance metric to quantify the creative abilities of soccer players. Additionally, we will show the immediate applicability of this metric by illustrating it in various scenarios.


  • EBeM

    Jesse Davis, Lotte Bransen, Laurens Devos, Wannes Meert, Pieter Robberechts, Jan Van Haaren, Maaike Van Roy

    There has been an explosion of data collected about sports. Because such data is extremely rich and complicated, machine learning is increasingly being used to extract actionable insights from it. Typically, machine learning is used to build models and indicators that capture the skills, capabilities, and tendencies of athletes and teams. Such indicators and models are in turn used to inform decision-making at professional clubs. Unfortunately, how to evaluate the use of machine learning in the context of sports remains extremely challenging. On the one hand, it is necessary to evaluate the developed indicators themselves, where one is confronted by a lack of labels and small sample sizes. On the other hand, it is necessary to evaluate the models themselves, which is complicated by the noisy and non-stationary nature of sports data. In this paper, we highlight the inherent evaluation challenges in sports and discuss a variety of approaches for evaluating both indicators and models. In particular, we highlight how reasoning techniques, such as verification, can be used to aid in the evaluation of learned models.


  • ECML/PKDD

    Jeroen Clijmans, Maaike Van Roy, Jesse Davis

    Analyzing the offensive playing style of teams is an important task within soccer analytics that has various applications in match preparation and scouting. Existing data-driven approaches typically quantify style by looking at individual events that occur during a match in isolation. This approach has two shortcomings. First, it ignores the sequential aspect of the game, as patterns of play are a crucial aspect of playing style. Second, it fails to generalize over the limited amount of data in order to model slight variations of the observed patterns that a team may employ in the future. This is particularly important when considering rare actions like shots and goals, which are the key success criteria of an offensive style. This paper proposes a novel approach for analyzing playing style that addresses these shortcomings. First, it captures the sequential patterns of a team’s style by modeling the observed behavior of a team as a discrete-time Markov chain. Second, it characterizes the offensive style of teams in a number of features that are based on domain knowledge. It applies a combination of analytical techniques and probabilistic model checking to reason about a team’s model in order to extract values for these features. As the model allows for a generalization of a team’s past behavior, the extracted style is less influenced by the rarity of shots and goals. Using event stream data of the 2019/20 English Premier League, we empirically show that the proposed approach can capture a team’s positional and sequential style, as well as reason about the style’s efficiency and similarities with other teams.

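
Modeling a team's behavior as a discrete-time Markov chain makes questions such as "how likely is a possession starting in this zone to end in a goal?" computable. The sketch below shows this for a toy absorbing chain. The three zones, the transition probabilities, and the function name `goal_probability` are hypothetical, and plain fixed-point iteration stands in for the probabilistic model checking used in the paper.

```python
def goal_probability(transitions, n_iter=1000):
    """Probability of eventually reaching 'goal' from each transient state.

    `transitions` maps each transient state to {next_state: probability}.
    'goal' and 'loss' are absorbing states with values 1.0 and 0.0.
    """
    value = {state: 0.0 for state in transitions}
    value["goal"] = 1.0
    value["loss"] = 0.0
    for _ in range(n_iter):
        for state, nxt in transitions.items():
            value[state] = sum(p * value[t] for t, p in nxt.items())
    return value

# Hypothetical team model with three pitch zones; each row sums to 1.
transitions = {
    "defence": {"defence": 0.5, "midfield": 0.3, "loss": 0.2},
    "midfield": {"defence": 0.2, "midfield": 0.4, "attack": 0.2, "loss": 0.2},
    "attack": {"midfield": 0.3, "attack": 0.3, "goal": 0.1, "loss": 0.3},
}
values = goal_probability(transitions)
```

Because the values are derived from the whole chain rather than from observed goals alone, they remain defined even for zones from which the team never scored directly, which is the generalization benefit the abstract refers to.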

  • Int J Sports Physiol Perform

    Kobe C Houtmeyers, Pieter Robberechts, Arne Jaspers, Shaun J McLaren, Michel S Brink, Jos Vanrenterghem, Jesse J Davis, Werner F Helsen

    PURPOSE: To examine the utility of differential ratings of perceived exertion (dRPE) for monitoring internal intensity and load in association football. METHODS: Data were collected from 2 elite senior male football teams during 1 season (N = 55). External intensity and load data (duration × intensity) were collected during each training and match session using electronic performance and tracking systems. After each session, players rated their perceived breathlessness and leg-muscle exertion. Descriptive statistics were calculated to quantify how often players rated the 2 types of rating of perceived exertion differently (dRPEDIFF). In addition, the association between dRPEDIFF and external intensity and load was examined. First, the associations between single external variables and dRPEDIFF were analyzed using a mixed-effects logistic regression model. Second, the link between dRPEDIFF and session types with distinctive external profiles was examined using the Pearson chi-square test of independence. RESULTS: On average, players rated their session perceived breathlessness and leg-muscle exertion differently in 22% of the sessions (range: 0%-64%). Confidence limits for the effect of single external variables on dRPEDIFF spanned across largely positive and negative values for all variables, indicating no conclusive findings. The analysis based on session type indicated that players differentiated more often in matches and intense training sessions, but there was no pattern in the direction of differentiation. CONCLUSIONS: The findings of this study provide no evidence supporting the utility of dRPE for monitoring internal intensity and load in football.


  • Sensors

    Sieglinde Bogaert, Jesse Davis, Sam Van Rossom, Benedicte Vanwanseele

    Even though practicing sports has great health benefits, it also entails a risk of developing overuse injuries, which can have a negative impact on physical, mental, and financial health. Being able to predict the risk of an overuse injury arising is of widespread interest because this may play a vital role in preventing its occurrence. In this paper, we present a machine learning model trained to predict the occurrence of a lower-limb overuse injury (LLOI). This model was trained and evaluated using data from a three-dimensional accelerometer on the lower back, collected during a Cooper test performed by 161 first-year undergraduate students of a movement science program. In this study, gender-specific models performed better than mixed-gender models. The estimated area under the receiver operating characteristic curve of the best-performing male- and female-specific models, trained according to the presented approach, was, respectively, 0.615 and 0.645. In addition, the best-performing models were achieved by combining statistical and sports-specific features. Overall, the results demonstrated that a machine learning injury prediction model is a promising, yet challenging approach.


2021

  • STC

    Lotte Bransen, Jesse Davis

    While infrequent, penalties are an important aspect of football. The relative paucity of goals from open play and the high chance of converting a penalty mean that converting one can swing the outcome of a match. Moreover, penalty shootouts can play a huge role in knock-out tournaments. Penalties are a game between the penalty taker and the goalkeeper, with each participant always looking for an edge. This paper tries to provide such an edge to goalies by proposing a model that predicts the location of a penalty based on in-match information on the penalty taker’s performance in the match. We observe that players who tend to play well are more likely to aim for their natural side (i.e. left for right-footed players, right for left-footed players), whereas factors such as the goalkeeper’s performance in the match, the scoreline and whether the penalty was part of a penalty shootout influence the taker’s choice. We also observe that there are differences in how men and women alter their tendencies based on their in-match performance. Finally, we show how our Penalty Direction Predictor can be used in real-time during matches and provide some illustrative examples.


  • StatsBomb

    Maaike Van Roy, Pieter Robberechts, Jesse Davis

    Our research proposes a novel way to identify a team’s most effective build-up patterns and the corresponding defensive positioning strategies to disrupt them.


  • MLSA

    Simon Merckx, Pieter Robberechts, Yannick Euvrard, Jesse Davis

    Pressing is an important aspect of a soccer team’s defensive strategy. By exerting pressure on the player in possession of the ball, the goal is to win the ball back or at the very least deny the opponents the opportunity to develop an attack. Analyzing and evaluating the effectiveness of pressing strategies is a very important task for any professional match analyst, but is currently done exclusively manually by observing video footage. Automating the task saves analysts a tremendous amount of time, standardizes the otherwise subjective task, and allows trends to be identified within larger data sets. Therefore, the purpose of this work is to automate the analysis of a soccer team's defensive pressing strategy. Based on a combination of positional and event data, we first detect pressing situations using a set of expert-based rules. These pressing situations are subsequently evaluated objectively by modelling pressing as a trade-off between the benefits of recovering the ball versus the cost of leaving the defensive structure, which makes passing through the lines easier for the opposition. We applied this automatic analysis to all matches from a full regular season of the Belgian league and show how our metric can be used in practice through a number of use cases.


  • ECSS

    Kobe Houtmeyers, Pieter Robberechts, Werner Helsen, Jesse Davis, Arne Jaspers, Shaun McLaren, Michel Brink, Jos Vanrenterghem

    No abstract available


  • KDD

    Pieter Robberechts, Jan Van Haaren, Jesse Davis

    In-game win probability models, which provide a sports team's likelihood of winning at each point in a game based on historical observations, are becoming increasingly popular. In baseball, basketball and American football, they have become important tools to enhance fan experience, to evaluate in-game decision-making, and to inform coaching decisions. While equally relevant in soccer, the adoption of these models is held back by technical challenges arising from the low-scoring nature of the sport. In this paper, we introduce an in-game win probability model for soccer that addresses the shortcomings of existing models. First, we demonstrate that in-game win probability models for other sports struggle to provide accurate estimates for soccer, especially towards the end of a game. Second, we introduce a novel Bayesian statistical framework that estimates running win, tie and loss probabilities by leveraging a set of contextual game state features. An empirical evaluation on eight seasons of data for the top-five soccer leagues demonstrates that our framework provides well-calibrated probabilities. Furthermore, two use cases show its ability to enhance fan experience and to evaluate performance in crucial game situations.


  • AISA

    Maaike Van Roy, Wen-Chi Yang, Luc De Raedt, Jesse Davis

    Markov models are commonly used to model professional sports matches as they enable modelling the various actions players may take in a particular game state. In this paper, our objective is to reason about the goal-directed policies these players follow. Concretely, we focus on soccer and propose a novel Markov decision process (MDP) that models the behavior of the team possessing the ball. To reason about these learned policies, we employ techniques from probabilistic model checking. Our analysis focuses on defense, where a team aims to minimize its risk of conceding a goal (i.e., its opponent scores). Specifically, we analyze the MDP in order to gain insight into various ways an opponent may generate dangerous situations, that is, ones where the opponent may score a goal, during a match. Then, we use probabilistic model checking to assess how much a team can lower its chance of conceding by employing different ways to prevent these dangerous situations from arising. Finally, we consider how effective the defensive strategies remain once the offensive team adapts to them. We provide multiple illustrative use cases by analyzing real-world event stream data from professional soccer matches in the English Premier League.


  • AISA

    Lotte Bransen, Jesse Davis

    Technical data such as event or optical tracking data from men’s football (soccer) matches have been extensively analysed using techniques from AI on a variety of different levels. However, there has been very little analysis of the women’s game. In this work we take an initial step towards analysing professional women’s football. Using event data covering a number of seasons from the top women’s leagues, we perform two analyses. First, we perform an exploratory analysis by computing several technical indicators (e.g., goal scoring rates over the season, conversion rates, shot locations), compare and contrast them with the indicators for comparable men’s leagues, and find several intriguing differences. Second, we assess whether xG models trained on data from one gender are applicable to data from the other gender.


  • RL4RealLife

    Maaike Van Roy, Pieter Robberechts, Wen-Chi Yang, Luc De Raedt, Jesse Davis

    Reinforcement learning techniques are often used to model and analyze the behavior of sports teams and players. However, learning these models from observed data is challenging. The data is very sparse and does not include the intended end location of actions which are needed to model decision making. Evaluating the learned models is also extremely difficult as no ground truth is available. In this work, we propose an approach that addresses these challenges when learning a Markov model of professional soccer matches from event stream data. We apply a combination of predictive modeling and domain knowledge to obtain the intended end locations of actions and learn the transition model using a Bayesian approach to resolve sparsity issues. We provide intermediate evaluations as well as an approach to evaluate the final model. Finally, we show the model's usefulness in practice for both evaluating and rating players' decision making using data from the 17/18 and 18/19 English Premier League seasons.


  • ICML

    Laurens Devos, Wannes Meert, Jesse Davis

    Machine learned models often must abide by certain requirements (e.g., fairness or legal). This has spurred interest in developing approaches that can provably verify whether a model satisfies certain properties. This paper introduces a generic algorithm called Veritas that enables tackling multiple different verification tasks for tree ensemble models like random forests (RFs) and gradient boosted decision trees (GBDTs). This generality contrasts with previous work, which has focused exclusively on either adversarial example generation or robustness checking. Veritas formulates the verification task as a generic optimization problem and introduces a novel search space representation. Veritas offers two key advantages. First, it provides anytime lower and upper bounds when the optimization problem cannot be solved exactly. In contrast, many existing methods have focused on exact solutions and are thus limited by the verification problem being NP-complete. Second, Veritas produces full (bounded suboptimal) solutions that can be used to generate concrete examples. We experimentally show that our method produces state-of-the-art robustness estimates, especially when executed with strict time constraints. This is exceedingly important when checking the robustness of large datasets. Additionally, we show that Veritas enables tackling more real-world verification scenarios.


  • SSAC

    Maaike Van Roy, Pieter Robberechts, Wen-Chi Yang, Luc De Raedt, Jesse Davis

    Analysis of the popular expected goals (xG) metric in soccer has determined that a (slightly) smaller number of high-quality attempts will likely yield more goals than a slew of low-quality ones. This observation has driven a change in shooting behavior. Teams are passing up on shots from outside the penalty box, in the hopes of generating a better shot closer to goal later on. This paper evaluates whether this decrease in long-distance shots is warranted. To this end, we propose a novel generic framework to reason about decision-making in soccer by combining techniques from machine learning and artificial intelligence (AI). First, we model how a team has behaved offensively over the course of two seasons by learning a Markov Decision Process (MDP) from event stream data. Second, we apply reasoning techniques arising from the AI literature on verification to each team's MDP. This allows us to reason about the efficacy of certain potential decisions by posing counterfactual questions to the MDP. Our key conclusion is that teams would score more goals if they shot more often from outside the penalty box in a small number of team-specific locations. The proposed framework can easily be extended and applied to analyze other aspects of the game.


  • Gait & Posture

    Pieter Robberechts, Rud Derie, Pieter Van den Berghe, Joeri Gerlo, Dirk De Clercq, Veerle Segers, Jesse Davis

    Gait event detection of the initial contact and toe off is essential for running gait analysis, allowing the derivation of parameters such as stance time. Heuristic-based methods exist to estimate these key gait events from tibial accelerometry. However, these methods are tailored to very specific acceleration profiles, which may cause complications when dealing with larger data sets and inherent biological variability. Therefore, this paper investigates whether a structured machine learning approach can achieve a more accurate prediction of running gait event timings from tibial accelerometry. Force-based event detection acted as the criterion measure in order to assess the accuracy, repeatability and sensitivity of the predicted gait events. A heuristic method and two structured machine learning methods were employed to derive initial contact, toe off and stance time from tibial acceleration signals. Both a structured perceptron model (median absolute error of stance time estimation: 10.00 ± 8.73 ms) and a structured recurrent neural network model (median absolute error of stance time estimation: 6.50 ± 5.74 ms) significantly outperformed the existing heuristic approach (median absolute error of stance time estimation: 11.25 ± 9.52 ms) on data from 93 rearfoot runners. Thus, results indicate that a structured recurrent neural network machine learning model offers the most accurate and consistent estimation of the gait events and the derived stance time during level overground running. The machine learning methods seem less affected by intra- and inter-subject variation within the data, allowing for accurate and efficient automated data output during rearfoot overground running. Furthermore, they offer possibilities for real-time monitoring and biofeedback during prolonged measurements, even outside the laboratory.

    PDF | DOI | Arxiv | BibTex

2020

  • IJCAI

    Tom Decroos, Lotte Bransen, Jan Van Haaren, Jesse Davis

    Despite the fact that objectively assessing the impact of the individual actions performed by soccer players during games is a crucial task, most traditional metrics have substantial shortcomings. First, many metrics only consider rare actions like shots and goals which account for less than 2% of all on-the-ball actions. Second, they fail to account for the context in which the actions occurred. This work summarizes several important contributions. First, we describe a language for representing individual player actions on the pitch. This language unifies several existing formats which greatly simplifies automated analysis and this language is becoming widely used in the soccer analytics community. Second, we describe our framework for valuing any type of player action based on its impact on the game outcome while accounting for the context in which the action happened. This framework enables giving a broad overview of a player's performance, including quantifying a player's total offensive and defensive contributions to their team. Third, we provide illustrative use cases that highlight the working and benefits of our framework.

    PDF | DOI | BibTex

  • PhD thesis

    Tom Decroos

    Soccer analytics has seen an explosion of interest in the last decade. The success of data analysis in other sports has driven soccer clubs and other stakeholders in soccer to wonder if they could also deepen their understanding of the game by analyzing data and translate this deepened understanding into tangible results such as signing good players and winning matches. Consequently, more data than ever is being collected in soccer. One prominent data source is event stream data, which is collected by human annotators who watch video feeds of soccer matches through special annotation software and rigorously describe all on-the-ball actions performed on the pitch such as passes, dribbles, interceptions, tackles, and shots. While event stream data is an incredibly rich data source, gleaning useful soccer insights from it has proven to be difficult in practice. One part of the problem is soccer being a fluid sport that involves many complex interactions between players. Furthermore, soccer's low-scoring nature and susceptibility to chance make it hard to correlate player skill with match results. Another part of the problem is event stream data being hard to analyze in its raw form. Analysts typically have to deal with a number of issues such as parsing complicated data structures, adapting to vendor-specific terminologies, dealing with data sparsity, scaling to millions of data points, and incorporating domain knowledge. These issues have motivated researchers to apply techniques from the field of artificial intelligence (AI) to event stream data, as these techniques are often intended to be used semi-autonomously on large and complicated data sets. Consequently, researchers have successfully used AI techniques such as classification, reinforcement learning, pattern mining, and network analysis to address soccer analytics tasks such as estimating shot quality, rating players, and detecting tactics. 
    However, existing literature on learning from event stream data with AI techniques shows a number of shortcomings. First, no efforts have been made to address the data engineering challenges of event stream data, severely obstructing the reproducibility of papers within the field. Second, no approaches exist for valuing on-the-ball actions that consider the full context in which actions are performed or recognize the value of defensive actions such as tackles and clearances. Third, existing works have not sufficiently explored how to best model the locations and directions of actions when capturing the playing style of teams and players. Most approaches that attempt to capture playing style either rudimentarily divide the pitch into zones or ignore the spatial component of event stream data altogether. This dissertation makes three main contributions to the field of soccer analytics that attempt to address these shortcomings. First, to better represent event stream data, we construct a new language that simplifies and unifies the data of different event stream data vendors, alleviating many data engineering challenges and encouraging the reproducibility of soccer analytics research. Second, we propose a framework for assigning values to on-the-ball actions that, compared to simpler metrics and possession-based approaches, considers a more complete view of the context in which actions occur. Our framework uses a simple and elegant formula that formalizes the intuition that all actions in a match are performed with the intention of increasing the chance of scoring a goal and/or decreasing the chance of conceding a goal. The latter point is what allows our framework to recognize the value of defensive actions. Third, we introduce a number of approaches that express the playing style of teams and players based on where on the pitch they perform certain types of actions. Our approaches improve over earlier work by modelling the spatial component of event stream data in a data-driven manner using decomposition techniques such as non-negative matrix factorization and mixture models.
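    The intuition behind the action-valuing formula described above can be sketched in a few lines. This is only an illustrative reading of the abstract, not the thesis's implementation: the `GameState` container, its field names, and the probability values in the usage note are hypothetical, and in practice the two probabilities would come from learned models.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    """Hypothetical container for model estimates in one game state."""
    p_score: float    # estimated chance that the acting team scores soon
    p_concede: float  # estimated chance that the acting team concedes soon


def action_value(before: GameState, after: GameState) -> float:
    """Value an action as the change it causes in scoring and conceding chances.

    An action is rewarded for raising the chance of scoring and for
    lowering the chance of conceding, so defensive actions can earn value too.
    """
    offensive_gain = after.p_score - before.p_score
    defensive_gain = before.p_concede - after.p_concede
    return offensive_gain + defensive_gain
```

    For instance, with made-up estimates, `action_value(GameState(0.02, 0.010), GameState(0.05, 0.005))` credits the action with both an offensive gain (0.03) and a defensive gain (0.005).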

    PDF | BibTex

  • ECML/PKDD

    Tom Decroos, Maaike Van Roy, Jesse Davis

    Analyzing playing style is a recurring task within soccer analytics that plays a crucial role in club activities such as player scouting and match preparation. It involves identifying and summarizing prototypical behaviors of teams and players that reoccur both within and across matches. Current techniques for analyzing playing style are often hindered by the sparsity of event stream data (i.e., the same player rarely performs the same action in the same location more than once). This paper proposes SoccerMix, a soft clustering technique based on mixture models that enables a novel probabilistic representation for soccer actions. SoccerMix overcomes the sparsity of event stream data by probabilistically grouping together similar actions in a data-driven manner. We show empirically how SoccerMix can capture the playing style of both teams and players and present an alternative view of a team's style that focuses not on the team's own actions, but rather on how the team forces its opponents to deviate from their usual playing style.

    PDF | DOI | BibTex

  • Front Bioeng Biotechnol

    Arne De Brabandere, Jill Emmerzaal, Annick Timmermans, Ilse Jonkers, Benedicte Vanwanseele, Jesse Davis

    Hip osteoarthritis patients exhibit changes in kinematics and kinetics that affect joint loading. Monitoring this load can provide valuable information to clinicians. For example, a patient's joint loading measured across different activities can be used to determine the amount of exercise that the patient needs to complete each day. Unfortunately, current methods for measuring joint loading require a lab environment which most clinicians do not have access to. This study explores employing machine learning to construct a model that can estimate joint loading based on sensor data obtained solely from a mobile phone. In order to learn such a model, we collected a dataset from 10 patients with hip osteoarthritis who performed multiple repetitions of nine different exercises. During each repetition, we simultaneously recorded 3D motion capture data, ground reaction force data, and the inertial measurement unit data from a mobile phone attached to the patient's hip. The 3D motion and ground reaction force data were used to compute the ground truth joint loading using musculoskeletal modeling. Our goal is to estimate the ground truth loading value using only the data captured by the sensors of the mobile phone. We propose a machine learning pipeline for learning such a model based on the recordings of a phone's accelerometer and gyroscope. When evaluated for an unseen patient, the proposed pipeline achieves a mean absolute error of 29% for the left hip and 36% for the right hip. While our approach is a step in the direction of using a minimal number of sensors to estimate joint loading outside the lab, developing a tool that is accurate enough to be applicable in a clinical context still remains an open challenge. It may be necessary to use sensors at more than one location in order to obtain better estimates.

    PDF | DOI | BibTex

  • Front Bioeng Biotechnol

    Rud Derie, Pieter Robberechts, Pieter Van den Berghe, Joeri Gerlo, Dirk De Clercq, Veerle Segers, Jesse Davis

    Ground reaction forces are often used by sport scientists and clinicians to analyze the mechanical risk factors of running-related injuries or athletic performance during a running analysis. An interesting ground reaction force-derived variable to track is the maximal vertical instantaneous loading rate (VILR). This impact characteristic is traditionally derived from a fixed force platform, but wearable inertial sensors might nowadays approximate its magnitude while running outside the lab. The time-discrete axial peak tibial acceleration (APTA) has been proposed as a good surrogate that can be measured using wearable accelerometers in the field. This paper explores the hypothesis that applying machine learning to time-continuous data (generated from bilateral tri-axial shin-mounted accelerometers) would result in a more accurate estimation of the VILR. Therefore, the purpose of this study was to evaluate the performance of accelerometer-based predictions of the VILR with various machine learning models trained on data of 93 rearfoot runners. A subject-dependent gradient boosted regression trees (XGB) model provided the most accurate estimates (mean absolute error: 5.39 ± 2.04 BW⋅s⁻¹, mean absolute percentage error: 6.08%). A similar subject-independent model had a mean absolute error of 12.41 ± 7.90 BW⋅s⁻¹ (mean absolute percentage error: 11.09%). All of our models had a stronger correlation with the VILR than the APTA (p < 0.01), indicating that multiple 3D acceleration features in a learning setting showed the highest accuracy in predicting the lab-based impact loading compared to APTA.

    PDF | DOI | BibTex

  • AITS

    Maaike Van Roy, Pieter Robberechts, Tom Decroos, Jesse Davis

    Objectively quantifying a soccer player's contributions within a match is a challenging and crucial task in soccer analytics. Many of the currently available metrics focus on measuring the quality of shots and assists only, although these represent less than 1% of all on-the-ball actions. Most recently, several approaches were proposed to bridge this gap. By valuing how actions increase or decrease the likelihood of yielding a goal, these models are effective tools for quantifying the performances of players for all sorts of actions. However, we lack an understanding of their differences, both conceptually and in practice. Therefore, this paper critically compares two such models: expected threat (xT) and valuing actions by estimating probabilities (VAEP). The two approaches differ in their design choices, which leads to different top player rankings and major differences in how they value specific actions.

    PDF | BibTex

  • AITS

    Tom Decroos, Jesse Davis

    Valuing the actions a soccer player performs in a match is a crucial problem in soccer analytics. While many approaches have been proposed for this problem, a commonality among them is the need to build a model that can predict for a given game state the probability of a goal occurring in the near future. Often these works have two common shortcomings. First, the predictive models are often not thoroughly evaluated or may even be evaluated according to the wrong performance metric. Second, there is a tendency to sacrifice interpretability for performance. Hence, the models often yield no insight into why a given game state has a higher or lower probability of resulting in a goal. This paper analyzes VAEP, a recently proposed approach for valuing actions, and its model for estimating the probability of scoring in the near future. We discuss a number of design choices related to building this model and share insights on how to properly evaluate it. Finally, we replace VAEP’s complicated noninterpretable gradient boosting tree model that uses 151 features with a simpler interpretable Generalized Additive Model (GAM) using only 10 features. We find that the GAM offers nearly identical performance to the more complicated gradient boost model while being interpretable and offering insights into what characteristics of a game state have an effect on the probability of scoring a goal in the near future.

    PDF | BibTex

  • Lotte Bransen, Pieter Robberechts, Jesse Davis, Tom Decroos, Jan Van Haaren

    No abstract available

    PDF | BibTex

  • MLSA

    Pieter Robberechts, Jesse Davis

    Motivated by the fact that some shots are better than others, the expected goals (xG) metric attempts to quantify the quality of goal-scoring opportunities in soccer. The metric is becoming increasingly popular, making its way to TV analysts’ desks. Yet, a vastly underexplored topic in the context of xG is how these models are affected by the data on which they are trained. In this paper, we explore several data-related questions that may affect the performance of an xG model. We showed that the amount of data needed to train an accurate xG model depends on the complexity of the learner and the number of features, with up to 5 seasons of data needed to train a complex gradient boosted trees model. Despite the style of play changing over time and varying between leagues, we did not find that using only recent data or league-specific models improves the accuracy significantly. Hence, if limited data is available, training models on less recent data or different leagues is a viable solution. Mixing data from multiple data sources should be avoided.

    PDF | DOI | BibTex

  • Opta Pro Forum

    Pieter Robberechts, Jan Van Haaren, Lotte Bransen, Jesse Davis

    No abstract available

    BibTex

2019

  • PhD thesis

    Tim Op De Beéck

    Research on the analysis of real-world sports data dates back at least to 1958 (Lindsey 1959; Rubin 1958). Advances in technology have caused an explosion in the amount of sports-related data. The abundance of data has attracted the interest of both the academic community and the industry. The aim of this sports analytics community is to leverage the available data to help decision makers to gain a competitive advantage (Alamar and Mehrotra 2011). The advent of wearable technology has yielded a new data source that still has a lot of unexplored potential. These data can assist practitioners to monitor athletes during daily life activities (Kwapisz et al. 2011) and rehabilitation (Um et al. 2017; Whelan et al. 2016), to quantify their training loads (Bourdon et al. 2017; Halson 2014; Jaspers, Brink, et al. 2017), and to analyze their risk of injury (Gabbett and Ullah 2012). From a data science perspective, these continuous monitoring data pose several interesting data challenges. First, combining the data of different athletes is non-trivial due to inter-individual differences. Second, because the behavior of athletes can change and because often only limited individual data are available, it is also non-trivial to model the data on an individual level. Third, the use of subjective measures to quantify certain aspects of the athlete (e.g., perceived wellness), confounding factors (e.g., running speed), and missing values further complicate the analysis of these data. In this thesis we evaluated how data science techniques can provide value to the analysis and interpretation of athletes' training load data. Our main focus is on the analysis of training load data from soccer players and outdoor runners. Specifically, we examined three relevant relationships. First, we studied how soccer players perceive external loads. Second, we modeled the relationship between external and internal load, and perceived wellness of soccer players. Third, we analyzed the relationship between biomechanical movement data of outdoor runners and their perceived fatigue status. We presented three types of evidence to support the dissertation statement. First, we found that both data-driven feature selection methods and simple statistical features can complement expert knowledge. Second, we illustrated that group models can be used to individually monitor an athlete when limited-to-no prior data are available for that athlete. Third, we showed that machine learning techniques are well suited to model the complex relationships that are relevant for the analysis of athletes' training load data: non-linear relationships, relationships between objective and subjective variables, and relationships where multicollinearity exists among the input variables. Additionally, we formulated some lessons learned for data scientists. We argued that modeling the context of an athlete's data, either explicitly or implicitly, can improve the performance of predictive models by adjusting for inter- and intra-subject differences and external factors. We presented several such strategies: standardizing features relative to an individual baseline, predicting a normalized target variable instead of the originally reported target variable, and adding the previous state as a feature. Moreover, we identified subtle data dependencies that hinder obtaining an unbiased estimation of a model's ability to generalize to unseen data. We identified three limitations of the current thesis. First, we evaluated the methodologies to monitor soccer players on the data of only one club. Second, the data collection protocol to collect outdoor data from runners experimentally controlled for total distance, intensity, and running surface and might have introduced a bias towards reporting higher fatigue scores near the end of the protocol. Third, RPE, a subjective measure used in every relationship of this thesis, quantifies muscular fatigue, as well as cardiovascular and psychological fatigue. Future research in this area can benefit from an interdisciplinary collaboration between data scientists, sports scientists and other domain experts. A close collaboration throughout all phases of the data science process can further advance the state of the art. First, it will improve the quality of the data that is being collected. Second, it can help to properly contextualize the data when modeling relevant relationships. Third, it will allow obtaining an unbiased estimation of these predictive models.

    PDF | BibTex

  • MLSA

    Kenneth Verstraete, Tom Decroos, Bruno Coussement, Nick Vannieuwenhoven, Jesse Davis

    Soccer players have a variety of skills such as passing, tackling, shooting and dribbling. However, their abilities are not fixed and evolve over time. Understanding this evolution could be interesting from many perspectives. We analyze player skill data from the FIFA video game series by EA Sports using tensor methods. This data can be organized as a tensor over three dimensions, namely players, skills, and age, which we explore in two different ways. First, we use a polyadic decomposition to uncover hidden structures among skills and see how these structures evolve over time. Second, we use a Tucker decomposition to predict how a specific player's skills will evolve over time.

    PDF | DOI | BibTex

  • MLSA

    Pieter Robberechts, Jan Van Haaren, Jesse Davis

    In-game win probability is a statistical metric that provides a sports team’s likelihood of winning at any given point in a game, based on the performance of historical teams in the same situation. In baseball, basketball and American football, these models serve as a tool to enhance the fan experience, evaluate in-game decision making and measure the risk-reward balance for coaching decisions. In contrast, they have received less attention in association football, because its low-scoring nature makes it far more challenging to analyze. In this paper, we build an in-game win probability model for football. Specifically, we first show that porting existing approaches from other sports does not yield good in-game win probability estimates. Second, we introduce our Bayesian statistical model that utilizes a set of eight variables to predict the running win, tie and loss probabilities for the home team. We train our model using event data from the last four seasons of the major European football competitions. Our results indicate that our model provides well-calibrated probabilities. Finally, we elaborate on two use cases for our win probability metric: enhancing the fan experience and evaluating performance in crucial situations.

    PDF | Arxiv | BibTex

  • ECML/PKDD

    Tom Decroos, Jesse Davis

    Transfer fees for soccer players are at an all-time high. To make the most of their budget, soccer clubs need to understand the type of players they have and the type of players that are on the market. Current insights in the playing style of players are mostly based on the opinions of human soccer experts such as trainers and scouts. Unfortunately, their opinions are inherently subjective and thus prone to faults. In this paper, we characterize the playing style of a player in a more rigorous, objective and data-driven manner. We characterize the playing style of a player using a so-called ‘player vector’ that can be interpreted both by human experts and machine learning systems. We demonstrate the validity of our approach by retrieving player identities from anonymized event stream data and present a number of use cases related to scouting and monitoring player development in top European competitions.

    PDF | DOI | BibTex

  • KDD

    Tom Decroos, Lotte Bransen, Jan Van Haaren, Jesse Davis

    Assessing the impact of the individual actions performed by soccer players during games is a crucial aspect of the player recruitment process. Unfortunately, most traditional metrics fall short in addressing this task as they either focus on rare actions like shots and goals alone or fail to account for the context in which the actions occurred. This paper introduces (1) a new language for describing individual player actions on the pitch and (2) a framework for valuing any type of player action based on its impact on the game outcome while accounting for the context in which the action happened. By aggregating soccer players' action values, their total offensive and defensive contributions to their team can be quantified. We show how our approach considers relevant contextual information that traditional player evaluation metrics ignore and present a number of use cases related to scouting and playing style characterization in the 2016/2017 and 2017/2018 seasons in Europe's top competitions.

    PDF | DOI | Arxiv | BibTex

  • IACSS

    Jesse Davis, Lotte Bransen, Tom Decroos, Pieter Robberechts, Jan Van Haaren

    A key question within sports analytics is how to analyze match data in order to objectively assess a player's performance during a match. This paper summarizes our recent attempts to address this question for soccer. First, we look at how to assign a value to each on-the-ball action a soccer player performs during a match. Second, we explore how these values depend on the level of mental pressure that the player experienced when performing the action. We conclude by briefly highlighting some potential applications of this work.

    DOI | BibTex

  • SSAC

    Lotte Bransen, Pieter Robberechts, Jan Van Haaren, Jesse Davis

    While most existing soccer performance metrics focus on players’ technical and physical performances, they typically ignore the mental pressure under which these performances were delivered. Yet, mental pressure is a recurrent concept in the analysis of players’ or teams’ performances. Hence, this paper takes a first step towards objectively understanding how high-mental pressure situations affect the performances and behavior of soccer players. We introduce an approach that compares soccer players’ performances across different levels of mental pressure. For each game situation, our approach uses a machine learned model to estimate how much mental pressure the player possessing the ball experiences using a combination of match context features and the current game state. Similarly, our approach uses machine learned models to evaluate three aspects of each action performed by the player: the choice of action, the execution of the chosen action, and the action’s expected contribution to the scoreline. We demonstrate the ability of our approach to provide actionable insights for soccer clubs in four relevant use cases: player acquisition, training, tactical decisions, and lineups and substitutions. For example, we identify Houssem Aouar and Xherdan Shaqiri as suitable replacements for Leicester City’s former star Riyad Mahrez. We also identify a large number of needless fouls under pressure as a fixable weakness of Orlando City’s striker Dom Dwyer. Since soccer players are often confronted with high-pressure situations, our metric provides insights in the link between pressure and performance that can provide soccer clubs a competitive advantage.

    PDF | URL | BibTex

  • Int J Sports Physiol Perform

    Tim Op De Beéck, Arne Jaspers, Michel S. Brink, Wouter G.P. Frencken, Filip Staes, Jesse J. Davis, Werner F. Helsen

    PURPOSE: The influence of preceding load on the future perceived wellness of professional soccer players is unexamined. This paper simultaneously evaluates the external load (EL) and internal load (IL) for different time frames in combination with presession wellness to predict future perceived wellness using machine learning techniques. METHODS: Training and match data were collected from a professional soccer team. The EL was measured using global positioning system technology and accelerometry. The IL was obtained using the rating of perceived exertion multiplied by duration. Predictive models were constructed using gradient-boosted regression trees (GBRT) and one naive baseline method. The individual predictions of future wellness items (i.e., fatigue, sleep quality, general muscle soreness, stress levels, and mood) were based on a set of EL and IL indicators in combination with presession wellness. The EL and IL were computed for acute and cumulative time frames. The GBRT model's performance on predicting the reported future wellness was compared with the naive baseline's performance by means of absolute prediction error and effect size. RESULTS: The GBRT model outperformed the baseline for the wellness items fatigue, general muscle soreness, stress levels, and mood. In addition, only the combination of EL, IL, and presession perceived wellness resulted in nontrivial effects for predicting future wellness. Including the cumulative load did not improve the predictive performances. CONCLUSIONS: The findings may indicate the importance of including both acute load and presession perceived wellness in a broad monitoring approach in professional soccer.
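    The internal load definition in the abstract (rating of perceived exertion multiplied by duration) lends itself to a small sketch. The function names, the use of minutes as the duration unit, and the window length used to mimic a "cumulative" time frame are assumptions for illustration; the abstract does not specify them.

```python
from typing import Sequence


def session_load(rpe: float, duration_min: float) -> float:
    """Internal load of one session: perceived exertion times duration.

    Minutes are an assumed unit; the abstract only says "duration".
    """
    return rpe * duration_min


def cumulative_load(session_loads: Sequence[float], window: int) -> float:
    """Sum of the most recent `window` session loads.

    `window` is a hypothetical parameter standing in for the paper's
    acute and cumulative time frames.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return sum(session_loads[-window:])
```

    For example, a 90-minute session rated 6 gives `session_load(6, 90) == 540`, and `cumulative_load([540, 300, 420], window=2)` sums the last two sessions to 720.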

    PDF | DOI | BibTex

  • Machine Learning

    Werner Dubitzky, Philippe Lopes, Jesse Davis, Daniel Berrar

    How well can machine learning predict the outcome of a soccer game, given the most commonly and freely available match data? To help answer this question and to facilitate machine learning research in soccer, we have developed the Open International Soccer Database. Version v1.0 of the Database contains essential information from 216,743 league soccer matches from 52 leagues in 35 countries. The earliest entries in the Database are from the year 2000, which is when football leagues generally adopted the “three points for a win” rule. To demonstrate the use of the Database for machine learning research, we organized the 2017 Soccer Prediction Challenge. One of the goals of the Challenge was to estimate where the limits of predictability lie, given the type of match data contained in the Database. Another goal of the Challenge was to pose a real-world machine learning problem with a fixed time line and a genuine prediction task: to develop a predictive model from the Database and then to predict the outcome of the 206 future soccer matches taking place from 31 March 2017 to the end of the regular season. The Open International Soccer Database is released as an open science project, providing a valuable resource for soccer analysts and a unique benchmark for advanced machine learning methods. Here, we describe the Database and the 2017 Soccer Prediction Challenge and its results.

    PDF | BibTex

  • Machine Learning

    Daniel Berrar, Philippe Lopes, Jesse Davis, Werner Dubitzky

    No abstract available

    PDF | BibTex

  • StatsBomb

    Pieter Robberechts

    Pressing is an essential part of defense in football. Broadly speaking, the goal is to quickly win the ball back by putting pressure on the player in possession. The successes of coaches like Guardiola, Klopp, Sarri and Pochettino in deploying a high press have increased the profile of and interest in this strategy. Yet, pressing is a phenomenon that has not yet received much attention from researchers in football analytics. Previous research has focused on the spatial aspect and the intensity of pressing, but we currently lack metrics to quantify its effectiveness in different contexts. This paper introduces a novel metric that quantifies the effectiveness of pressing in different game scenarios as a trade-off between the benefits of recovering the ball and the cost of leaving the defensive structure, which makes passing through the lines easier for the opposition. We show how our metric can be used in practice through a number of use cases in the 2018/19 season of Europe’s top leagues.

    PDF | YouTube | BibTex

  • StatsBomb

    Tom Decroos, Jesse Davis

    Valuing the actions a soccer player performs in a match is a crucial problem in soccer analytics. While many approaches have been proposed for this problem, a commonality among them is the need to build a model that can predict for a given game state the probability of a goal occurring in the near future. Often these works have two common shortcomings. First, the predictive models are often not thoroughly evaluated or may even be evaluated according to the wrong performance metric. Second, there is a tendency to sacrifice interpretability for performance. Hence, the models often yield no insight into why a given game state has a higher or lower probability of resulting in a goal. This paper analyzes VAEP, a recently proposed approach for valuing actions, and its model for estimating the probability of scoring in the near future. We discuss a number of design choices related to building this model and share insights on how to properly evaluate it. Finally, we replace VAEP’s complicated noninterpretable gradient boosting tree model that uses 151 features with a simpler interpretable Generalized Additive Model (GAM) using only 10 features. We find that the GAM offers nearly identical performance to the more complicated gradient boost model while being interpretable and offering insights into what characteristics of a game state have an effect on the probability of scoring a goal in the near future.

    PDF | YouTube | BibTex

  • Footwear

    Rud Derie, Pieter Robberechts, Pieter Van den Berghe, Joeri Gerlo, Dirk De Clercq, Veerle Segers, Jesse Davis

    No abstract available

    PDF | DOI | BibTex

  • Jan Van Haaren, Pieter Robberechts, Tom Decroos, Lotte Bransen, Jesse Davis

    No abstract available

    URL | BibTex

2018

  • MLSA

    Pieter Robberechts, Jesse Davis

    In this study we compare result-based Elo ratings and goal-based ODM (Offense Defense Model) ratings as covariates in an ordered logit regression and bivariate Poisson model to generate predictions for the outcome of the 2018 FIFA World Cup. To this end, we first estimate probabilities of match results between all competing nations. With an evaluation on the four previous World Cups between 2002 and 2014, we show that an ordered logit model with Elo ratings as a single covariate achieves the best performance. Secondly, via Monte Carlo simulations we compute each team's probability of advancing past a given stage of the tournament. Additionally, we apply our models on the Open International Soccer Database and show that our approach leads to good predictions for domestic league football matches as well.
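    As a rough illustration of how an ordered logit model turns a single rating covariate into win, draw and loss probabilities, consider the sketch below. The coefficient `beta` and the two cut points are made-up values, not the fitted parameters from the paper, and the covariate is assumed to be the Elo rating difference between the two teams.

```python
import math


def _logistic(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(elo_diff: float,
                        beta: float = 0.005,
                        cut_loss: float = -0.6,
                        cut_draw: float = 0.6) -> dict:
    """Outcome probabilities for the first team from an ordered logit model.

    The ordered outcomes are loss < draw < win. All parameter defaults
    are hypothetical and chosen only to make the example runnable.
    """
    eta = beta * elo_diff
    p_loss = _logistic(cut_loss - eta)          # P(outcome <= loss)
    p_loss_or_draw = _logistic(cut_draw - eta)  # P(outcome <= draw)
    return {
        "loss": p_loss,
        "draw": p_loss_or_draw - p_loss,
        "win": 1.0 - p_loss_or_draw,
    }
```

    Calling `ordered_logit_probs(100)` and `ordered_logit_probs(-100)` shows the expected behaviour: a higher rating difference shifts probability mass from "loss" towards "win", while the three probabilities always sum to one.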

    PDFDOIBibTex

  • PLOS ONE

    Arne De Brabandere, Tim Op De Beéck, Wannes Schütte, Benedicte Vanwanseele, Jesse Davis

    Maximal oxygen uptake (VO2max) is often used to assess an individual’s cardiorespiratory fitness. However, measuring this variable requires an athlete to perform a maximal exercise test which may be impractical, since this test requires trained staff and specialized equipment, and may be hard to incorporate regularly into training programs. The aim of this study is to develop a new model for predicting VO2max by exploiting its relationship to heart rate and accelerometer features extracted during submaximal running. To do so, we analyzed data collected from 31 recreational runners (15 men and 16 women) aged 19-26 years who performed a maximal incremental test on a treadmill. During this test, the subjects’ heart rate and acceleration at three locations (the upper back, the lower back and the tibia) were continuously measured. We extracted a wide variety of features from the measurements of the warm-up and the first three stages of the test and employed a data-driven approach to select the most relevant ones. Furthermore, we evaluated the utility of combining different types of features. Empirically, we found that combining heart rate and accelerometer features resulted in the best model with a mean absolute error of 2.33 ml ⋅ kg−1 ⋅ min−1 and a mean absolute percentage error of 4.92%. The model includes four features: gender, body mass, the inverse of the average heart rate and the inverse of the variance of the total tibia acceleration during the warm-up stage of the treadmill test. Our model provides a practical tool for recreational runners in the same age range to estimate their VO2max from submaximal running on a treadmill. It requires two body-worn sensors: a heart rate monitor and an accelerometer positioned on the tibia.

    PDFDOIBibTex

  • ECML/PKDD

    Tom Decroos, Kurt Schütte, Tim Op De Beéck, Benedicte Vanwanseele, Jesse Davis

    Patients with sports-related injuries need to learn to perform rehabilitative exercises with correct movement patterns. Unfortunately, the feedback a physiotherapist can provide is limited by the visitation frequency of the patient. We study the feasibility of a system that automatically provides feedback on correct movement patterns to patients using a Microsoft Kinect camera and Machine Learning techniques. We discuss several challenges related to the Kinect's proprietary software, the Kinect data's heterogeneity, and the Kinect data's temporal component. We introduce AMIE, a machine learning pipeline that detects the exercise being performed, the exercise's correctness, and if applicable, the mistake that was made. To evaluate AMIE, ten participants were instructed to perform three types of typical rehabilitation exercises (squats, forward lunges and side lunges) demonstrating both correct movement patterns and frequent types of mistakes, while being recorded with a Kinect. AMIE detects the type of exercise almost perfectly with 99% accuracy and the type of mistake with 73% accuracy.

    PDFDOIBibTex

  • KDD

    Tom Decroos, Jan Van Haaren, Jesse Davis

    Sports teams are nowadays collecting huge amounts of data from training sessions and matches. The teams are becoming increasingly interested in exploiting these data to gain a competitive advantage over their competitors. One of the most prevalent types of new data is event stream data from matches. These data enable more advanced descriptive analysis as well as the potential to investigate an opponent's tactics in greater depth. Due to the complexity of both the data and game strategy, most tactical analyses are currently performed by humans reviewing video and scouting matches in person. As a result, this is a time-consuming and tedious process. This paper explores the problem of automatic tactics detection from event-stream data collected from professional soccer matches. We highlight several important challenges that these data and this problem setting pose. We describe a data-driven approach for identifying patterns of movement that account for both spatial and temporal information which represent potential offensive tactics. We evaluate our approach on the 2015/2016 season of the English Premier League and are able to identify interesting strategies per team related to goal kicks, corners and set pieces.

    PDFDOIBibTex

  • KDD

    Tim Op De Beéck, Wannes Meert, Kurt Schütte, Benedicte Vanwanseele, Jesse Davis

    Running is extremely popular and around 10.6 million people run regularly in the United States alone. Unfortunately, estimates indicate that between 29% and 79% of runners sustain an overuse injury every year. One contributing factor to such injuries is excessive fatigue, which can result in alterations in how someone runs that increase the risk for an overuse injury. Thus being able to detect during a running session when excessive fatigue sets in, and hence when these alterations are prone to arise, could be of great practical importance. In this paper, we explore whether we can use machine learning to predict the rating of perceived exertion (RPE), a validated subjective measure of fatigue, from inertial sensor data of individuals running outdoors. We describe how both the subjective target label and the realistic outdoor running environment introduce several interesting data science challenges. We collected a longitudinal dataset of runners, and demonstrate that machine learning can be used to learn accurate models for predicting RPE.

    PDFDOIBibTex

  • Int J Sports Physiol Perform

    Arne Jaspers, Tim Op De Beéck, Michel Brink, Wouter Frencken, Filip Staes, Jesse Davis, Werner Helsen

    PURPOSE: Machine learning may contribute to understanding the relationship between the external load and internal load in professional soccer. Therefore, the relationship between external load indicators and the rating of perceived exertion (RPE) was examined using machine learning techniques on a group and individual level. METHODS: Training data were collected from 38 professional soccer players over two seasons. The external load was measured using global positioning system technology and accelerometry. The internal load was obtained using the RPE. Predictive models were constructed using two machine learning techniques, artificial neural networks (ANNs) and least absolute shrinkage and selection operator (LASSO), and one naive baseline method. The predictions were based on a large set of external load indicators. Using each technique, one group model involving all players and one individual model for each player was constructed. These models' performance on predicting the reported RPE values for future training sessions was compared to the naive baseline's performance. RESULTS: Both the ANN and LASSO models outperformed the baseline. Additionally, the LASSO model made more accurate predictions for the RPE than the ANN model. Furthermore, decelerations were identified as important external load indicators. Regardless of the applied machine learning technique, the group models resulted in equivalent or better predictions for the reported RPE values than the individual models. CONCLUSIONS: Machine learning techniques may have added value in predicting the RPE for future sessions to optimize training design and evaluation. Additionally, these techniques may be used in conjunction with expert knowledge to select key external load indicators for load monitoring.

    PDFDOIBibTex

2017

  • MLSA

    Tom Decroos, Jan Van Haaren, Vladimir Dzyuba, Jesse Davis

    An important task in sports analytics is to devise player-performance metrics that allow managers to take better-informed decisions. While several such metrics have been proposed for baseball, basketball, and ice hockey, this task has virtually remained unexplored to date for soccer. This paper presents an approach for automatically rating the actions performed by soccer players based on historical match data. The approach considers all player actions that contribute to a team’s offensive output and accounts for the context of the actions.

    PDFBibTex

  • MLSA

    Ruben Vroonen, Tom Decroos, Jan Van Haaren, Jesse Davis

    Projecting how a player’s skill level will evolve in the future is a crucial problem faced by sports teams. Traditionally, player projections have been evaluated by human scouts, who are subjective and may suffer from biases. More recently, there has been interest in automated projection systems such as the PECOTA system for baseball and the CARMELO system for basketball. In this paper, we present a projection system for soccer players called APROPOS which is inspired by the CARMELO and PECOTA systems. APROPOS predicts the potential of a soccer player by searching a historical database to identify similar players of the same age. It then bases its prediction for the target player’s progression on how the similar previous players actually evolved. We evaluate APROPOS on players from the five biggest European soccer leagues and show that it clearly outperforms a more naive baseline.

    PDFBibTex

  • AAAI

    Tom Decroos, Vladimir Dzyuba, Jan Van Haaren, Jesse Davis

    Sports broadcasters are continuously seeking to make their live coverages of soccer matches more attractive. A recent innovation is the “highlight channel,” which shows the most interesting events from multiple matches played at the same time. However, switching between matches at the right time is challenging in fast-paced sports like soccer, where interesting situations often evolve as quickly as they disappear again. This paper presents the POGBA algorithm for automatically predicting highlights in soccer matches, which is an important task that has not yet been addressed. POGBA leverages spatio-temporal event streams collected during matches to predict the probability that a particular game state will lead to a goal. An empirical evaluation on a real-world dataset shows that POGBA outperforms the baseline algorithms in terms of both precision and recall.

    PDFDOIBibTex

2016

  • MLSA

    Vincent Vercruyssen, Luc De Raedt, Jesse Davis

    Given the advances in camera-based tracking systems, many soccer teams are able to record data about the players' position during a game. Analysing these data is challenging, since they are fine-grained, contain implicit relational information between players, and contain the dynamics of the game. We propose the use of qualitative spatial reasoning techniques to address these challenges, and test our approach by learning a model for pass prediction over a real-world soccer dataset. Experimental evaluation shows that our approach is capable of learning meaningful models. Since we employ an inductive logic programming system to learn the model, it has the added benefit of producing interpretable rules.

    PDFBibTex

  • Large Scale Sports Analytics

    Jan Van Haaren, Siebe Hannosset, Jesse Davis

    This paper explores the task of automatic strategy detection from event-stream data collected from professional soccer matches. Concretely, we focus on discovering interesting event sequences that lead to an attempt on goal. We describe a data-driven approach for identifying patterns of movement that account for both spatial and temporal information which represent potential strategies.

    PDFBibTex

  • KDD

    Jan Van Haaren, Horesh Ben Shitrit, Jesse Davis, Pascal Fua

    This paper proposes a relational-learning based approach for discovering strategies in volleyball matches based on optical tracking data. In contrast to most existing methods, our approach permits discovering patterns that account for both spatial (that is, partial configurations of the players on the court) and temporal (that is, the order of events and positions) aspects of the game. We analyze both the men's and women's final match from the 2014 FIVB Volleyball World Championships, and are able to identify several interesting and relevant strategies from the matches.

    PDFBibTex

2015

  • IDA

    Jan Van Haaren, Vladimir Dzyuba, Siebe Hannosset, Jesse Davis

    In recent years, many professional sports clubs have adopted camera-based tracking technology that captures the location of both the players and the ball at a high frequency. Nevertheless, the valuable information that is hidden in these performance data is rarely used in their decision-making process. What is missing are the computational methods to analyze these data in great depth. This paper addresses the task of automatically discovering patterns in offensive strategies in professional soccer matches. To address this task, we propose an inductive logic programming approach that can easily deal with the relational structure of the data. An experimental study shows the utility of our approach.

    PDFDOIBibTex

  • MathSport International

    Jan Van Haaren, Jesse Davis

    In this paper, we investigate how accurately the final league tables of domestic football leagues can be predicted, both before the start of the season and during the course of the season. To this end, we perform an extensive empirical evaluation that compares two flavors of the well-established Elo-ratings and the recently introduced pi-ratings. We validate the different approaches using a large volume of historical match results from several European football leagues. We assess how well each ranking system performs on this task, investigate what is the most natural metric to measure the quality of a predicted final league table, and the minimum number of matches that needs to be played in order to yield useful predictions. We find that the proportion of correctly predicted relative positions is a natural metric to assess the quality of the predicted final league tables and that the traditional Elo-rating system performs well in most settings we considered.

    PDFBibTex

  • CW Reports

    Jan Van Haaren, Jesse Davis

    This report analyzes the performances of the clubs in the regular season of the 2014-2015 Belgian Pro League. The study computes a range of parameters that give an objective picture of their dominance, goal-directedness, attacking mindset, efficiency in front of goal and aggressiveness. The analysis of these parameters yielded a number of interesting findings. AA Gent was the most dominant team with 58.23% possession on average. Club Brugge was the most attacking team with 14.13 goal attempts per match on average. KV Kortrijk converted 52.94% of its shots on target into goals, making it the most efficient team. Mouscron-Péruwelz committed the most fouls per match on average (16.59) and received the most yellow cards per match on average (2.66).

    PDFBibTex

  • MathSport International

    Joris Renkens, Yuri Passchyn, Jesse Davis

    Since the success of the 2002 Oakland Athletics baseball team, the use of statistics to shape roster decisions has received growing attention in many major sports. This can take many forms such as identifying which statistics are most important for assessing a player's ability and projecting how a player's performance will transfer from one setting to another. This abstract focuses on the second task where computational approaches have been able to provide valuable insights that complement traditional scouting-based approaches. Examples of such systems include the PECOTA system that projects how minor league baseball players would perform in the major league and the SCHOENE projection system for basketball that predicts how players will transition from NCAA basketball (the American university competition) to the NBA, as well as the effect of age on players that are currently playing in the NBA. In this paper, we present an approach for this task for Belgian basketball. Each year, approximately ten players join a Belgian team directly after playing NCAA basketball. Since the Belgian league and the college game are so different, it is difficult to predict how college players will fare in the Belgian competition. We use five seasons' worth of data both in the Belgian competition and in the NCAA. Given a target player, we employ a nearest-neighbors-based approach to identify a set of similar players and use their statistics to predict the statistics of the player of interest. This method is not only useful for predicting the statistics of players entering the league, but can also be used to determine how aging will affect a player's performance.

    BibTex

2014

  • CW Reports

    Jan Van Haaren, Jesse Davis

    This report evaluates the performances of the countries in the 2014 FIFA World Cup group stage by comparing each country's expected performances with their actual performances. We investigate the performances in the individual matches as well as aggregates per country and per region. While the sample size is small, the statistics seem to confirm the perceptions that teams from the Americas have generally performed well and exceeded expectations.

    PDFBibTex

  • SenseML

    Toon Van Craenendonck, Tim Op De Beéck, Wannes Meert, Benedicte Vanwanseele, Jesse Davis

    Capturing the movements of a patient performing a rehabilitation exercise currently involves an extensive lab setup. The goal of this study is to investigate whether a 3D camera, such as the Microsoft Kinect (TM), can be used to monitor patients locally. Specifically we are interested in the lower limbs since most 3D camera algorithms focus on the upper body while for rehabilitation, the lower body is crucial. This paper presents two particle-filtering based algorithms for accurate tracking. The first algorithm estimates the configuration of the lower limbs simultaneously while the second one estimates the configuration of one limb at a time. We compare our estimates with a gold standard and find that we are able to recognize most movement characteristics. Furthermore, our approach is better at tracking the height of the foot and yields more stable tracking results than the NITE skeleton tracker.

    PDFBibTex

  • CW Reports

    Jan Van Haaren, Tim Op De Beéck, Jesse Davis

    This report contains a performance analysis of the play-offs of the Belgian football competition for the 2013-2014 season. The detailed statistics that the football website Soccerway made available for the Belgian competition for the first time this season were combined into thirteen parameters that provide insight into the dominance, attacking mindset, efficiency in front of goal and aggressiveness of each club. The report compares the performances of the clubs in play-off I and play-off II with those in the regular season.

    PDFBibTex

  • CW Reports

    Jan Van Haaren, Tim Op De Beéck, Jesse Davis

    This report contains a performance analysis of the regular Belgian football competition for the 2013-2014 season. The detailed statistics that the football website Soccerway made available for the Belgian competition for the first time this season were combined into thirteen parameters that provide insight into the dominance, attacking mindset, efficiency in front of goal and aggressiveness of each club. The analysis led to a number of striking results. Standard finished the regular season as leader, but ranks only tenth on the list of most dominant clubs. However, the club compensates for its lower share of possession with enormous goal-directedness and does better in that respect than all other first-division clubs. OH Leuven leads the ranking of red cards received by a wide margin, while Anderlecht's players are shown a yellow card remarkably rarely.

    PDFBibTex

2013

  • LStat

    Jan Van Haaren, Albrecht Zimmermann, Joris Renkens, Guy Broeck, Tim Op De Beéck, Wannes Meert, Jesse Davis

    No abstract available

    PDFBibTex

  • IDA

    Vladimir Dzyuba, Matthijs Leeuwen

    Although subgroup discovery aims to be a practical tool for exploratory data mining, its wider adoption is hampered by redundancy and the re-discovery of common knowledge. This can be remedied by parameter tuning and manual result filtering, but this requires considerable effort from the data analyst. In this paper we argue that it is essential to involve the user in the discovery process to solve these issues. To this end, we propose an interactive algorithm that allows a user to provide feedback during search, so that it is steered towards more interesting subgroups. Specifically, the algorithm exploits user feedback to guide a diverse beam search. The empirical evaluation and a case study demonstrate that uninteresting subgroups can be effectively eliminated from the results, and that the overall effort required to obtain interesting and diverse subgroup sets is reduced. This confirms that within-search interactivity can be useful for data analysis.

    PDFDOIBibTex

  • MLSA

    Albrecht Zimmermann, Sruthi Moorthy, Zifan Shi

    Most existing work on predicting NCAAB matches has been developed in a statistical context. Trusting the capabilities of ML techniques, particularly classification learners, to uncover the importance of features and learn their relationships, we evaluated a number of different paradigms on this task. In this paper, we summarize our work, pointing out that attributes seem to be more important than models, and that there seems to be an upper limit to predictive quality.

    ArxivBibTex

2012

2011

  • ILP

    Jan Van Haaren, Guy Broeck

    Association football has recently seen some radical changes, leading to higher financial stakes, further professionalization and technical advances. This gave rise to large amounts of data becoming available for analysis. Therefore, we propose football-related predictions as an interesting application for relational learning. We argue that football data is highly structured and most naturally represented in a relational way. Furthermore, we identify interesting learning tasks which require a relational approach, such as link prediction or structured output learning. Early experiments show that this relational approach is competitive with a propositionalized approach for the prediction of individual football matches’ goal difference.

    PDFBibTex