I was following live data from the Australia v Brazil group game at the 2019 FIFA Women’s World Cup (link).
It was the first time I had noticed that a Live Win Probability was being used in this way and I decided to track it with a Google Sheet of my own (link).
By the time I had reached the graph, Brazil had scored a penalty in the 27th minute (Marta) and a second goal in the 38th minute (Cristiane). I am using an Elo measurement at this Tournament to assess probability, based on not losing if a higher rated team scores first. After these two goals, the probability of Brazil not losing was moving towards at least 0.8 out of 1.
As I watched the graph progress towards half time, Australia scored (Foord). I wondered how the probability graph might respond … and what both coaches might do at half time. I thought the late Australian goal in the half might add an interesting stage to the probability of the game outcome, given that Australia was the higher rated Elo team in this game.
At the start of the second half, the Brazilian coach replaced Marta with Ludmila (a loss of approximately 120 caps) and Formiga with Luana (Luana Bertolucci Paixão) (a loss of approximately 150 caps). Australia made no changes. Brazil conceded a long range goal in the 58th minute (Logarzo) and an own goal in the 66th minute (the goal was confirmed by the Video Assistant Referee (VAR, link)). Australia won the game 3v2.
I wondered how we might factor these dynamics into our visualisations, and how we might augment our machine intelligence with a reciprocal understanding of game playing in an activity that has its own momentum as well as a general time series momentum. A paper by Michael Lopez and his colleagues (link) has set me off thinking about these dynamics.
Given the growth in the use of these visualisations, I do think these are very important conversations to be having now.
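As a way of thinking about the Elo measurement mentioned above, here is a minimal sketch of the standard Elo expected score calculation. It is not the Live Win Probability model, and it is not necessarily the calculation behind my own not-losing measure. The ratings are hypothetical values chosen only for illustration, and `elo_expected_score` is a name I have made up for this example.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for team A against team B (between 0 and 1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings, chosen only for illustration (not real team ratings).
higher_rated = 2000.0
lower_rated = 1900.0

expected_higher = elo_expected_score(higher_rated, lower_rated)
expected_lower = elo_expected_score(lower_rated, higher_rated)

# The two expected scores always sum to 1.
print(round(expected_higher, 3), round(expected_lower, 3))
```

The expected score folds draws into a single number (a draw counts as half a win), so it is not the same thing as a probability of not losing. It is a pre-game baseline that an in-game model would then update as goals are scored.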
I am delighted there are so many ways to collect data from this year’s Women’s World Cup football tournament (FIFA link). I have data going back to the 2011 Tournament (link). These are shared on my blog and on Google Sheets for ease of access. I have posted some early tweets on Twitter too (link) and I have a GitHub repository (link). I am monitoring the hashtag #FIFAWWC (link).
Overnight, I started searching for other links to the data.
I noticed a feed from the FAW Analysis Office in Newport announcing work on their analysis of the Tournament (link). The same night, I heard of WomensFootyStat (link) and their use of Opta data to provide “shotmaps, xG timeline, and pass maps for now. Tweeting manually for now but hoping to automate soon and get them out straight after each game”. Some weeks earlier, I heard about the StatsBomb Open Data project and the availability of data on their GitHub account (link). Ron Smith (link) and Ben Mayhew (link) are tweeting regularly.
I am trying to learn more about R and RStudio during this World Cup and have started searching for other papers on this topic. I was delighted to find a post on RBloggers by Achim Zeileis (link) that referred in detail to a paper by Andreas Groll and colleagues (2019), “Hybrid Machine Learning Forecasts for the FIFA Women’s World Cup 2019”, arXiv:1906.01131, arXiv.org E-Print Archive (link).
The abstract of their paper notes that they combine:
two different ranking methods together with several other predictors in a joint random forest approach for the scores of soccer matches. The first ranking method is based on the bookmaker consensus, the second ranking method estimates adequate ability parameters that reflect the current strength of the teams best. The proposed combined approach is then applied to the data from the two previous FIFA Women’s World Cups 2011 and 2015. Finally, based on the resulting estimates, the FIFA Women’s World Cup 2019 is simulated repeatedly and winning probabilities are obtained for all teams. The model clearly favors the defending champion USA before the host France.
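The final step in that abstract, simulating the tournament repeatedly to obtain winning probabilities, can be sketched in a few lines. This is a toy illustration only and not the authors’ model: there is no random forest, no bookmaker consensus and no group stage. The team names, the strength values, the `strength_a / (strength_a + strength_b)` win rule and the function names are all my own assumptions, and the bracket assumes the number of teams is a power of two.

```python
import random
from collections import Counter


def win_probability(strength_a: float, strength_b: float) -> float:
    """Assumed rule: chance that A beats B is proportional to relative strength."""
    return strength_a / (strength_a + strength_b)


def simulate_knockout(teams, strengths, rng):
    """Play one single-elimination bracket and return the winner."""
    remaining = list(teams)
    while len(remaining) > 1:
        next_round = []
        # Pair neighbours in the list: (0, 1), (2, 3), ...
        for i in range(0, len(remaining), 2):
            a, b = remaining[i], remaining[i + 1]
            p = win_probability(strengths[a], strengths[b])
            next_round.append(a if rng.random() < p else b)
        remaining = next_round
    return remaining[0]


def estimate_winning_probabilities(teams, strengths, n_runs=10000, seed=1):
    """Repeat the simulation and return each team's share of tournament wins."""
    rng = random.Random(seed)
    wins = Counter(simulate_knockout(teams, strengths, rng) for _ in range(n_runs))
    return {team: wins[team] / n_runs for team in teams}


# Hypothetical teams and strengths, for illustration only.
teams = ["Team A", "Team B", "Team C", "Team D"]
strengths = {"Team A": 4.0, "Team B": 2.0, "Team C": 2.0, "Team D": 1.0}

probabilities = estimate_winning_probabilities(teams, strengths)
for team, probability in probabilities.items():
    print(team, round(probability, 3))
```

With more runs the estimated shares settle down, which is the reason for simulating repeatedly rather than once.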
I need to look up this paper in more detail now and hope to find lots more connections and open sharing, including the Hope Floats blog (link) that is collecting as many voices as it can about the World Cup in sixty days. I will be keeping an eye on what Simon Gleave is doing too (link).
The Women’s FIFA World Cup is underway in France. There is an official web site (link) to support the Tournament.
I have started to build a repository on GitHub (link) for the data generated by FIFA. I am using some very basic code for my RStudio record of the games played (link).
For my first look at the data, I have monitored: ball in play (in minutes); total game time (in minutes); and weather data. I am using FIFA as the authoritative source of these data.
My three visualisations are:
Ball in Play by Country of Origin of Referee
Impact of Humidity on Game Time
Impact of Temperature on Game Time
I am hopeful that I will find lots of ways to explore the FIFA data. At the moment, I am particularly interested in game time played in minutes (a median time of 53 minutes after seven games). My Google Sheet (link) aims to share data from the Tournament and follows on from a format used in 2015 (link).
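For anyone who wants to try the same kind of first look, here is a minimal sketch of a median game time and a simple correlation between humidity and game time. The numbers are placeholders I have made up for illustration and are not the FIFA data. The variable names and the `pearson_correlation` helper are also my own assumptions.

```python
from statistics import mean, median


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


# Placeholder values only: NOT the recorded Tournament data.
minutes_played = [48.0, 51.5, 55.0, 56.5, 50.0, 58.0, 53.5]
humidity_percent = [60.0, 45.0, 70.0, 55.0, 40.0, 65.0, 50.0]

median_minutes = median(minutes_played)
humidity_correlation = pearson_correlation(humidity_percent, minutes_played)

print("Median minutes played:", median_minutes)
print("Humidity v minutes correlation:", round(humidity_correlation, 3))
```

With only a handful of games, a correlation like this is a prompt for a closer look rather than evidence of an effect.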