Women’s FIFA World Cup 2019: some game data

I have been using some basic R code to look at game data from the 2019 FIFA Women’s World Cup. I have tried out some of the colourblind-friendly palettes too (link).

Some of my data: FIFA provided a record in minutes of actual game time and total game time. I used the number of fouls awarded by referees as a background theme:

There were some temperature data and I used these to look at how much the ball was out of play (in minutes):

There were some humidity data:

The median time for ball in play at the 2019 FIFA Women’s World Cup was 55 minutes. I looked at three types of games: those below the median; those above the median; and the three games that went to extra time:
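A minimal sketch of this kind of grouping, written in Python rather than R. The match labels and minutes below are made up for illustration and are not the FIFA data; treating extra-time games as their own group regardless of ball-in-play time, and giving games exactly at the median their own label, are assumptions.

```python
from statistics import median

# Hypothetical records: (match label, ball-in-play minutes, went to extra time).
# Illustrative values only; these are not the tournament data.
matches = [
    ("A v B", 52.0, False),
    ("C v D", 58.5, False),
    ("E v F", 55.0, False),
    ("G v H", 71.0, True),
    ("I v J", 49.5, False),
]

def label_match(minutes_in_play, extra_time, median_minutes):
    """Assign a match to one of the groups used for comparison."""
    if extra_time:
        # Assumption: extra-time games are grouped separately.
        return "extra time"
    if minutes_in_play < median_minutes:
        return "below median"
    if minutes_in_play > median_minutes:
        return "above median"
    return "at median"

# Median ball-in-play time across all matches in the sample.
median_minutes = median(minutes for _, minutes, _ in matches)

# Map each match label to its group.
labels = {
    name: label_match(minutes, extra_time, median_minutes)
    for name, minutes, extra_time in matches
}

for name, group in labels.items():
    print(f"{name}: {group}")
```

Whether the median should be taken over all games or only over games decided in normal time is a choice to make before grouping; the sketch uses all games.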

My data for these visualisations are shared in a GitHub repository (link).

Performance and Analytic Narratives

I have continued thinking about performance issues raised by Craig Duncan (link), Dave Reddin and Tony Strudwick (link).

This thinking has coincided with me revisiting the literature on analytic narratives (link). One of the contributors to this literature is Margaret Levi (link). She wrote:


The narrative of analytic narratives establishes the actual and principal players, their goals, and their preferences while also illuminating the effective rules of the game, constraints, and incentives. Narrative is the story being told but as a detailed and textured account of context and process, with concern for both sequence and temporality. (2002:112)

I do monitor a range of performances in a variety of sports. For a long time now, I have been wondering about how the data lead me to a variety of institutional approaches to performance and the vectors that lead to success (or the opposite).

One of my examples is the use of Elo Ratings in a football code to contemplate cultures of success. In this first visualisation, I am keen to look at the long-term performance of teams (their class). I have chosen three teams: one a consistently successful gold standard, a second near the median performance for the code, and a third trying to find its way in a very competitive world:
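As background, here is a minimal Python sketch of the standard Elo update rule. The K-factor of 20 and the starting rating of 1500 are assumptions chosen for illustration; they are not necessarily the parameters behind the ratings plotted here, and football-specific variants (home advantage, margin of victory) are left out.

```python
def expected_score(rating_a, rating_b):
    """Expected score for team A against team B on the usual 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_ratings(rating_a, rating_b, score_a, k=20.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is an assumed K-factor controlling how quickly ratings move.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta

# Two evenly rated teams; A wins.
new_a, new_b = update_ratings(1500.0, 1500.0, 1.0)
print(new_a, new_b)
```

A larger K makes the rating respond faster to recent results, while a smaller K gives a smoother, longer-term measure.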

A second visualisation, with the same three teams, looks at current form:

Both visualisations enable me to think about performance environments and allow me to integrate the observations Craig, Dave and Tony have made in the context of narratives about performance.

In this context, I thought Margaret Levi’s conclusion was very relevant to conversations about performance in sport:


Analytic narratives involve choosing a problem or puzzle, then building a model to explicate the logic of the explanation and to elucidate the key decision points and possibilities, and finally evaluating the model through comparative statics and the testable implications the model generates. (2002:113)

I do see this puzzling as central to our conversations about performance and the opportunities we have to compare empirical events with theories about performance.

Collecting data from the Women’s Football World Cup 2019

I am delighted there are so many ways to collect data from this year’s Women’s World Cup football tournament (FIFA link). I have data going back to the 2011 Tournament (link). These are shared on my blog and on Google Sheets for ease of access. I have posted some early tweets on Twitter too (link) and I have a GitHub repository (link). I am monitoring the hashtag #FIFAWWC (link).

Overnight, I started searching for other links to the data.

I noticed a feed from the FAW Analysis Office in Newport announcing work on their analysis of the Tournament (link). The same night, I heard of WomensFootyStat (link) and their use of Opta data to provide “shotmaps, xG timeline, and pass maps for now. Tweeting manually for now but hoping to automate soon and get them out straight after each game”. Some weeks earlier, I heard about the StatsBomb Open Data project and the availability of data on their GitHub account (link). Ron Smith (link) and Ben Mayhew (link) are tweeting regularly.

I am trying to learn more about R and RStudio during this World Cup and have started searching for other papers on this topic. I was delighted to find a post on R-bloggers by Achim Zeileis (link) that referred in detail to the paper by Andreas Groll and colleagues (2019), “Hybrid Machine Learning Forecasts for the FIFA Women’s World Cup 2019”, arXiv:1906.01131, arXiv.org E-Print Archive (link).

The abstract of their paper notes that they combine:

two different ranking methods together with several other predictors in a joint random forest approach for the scores of soccer matches. The first ranking method is based on the bookmaker consensus, the second ranking method estimates adequate ability parameters that reflect the current strength of the teams best. The proposed combined approach is then applied to the data from the two previous FIFA Women’s World Cups 2011 and 2015. Finally, based on the resulting estimates, the FIFA Women’s World Cup 2019 is simulated repeatedly and winning probabilities are obtained for all teams. The model clearly favors the defending champion USA before the host France.
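The final step in that abstract, simulating the tournament repeatedly to obtain winning probabilities, can be illustrated with a toy Monte Carlo sketch in Python. This is not the authors' random forest model: the four teams, their strength values and the single-elimination bracket are all hypothetical, and a simple strength ratio stands in for their predicted scores.

```python
import random
from collections import Counter

# Hypothetical team strengths (not estimates from the paper).
strengths = {"Team A": 4.0, "Team B": 3.0, "Team C": 2.0, "Team D": 1.0}

def play(team_1, team_2, rng):
    """Return the winner; each team's win chance is proportional to its strength."""
    p_team_1 = strengths[team_1] / (strengths[team_1] + strengths[team_2])
    return team_1 if rng.random() < p_team_1 else team_2

def simulate_knockout(teams, rng):
    """Play one single-elimination bracket and return the champion.

    Assumes the number of teams is a power of two.
    """
    remaining = list(teams)
    while len(remaining) > 1:
        # Adjacent teams in the list meet in each round.
        remaining = [
            play(remaining[i], remaining[i + 1], rng)
            for i in range(0, len(remaining), 2)
        ]
    return remaining[0]

def winning_probabilities(teams, n_runs=10_000, seed=2019):
    """Estimate each team's title chance as its share of simulated wins."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wins = Counter(simulate_knockout(teams, rng) for _ in range(n_runs))
    return {team: wins[team] / n_runs for team in teams}

probabilities = winning_probabilities(list(strengths))
for team, p in probabilities.items():
    print(f"{team}: {p:.3f}")
```

More simulated runs reduce the sampling noise in the estimated probabilities, at the cost of run time.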

I need to look up this paper in more detail now and hope to find lots more connections and open sharing, including the Hope Floats blog (link) that is collecting as many voices as it can about the World Cup in sixty days. I will be keeping an eye on what Simon Gleave is doing too (link).