Predicting the Outcome of the 2014 FIFA World Cup

Introduction

Thirty-two teams have qualified for the 2014 FIFA World Cup in Brazil.

The teams are allocated into the following Groups (with links to EA Infographics for each team). The number in brackets by each team is their FIFA Ranking on 8 May 2014 (a new ranking will be published on 6 June.)

Group A – Brazil (4), Croatia (20), Mexico (19), Cameroon (50)
Group B – Spain (1), Netherlands (15), Chile (13), Australia (59)
Group C – Colombia (5), Greece (10), Ivory Coast (21), Japan (47)
Group D – Uruguay (6), Costa Rica (34), England (11), Italy (9)
Group E – Switzerland (8), Ecuador (28), France (16), Honduras (30)
Group F – Argentina (7), Bosnia Herzegovina (25), Iran (37), Nigeria (44)
Group G – Germany (2), Portugal (3), Ghana (38), United States (14)
Group H – Belgium (12), Algeria (25), Russia (18), Korea Republic (55)

The Elo Rankings of these teams are:

Group A – Brazil (1), Croatia (21), Mexico (16), Cameroon (56)
Group B – Spain (2), Netherlands (5), Chile (10), Australia (33)
Group C – Colombia (8), Greece (20), Ivory Coast (22), Japan (24)
Group D – Uruguay (9), Costa Rica (32), England (6), Italy (11)
Group E – Switzerland (16), Ecuador (19), France (12), Honduras (45)
Group F – Argentina (4), Bosnia Herzegovina (25), Iran (34), Nigeria (30)
Group G – Germany (3), Portugal (7), Ghana (38), United States (13)
Group H – Belgium (14), Algeria (53), Russia (15), Korea Republic (42)

Predictions

Goal Scorers

Back in December 2013, Opta identified ten contenders for the leading goalscorer at the World Cup:

  1. Neymar (Brazil)
  2. Lionel Messi (Argentina)
  3. Radamel Falcao (Colombia)
  4. Cristiano Ronaldo (Portugal)
  5. Luis Suarez (Uruguay)
  6. Fred (Brazil)
  7. Robin van Persie (Netherlands)
  8. Sergio Aguero (Argentina)
  9. Gonzalo Higuain (Argentina)
  10. Mesut Ozil (Germany)

Winning Team

Goldman Sachs has published a prediction of the winning World Cup team. Their approach to prediction includes:

  • A stochastic model of the outcomes for each of the 64 World Cup games.
  • A regression analysis of all full international games from 1960 (using goals scored).
  • Difference in Elo rankings between both teams (“the most powerful variable in the model”).
  • A country-specific dummy variable relating to World Cup play.
  • Home advantage (country and continent).
  • A Monte Carlo simulation with 100,000 draws.

The Goldman Sachs model “does not use any information on the quality of teams or individual players that is not reflected in a team’s track record”.  The approach is “purely statistical”.
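As a rough illustration of how a Monte Carlo simulation can turn a rating difference into match probabilities, here is a minimal sketch for a single game. It is not the Goldman Sachs model: the link between Elo difference and expected goals (`expected_goals`, with its `base_rate` and `sensitivity` parameters) is an assumption made up for this example, as are the function names. Goals are drawn from a Poisson distribution, which is a common but not the only choice.

```python
import math
import random


def expected_goals(elo_diff, base_rate=1.3, sensitivity=0.002):
    """Map an Elo difference to a mean goal count.

    Illustrative log-linear link with placeholder parameters,
    not a fitted regression.
    """
    return base_rate * math.exp(sensitivity * elo_diff)


def sample_poisson(mean, rng):
    """Draw one Poisson-distributed goal count (Knuth's multiplication method)."""
    threshold = math.exp(-mean)
    goals, product = 0, rng.random()
    while product > threshold:
        goals += 1
        product *= rng.random()
    return goals


def simulate_match(elo_a, elo_b, draws=100_000, seed=2014):
    """Estimate win/draw/loss probabilities for team A by repeated sampling."""
    rng = random.Random(seed)
    mean_a = expected_goals(elo_a - elo_b)
    mean_b = expected_goals(elo_b - elo_a)
    counts = {"a_win": 0, "draw": 0, "b_win": 0}
    for _ in range(draws):
        goals_a = sample_poisson(mean_a, rng)
        goals_b = sample_poisson(mean_b, rng)
        if goals_a > goals_b:
            counts["a_win"] += 1
        elif goals_a < goals_b:
            counts["b_win"] += 1
        else:
            counts["draw"] += 1
    return {outcome: n / draws for outcome, n in counts.items()}
```

Calling `simulate_match(2100, 1900)` (two made-up ratings) returns the estimated share of draws in which each outcome occurred. A full tournament simulation would repeat this for every fixture and follow the bracket.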

This approach yields the following outcomes.

Exhibit 3

Their knockout stage scenario is:

KO
The model predicts that Brazil, Germany, Argentina and Spain will reach the semifinals, and that Brazil will beat Argentina in the final. Goldman Sachs propose to “update these predictions after every game of the tournament on our portal”.

Andrew Yuan shared his World Cup predictions on Visual.ly earlier this week. He investigated factors “that are measurable, available, and can be good indicators of a match outcome”.

He has provided a detailed account of his methodology on GitHub. Andrew looked at the outcomes of 13,337 official FIFA matches played since 1994 involving the 2014 World Cup teams. He considered each team’s relative position in the FIFA ranking tables, the location where the match took place (home, away or neutral venue) and the proportion of matches won. His model uses logistic regression.
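To make the logistic regression idea concrete, here is a small sketch of the prediction step only. It uses two of the factors mentioned (ranking difference and venue) and leaves out the proportion of matches won. The coefficients are placeholders chosen for illustration, not values fitted by Andrew Yuan, and the function name and venue encoding are assumptions.

```python
import math


def win_probability(rank_diff, venue, intercept=0.0, rank_coef=0.03, home_coef=0.4):
    """Logistic model: P(win) = 1 / (1 + exp(-z)).

    rank_diff: opponent's FIFA rank minus the team's own rank
               (positive means the team is ranked higher).
    venue:     +1 for home, 0 for neutral, -1 for away.
    The coefficients are hypothetical placeholders, not fitted values.
    """
    z = intercept + rank_coef * rank_diff + home_coef * venue
    return 1.0 / (1.0 + math.exp(-z))


# Example with the 8 May FIFA rankings quoted above:
# Brazil (4) at home to Croatia (20) gives rank_diff = 20 - 4 = 16.
p_brazil = win_probability(rank_diff=16, venue=1)
```

In a real analysis the coefficients would be estimated from the historical match data rather than set by hand.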

He has Brazil as his probable winners. I liked his link with the FIFA Ranking.

AY

David Dormagen (2014) has presented a very clear account of a simulation model he developed to predict the outcome of the 2014 FIFA World Cup. His approach allows for the “integration of rating systems and rules where either no clear formula for a probability other than a win or loss exists or where the historical data is not enough to derive such a formula”. In addition, “We are also able to combine the results from different rating methods with user-given weights without influencing other calculations, such as the calculation of the draw-probability, the adjustment of the win expectancy for home teams, or the calculation of the expected goals”.
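One simple way to picture combining rating methods with user-given weights is a weighted mean of the win expectancy each method produces. The sketch below assumes that form. It is not taken from David's simulator, and the function name, method names and numbers are hypothetical.

```python
def combine_win_expectancies(expectancies, weights):
    """Weighted mean of per-method win expectancies (each between 0 and 1).

    expectancies: mapping of rating-method name to win expectancy.
    weights:      mapping of rating-method name to a user-given weight.
    """
    total_weight = sum(weights[name] for name in expectancies)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(expectancies[name] * weights[name] for name in expectancies)
    return weighted_sum / total_weight


# Hypothetical inputs: two rating methods that disagree about the same match.
combined = combine_win_expectancies(
    {"elo": 0.6, "fifa": 0.4},
    {"elo": 3.0, "fifa": 1.0},
)
```

Because the weights are normalised inside the function, changing them shifts the combined expectancy without touching any later step such as a draw-probability calculation.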

After 100,000 iterations of his simulator, David identified the following outcome (% is the number of times a team had a certain rank in the tournament):

Final

David points out that the simulation followed the official Tournament rules, so the resulting distribution for each team takes into account Tournament-specific attributes such as the possibility of meeting stronger opponents in most of the matches. He adds “there are four clear favourites for the first place: Brazil, Spain, Argentina and Germany”.

Winning with or against the odds?

Three different models have identified Brazil as the probable winners of the 2014 World Cup. Two agree on the four semi-finalists. Andrew Yuan has Portugal in his four ahead of Argentina.

What will be fascinating is whether any team can outperform their ‘destiny’.

The Elo Ratings on 29 May 2014 were:

ELO

From an analysis perspective any performance that overcomes these probabilities will be great to examine in detail.

Ray Stefani has provided some additional food for thought. He has looked at FIFA Rankings as his guide and has examined the performance of the top four ranked teams going into each Tournament since 1994.

Rank matrix

Ray’s summary is “No top rated team won, presenting a challenge for Spain in 2014. The second-rated teams won twice, good news for Germany. The fourth-rated teams were second twice, with host nation Brazil currently being fourth. Brazil is particularly well placed to win, given a host-nation advantage”.

France was ranked 18th in the world when it won as hosts in 1998.

This is the betting market on Oddschecker on 29 May for teams to reach the World Cup Final:

Odds

There is some fascinating ensemble convergence here. Can any team outside the five identified find the momentum to win the World Cup away from home?

Photo Credit

World Cup (Marcel Canfield, CC BY-NC 2.0)

Brazil World Cup 1982 (Oyosan, CC BY-ND 2.0)

Postscript

Simon Gleave alerted me to Infostrada’s predictions for the World Cup. Infostrada “is developing various methodologies to forecast major sporting tournaments by implementing various techniques. The methodology we have used to forecast the Football World Cup is based on the Elo rating system”.

This approach:

  • Is based on all historical match results from all teams.
  • Updates the rating after each match to show current strength.
  • Teams gain points when winning and lose points when losing.
  • More points are gained for beating stronger opponents.
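The points above describe the standard Elo update, which can be sketched as follows. This is the textbook form, not Infostrada's implementation: the constant `k=20` is an assumption, and football variants of Elo often adjust it for match importance or goal difference.

```python
def expected_score(rating, opponent_rating):
    """Expected result (between 0 and 1) for a team under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400))


def update_ratings(rating_a, rating_b, score_a, k=20):
    """Return the new (rating_a, rating_b) after one match.

    score_a: 1 for a win by team A, 0.5 for a draw, 0 for a loss.
    k:       update size; assumed constant here.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

With two equally rated teams the winner gains half of `k`. A lower-rated team that wins gains more, because its expected score was smaller.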

The outcome of Infostrada’s analysis of the knockout phase is:

KnockoutResults1

Infostrada will be updating their model throughout the Tournament.

I note that the closest game in the final sixteen games is Uruguay v Colombia. The winner of this game meets Brazil. The four semi-finalists are Brazil, Germany, Spain and Argentina.

Other sources of information

Simon Gleave’s Twitter observations on Goldman Sachs (and links to other criticisms).

James Grayson posted this response to the Goldman Sachs model on 29 May.

Calcio Cassini has looked at Examining World Cup Conventional Wisdom on 28 May.

Steve Fenn’s Twitter observations.

Predictive Analytics

iSportConnect and Paper.Li brought me two predictive analytics stories this morning.

The iSportConnect link shared news of the Rugby Football Union’s partnership with IBM.

IBM has become the Official Analytics Partner for the RFU and “will implement an analytics solution to provide fans with real-time insights into the game, including information about individual performance by players – the IBM TryTracker”.

IBM’s Predictive Analytics software “will analyse historic and current rugby data provided by Opta” and aims to “give viewers access to insights that will heighten their understanding of what to watch for in each game and explain what needs to be done to increase the likelihood of a team win against specific opponents”.

K2G

The IBM TryTracker will include the ‘Keys to the Game’, which will “provide play-by-play insights during the game, and predict three crucial areas of performance specific to each team ahead of match day”. The data for the Tracker will be collected by Opta for all England internationals and will be analysed by IBM, before being hosted on RFU.com.

The platform will also:

  • Visualise ‘Momentum’
  • Identify ‘Key Influencers’

IBM’s service builds on work developed in tennis tournaments. (I posted about the Wimbledon SlamTracker last year.)

Paper.Li brought news that “researchers have created software that predicts when and where disease outbreaks might occur based on two decades of New York Times articles and other online data”. An MIT Technology Review post by Tom Simonite provided details of the prototype software.

  • It uses 22 years of New York Times archives (1986-2007)
  • Draws on data from the Web to learn about what leads up to major news events (including DBpedia, WordNet, and OpenCyc)

This blend of resources supports the development of general rules for what events precede others.

The post highlights another startup company, Recorded Future, that makes predictions about future events “harvested from forward-looking statements online and other sources”. In a post about the company last December, Tom Simonite reported that search results “are compiled using a constantly updated index of ‘streaming data’, including news articles, filings with government regulators, Twitter updates, and transcripts from earnings calls or political and economic speeches”.

Recorded Future uses linguistic algorithms to identify specific types of events and can track the overall tone that news coverage and blog entries take. (A video about Recorded Future.)

RF

Opta Data as a Secondary Data Source

I have written a number of posts about the availability of web sites that provide secondary data to analyse performance in sport.

I have another source to share.

Earlier this week Sachin Nakrani of the Guardian UK posted a story about Opta.

Opta was formed in 1996 “by a group of friends who used to log match statistics while watching play unfold at their local pub”. Sachin reports that:

the company is now global, collecting data on 30 sports across 70 countries from their nine offices in Europe, the United States and Australia. The hub of the operation is in London, with Opta’s head office, located in spitting distance of Waterloo station, a hive of activity and anticipation for the new Premier League season. Once again it will be they who stock newspaper, radio and television coverage of top-flight fixtures with statistics, covering everything from goals scored to total flick-ons made by the losing team’s centre-forward.

The company offers Data Scout and Video Scout services. The Video Scout service draws on the 25,000 video records in the company’s archive.

I like Opta’s tag line:

Photo Credit

Data Centre