Basketball: archives and insights

On 19 December 2017, Google Cloud announced that it had become the official cloud partner for the National Collegiate Athletic Association (NCAA).

In the announcement, it was reported that:

the NCAA is migrating 80+ years of historical and play-by-play data, from 90 championships and 24 sports, to Google Cloud Platform

One of the first activities planned was to explore basketball data in preparation for the NCAA’s Women’s and Men’s Division I Basketball Tournaments held in March and April 2018 (March Madness).

More information about the partnership appeared in two posts on 30 March 2018. In the first post, Courtney Blacker reported a months-long experiment “to apply Google’s technologies to the NCAA’s treasure trove of data”.

We assembled a team of technicians, data scientists, and basketball enthusiasts (we call them ‘The Wolfpack’) who built a data processing workflow using Google Cloud Platform technologies like BigQuery and Cloud Datalab.

The aim of this approach was “to build models that look at influential factors on team performance”. During the tournaments, the Google Cloud team planned to “use our workflow to analyze our observations from the first half of each game against NCAA historical data to hone in on a stat-based prediction for the second half that we think is highly probable”. These predictions would be presented in a television advert during the half-time break.

An example from the Kansas v Villanova semi-final game:

The video suggested there would be at least 26 assists in the second half (there were 28) and 55 shot attempts (there were 64). In the second semi-final, Michigan v Loyola-Chicago, the predictions were for 37 three-point attempts (there were 38) and 29 rebounds (there were 29).

The final had these suggestions:

The second post about the Google Cloud and NCAA partnership, written by Eric Schmidt and Allen Jarvis, provided a detailed account of the architecture that supported the data analysis. It illustrated “the importance of proper tooling to enable collaboration across multiple disciplines, including data engineering, data analysis, data science, quantitative analysis, and machine learning”.

The architecture for this service requires:

  1. A flexible and scalable data processing workflow to support collaborative data analysis.
  2. New analytic explorations through collaboratively developed queries and visualizations.
  3. Real-time predictive insights and analysis related to the games, modeled around NCAA men’s and women’s basketball.

Eric and Allen go through each of these points at length. Their account indicates what is becoming available to sport as we explore archives for insights.

They have an important message in their conclusion:

… better data preparation means better data analysis. Many organizations imagine diving in directly to predictive modeling without a critical examination of their data or existing analytic frameworks. If the greatest value is to be found in predictive insights, followed by analysis, supported by clean but raw data, you can imagine the amount of work required to get there as the inverse: a lot of data preparation that paves the way for better analysis, which in turn clears a path for good modeling.

The ball is in all our courts.

Photo Credits

March Madness 2009 (Andy Thrasher, public domain)

Gators are in the Final Four (Courtland, CC BY-NC-ND 2.0)

Stephen Maxwell Corey

Stephen Corey was the co-author with Lloyd Messersmith of the 1931 paper The distance traversed by a basketball player. At that time, Stephen was a lecturer in the Department of Psychology at De Pauw University. He had received his PhD in 1928 at the University of Illinois.

Stephen and Lloyd were of similar ages: Stephen was born in 1904 and Lloyd in 1905. At the time of the publication of their paper (in volume 2 of the Research Quarterly) they were what we would today call ‘early career’ university teachers.

The paper reports the distances traversed in a whole game by the De Pauw University floor guard in a game against Miami University. Lloyd’s pursuit apparatus for measuring distances traversed required an assistant to record sounds emanating from the tracing wheel used. I imagine Stephen provided that service and kept a record of changes of possession in the game (p.59).

I am keen to introduce Stephen as part of this story. I see the paper as a seminal moment in the start of the notational analysis of performance as a scholarly activity. Lloyd brought his basketball teaching and coaching insights and Stephen came from a different academic background. Their paper cited no earlier references.

Stephen was appointed professor of educational psychology and superintendent of laboratory schools at the University of Chicago in 1940. Eight years later he joined the staff of Teachers College as executive officer of its Horace Mann-Lincoln Institute of School Experimentation.

Stephen developed an expertise in action research and published a number of papers and books on the subject. These include:

  • 1940. The teachers out-talk the pupils. The School Review, 48(10), 745-752.
  • 1949. Action research, fundamental research and educational practices. Teachers College Record, 50(8), 509-514.
  • 1953. Action research to improve school practices. Oxford: Bureau of Publications, Teachers Co.
  • 1954. Action research in education. The Journal of Educational Research, 47(5), 375-380.

To my knowledge, Stephen did not write another paper with Lloyd. In 1954, Stephen wrote that “action research in education is research undertaken by practitioners in order that they may improve their practices” (p.375).

Back in 1931, Lloyd was developing his skills as a coach. I am hopeful that Stephen’s interest in action research gave them lots to discuss as they both started out on their coaching, teaching and research journeys.

Both of them spent their professional lives as educators. Lloyd died in 1977 and Stephen in 1984.

Photo Credits

Corey, Stephen M. (The University of Chicago Photographic Archive [apf1-01929], Special Collections Research Center, University of Chicago Library permission to use for educational and scholarly uses)

Apparatus for measuring distance travelled by basketball players

Using BoxPlotR to Visualise Winning and Losing Profiles in the Regular #WNBL17 Season

The regular season for the WNBL concluded last Sunday.

I monitored the scores per quarter in each of the 96 games played. I was interested in the points gap between winners and losers within and between games.
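A minimal sketch of how a per-quarter points gap can be summarised, assuming each game is stored as four (winner's points, loser's points) pairs and that the gap is taken within each quarter rather than cumulatively. The three games below are made-up illustrations, not WNBL results, and the function names are my own.

```python
from statistics import median

# Hypothetical quarter-by-quarter scores, NOT real WNBL data:
# each game is a list of (winner's points, loser's points) for Q1..Q4.
games = [
    [(22, 18), (20, 15), (19, 17), (16, 21)],
    [(15, 17), (24, 16), (21, 14), (18, 20)],
    [(25, 12), (18, 19), (20, 16), (17, 19)],
]


def quarter_margins(game):
    """Winner's points minus loser's points in each quarter of one game."""
    return [winner - loser for winner, loser in game]


def median_margin_per_quarter(all_games):
    """Median of the quarter margins across games, one value per quarter."""
    per_game = [quarter_margins(game) for game in all_games]
    # zip(*per_game) regroups the margins by quarter instead of by game.
    return [median(quarter) for quarter in zip(*per_game)]


margins = median_margin_per_quarter(games)
print(margins)
```

A negative value in a quarter means that the eventual losers outscored the eventual winners in that quarter more often than not.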

My basic plot of the median difference per quarter for 96 games is:

I noted the closing of the points difference in the fourth quarter of games. Alan Gabel and Sidney Redner (2012), in their discussion of random walk behaviours in basketball, observed:

Another intriguing feature of basketball games is that the scoring probability at any point in the game is affected by the current score: the probability that the winning team scores decreases systematically with its lead size; conversely, the probability that the losing team scores increases systematically with its deficit size.

Leto Peel and Aaron Clauset (2015) comment on this restorative pattern too.
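The restorative pattern can be pictured as a random walk in which the team in front becomes slightly less likely to score next as its lead grows. The sketch below is only an illustration of that idea: the linear form, the slope of 0.01 per point and the one-point-per-play simplification are my assumptions, not values taken from either paper.

```python
import random


def leader_scoring_probability(lead, slope=0.01):
    """Probability that the team currently ahead scores next.

    Assumed linear form: 0.5 when tied, falling by `slope` per point of
    lead, clipped to [0, 1]. Form and slope are illustrative only.
    """
    return min(1.0, max(0.0, 0.5 - slope * abs(lead)))


def simulate_lead(n_plays=100, slope=0.01, seed=0):
    """Random walk of the score difference (team A minus team B).

    Each play is simplified to a single point for one team or the other.
    """
    rng = random.Random(seed)
    lead = 0
    path = [lead]
    for _ in range(n_plays):
        p_leader = leader_scoring_probability(lead, slope)
        # Team A scores with p_leader when ahead or tied, 1 - p_leader when behind.
        p_team_a = p_leader if lead >= 0 else 1.0 - p_leader
        lead += 1 if rng.random() < p_team_a else -1
        path.append(lead)
    return path


path = simulate_lead(n_plays=200)
```

With a slope of zero the walk is unbiased; any positive slope pulls the score difference back towards zero, which is the kind of late narrowing seen in the fourth-quarter medians.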

To extend my visualisation of the performances of winning and losing teams, I returned to the excellent BoxPlotR. This application:

allows users to generate customized box plots in a number of variants based on their data. A data matrix can be uploaded as a file or pasted into the application. Basic box plots are generated based on the data and can be modified to include additional information. Additional features become available when checking that option. Information about sample sizes can be represented by the width of each box where the widths are proportional to the square roots of the number of observations n. Notches can be added to the boxes. These are defined as +/-1.58*IQR/sqrt(n) which gives roughly 95% confidence that two medians are different.
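The notch and width rules in that description can be worked through by hand. This is a sketch of the arithmetic only, not BoxPlotR's own code: the function names are mine, and the quartile method ("inclusive" in Python's statistics module) is an assumption, since different tools compute quartiles slightly differently.

```python
from math import sqrt
from statistics import median, quantiles


def notch_limits(values):
    """Lower and upper notch limits: median -/+ 1.58 * IQR / sqrt(n)."""
    # Quartile method is an assumption; other tools may differ slightly.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / sqrt(len(values))
    centre = median(values)
    return centre - half_width, centre + half_width


def relative_box_widths(sample_sizes):
    """Box widths proportional to sqrt(n), scaled so the widest box is 1."""
    roots = [sqrt(n) for n in sample_sizes]
    largest = max(roots)
    return [root / largest for root in roots]


# Made-up example values, not WNBL data.
lower, upper = notch_limits([1, 2, 3, 4, 5, 6, 7, 8, 9])
widths = relative_box_widths([96, 24])
```

For the nine example values the quartiles are 3 and 7, so the notch extends 1.58 × 4 / 3 ≈ 2.11 either side of the median of 5. A group with a quarter of the observations gets a box half as wide. When the notches of two boxes do not overlap, that is usually read as evidence that the two medians differ.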

My plots for the 96 games are:





Winners and Losers Compared

This is the third time I have used BoxPlotR to visualise data. It is the first time I have used notches in the visualisation. My other attempts were to visualise the numbers of caps per country at RWC15 and goal-scoring netball performance in a test series between Australia and England in 2016.

I have found BoxPlotR ideal for generating box plots, particularly compared with the limited functionality available in Google Sheets, where I curate my data. I did not have much success with Excel either.

If you would like to learn more about the evolution of BoxPlotR, you might find this editorial (Kick the bar chart habit) and Daniel Evanko’s blog post (Bring on the box plots) of interest.

Photo Credits

Recap of Round 19 (WNBL, Twitter)

The Last Game (WNBL, Twitter)