the NCAA is migrating 80+ years of historical and play-by-play data, from 90 championships and 24 sports, to Google Cloud Platform.
One of the first activities planned was to explore basketball data in preparation for the NCAA’s Women’s and Men’s Division I Basketball Tournaments held in March and April 2018 (March Madness).
More information about the partnership appeared in two posts on 30 March 2018. In the first post, Courtney Blacker reported a months-long experiment “to apply Google’s technologies to the NCAA’s treasure trove of data”.
We assembled a team of technicians, data scientists, and basketball enthusiasts (we call them ‘The Wolfpack’) who built a data processing workflow using Google Cloud Platform technologies like BigQuery and Cloud Datalab.
The aim of this approach was “to build models that look at influential factors on team performance”. During the tournaments, the Google Cloud team planned to “use our workflow to analyze our observations from the first half of each game against NCAA historical data to hone in on a stat-based prediction for the second half that we think is highly probable”. These predictions would be presented in a television advert during the half-time break.
An example from the Kansas v Villanova semi-final game:
The video suggested there would be at least 26 assists in the second half (there were 28) and 55 shot attempts (there were 64). In the second semi-final, Michigan v Loyola-Chicago, the predictions were for 37 three-point attempts (there were 38) and 29 rebounds (there were 29).
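A quick way to check how those semi-final predictions fared is to put the four pairs of numbers side by side. The sketch below is my own illustration, not part of the Google Cloud workflow. It assumes each prediction can be read as an "at least" threshold (the source states this explicitly only for assists), and the `Prediction` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One half-time prediction and the observed second-half figure."""
    game: str
    stat: str
    predicted: int
    actual: int

    @property
    def margin(self) -> int:
        # Positive when the observed figure exceeded the prediction.
        return self.actual - self.predicted

    @property
    def met(self) -> bool:
        # Assumption: every prediction is treated as an "at least" threshold.
        return self.actual >= self.predicted


# Figures quoted in the text above for the two semi-finals.
PREDICTIONS = [
    Prediction("Kansas v Villanova", "assists", 26, 28),
    Prediction("Kansas v Villanova", "shot attempts", 55, 64),
    Prediction("Michigan v Loyola-Chicago", "three-point attempts", 37, 38),
    Prediction("Michigan v Loyola-Chicago", "rebounds", 29, 29),
]

for p in PREDICTIONS:
    status = "met" if p.met else "missed"
    print(f"{p.game}: {p.stat} predicted {p.predicted}, "
          f"actual {p.actual} ({p.margin:+d}, {status})")
```

Read this way, all four predictions were met, with margins ranging from zero (rebounds) to nine (shot attempts).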
The final had these suggestions:
The second post, written by Eric Schmidt and Allen Jarvis, gave a detailed account of the architecture that supported the data analysis. It illustrated “the importance of proper tooling to enable collaboration across multiple disciplines, including data engineering, data analysis, data science, quantitative analysis, and machine learning”.
The architecture for this service requires:
- A flexible and scalable data processing workflow to support collaborative data analysis.
- New analytic explorations through collaboratively developed queries and visualizations.
- Real-time predictive insights and analysis related to the games, modeled around NCAA men’s and women’s basketball.
Eric and Allen go through each of these points at length. Their account indicates what is becoming available to sport as we explore archives for insights.
They have an important message in their conclusion:
… better data preparation means better data analysis. Many organizations imagine diving in directly to predictive modeling without a critical examination of their data or existing analytic frameworks. If the greatest value is to be found in predictive insights, followed by analysis, supported by clean but raw data, you can imagine the amount of work required to get there as the inverse: a lot of data preparation that paves the way for better analysis, which in turn clears a path for good modeling.
The ball is in all our courts.