Data Fragments

The Centre Pompidou in Paris. People crossing the street with data in the picture.

I have managed to read three of the tantalising feeds I received yesterday.

The first was by Prateek Karkare on Decision Trees (link). I found his intuitive introduction very helpful. He started with some binary decision examples, then moved on to classification, regression, and learning.

The second was Scott Berinato’s Data Science and the Art of Persuasion (link). In it, Scott observes that organisations:

still expect data scientists to wrangle data, analyze it in the context of knowing the business and its strategy, make charts, and present them to a lay audience. That’s unreasonable.

He proposes “rethinking how data science teams are put together, how they’re managed, and who’s involved at every point in the process, from the first data stream to the final chart shown”.

Scott explores a last-mile problem that has existed for a century (“As the cathedral is to its foundation so is an effective presentation of facts to the data”) (link). Scott concludes that a better data science operating environment needs to:

  • Define talents rather than team members (management, wrangling, analysis, domain expertise, design, storytelling)
  • Create a portfolio of talents
  • Share experiences and insights
  • Structure projects around talents

With this approach in place:

  • Assign a single, empowered stakeholder
  • Assign leading talent and support talent
  • Co-locate
  • Reuse and template

The third read was Susan Grajek’s The Student Genome Project (link). In her introduction, she observes:

In 2019, after a decade of preparing, colleges and universities stand on a threshold, eager to enter a new era of using technology to unlock our ability to apply data to advancing our missions. That threshold is similar to the one that science faced in the late 20th century: eager to begin using technology to put genetic information to use.

I thought this would resonate in sport contexts too. Note Susan’s point: “We have a growing belief in the value and power of data to understand root causes and improve advice, decisions, and outcomes”.

This resonated very powerfully with me:

our sector faces a daunting preliminary task: we must understand the component parts (find the data, clean it, standardize it, safeguard it); integrate and manage those parts; and find the right tools for these tasks. Just as the big challenge facing genetics in the 1990s was foundational, so is the big challenge that confronts higher education and technology today. After almost a decade of attention and effort, we find ourselves still at the beginning of the data journey—needing to, in effect, “sequence” the data before we can apply it with any reliability or precision.

They are three data fragments, but together they have provided me with another delightful day of exploration. I note them here as part of my learning portfolio.

Photo Credit

Photo by Curtis MacNewton on Unsplash

#AFLW 2019

The 2019 AFLW season starts on Saturday with the opening game between Geelong and Collingwood (link to fixtures).

I have some data from last year’s regular season (link) curated as secondary data from the official AFLW web site (link).

Median Profiles

A violin plot created with BoxPlotR (link). (W1Q is the winning team’s first quarter, L1Q the losing team’s first quarter.)

Plot information

These data have given me an opportunity to postulate some naive priors about when points will be scored in the 2019 season. The probabilities per quarter are based upon game outcome so that the labels ‘winning’ and ‘losing’ relate to the game not the quarter.
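A minimal sketch of how such quarter-by-quarter priors could be tabulated in R (the data values and column names here are illustrative stand-ins, not the actual 2018 season data):

```r
library(dplyr)

# Toy data: points per quarter for winning and losing teams (values illustrative).
# 'outcome' refers to the game result, not the quarter result.
scores <- data.frame(
  outcome = rep(c("winning", "losing"), each = 4),
  quarter = rep(1:4, times = 2),
  points  = c(20, 25, 22, 28, 15, 12, 14, 10)
)

# Naive prior: the share of a team's total points scored in each quarter,
# conditioned on whether the team went on to win or lose the game
priors <- scores %>%
  group_by(outcome) %>%
  mutate(prior = points / sum(points)) %>%
  ungroup()

priors
```

The grouping by `outcome` is what keeps the “winning” and “losing” labels tied to the game rather than the quarter.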

Trying gganimate

The gganimate CRAN package “extends the grammar of graphics as implemented by ggplot2 to include the description of animation. It does this by providing a range of new grammar classes that can be added to the plot object in order to customise how it should change with time”. (link)

I have been trying out the package’s functionality with some WBBL cricket data. I followed the very basic guidelines to produce four gifs.

I am keen to explore this package in more detail. Before I started, I used the visdat package (link) and its vis_dat() function to check the structure of my data:
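As a sketch of that check (the data frame here is a small stand-in for my WBBL data, which I have not reproduced):

```r
# Install once if needed: install.packages("visdat")
library(visdat)

# A stand-in data frame with the two columns used in the animation code below
df <- data.frame(
  Wicket = 1:10,
  WTPartnership03 = runif(10)
)

# vis_dat() plots each column's class and highlights any missing values
vis_dat(df)
```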

My very basic code for the animations was:

library(ggplot2)
library(gganimate)

ggplot(df, aes(Wicket, WTPartnership03, size = WTPartnership03)) +
  scale_x_continuous(breaks = 1:10) +
  geom_point(alpha = 0.7, show.legend = FALSE, colour = "blue") +
  labs(title = 'Wicket Winning Team Partnerships WBBL03: {frame_time}',
       x = 'Wicket', y = 'Proportions of Total Runs Scored Per Wicket WBBL03') +
  transition_time(Wicket) +
  ease_aes('linear')


I used anim_save("~/Documents") to save the animation. This command saves the most recent animation as a gif to your specified location. I had some problems with my x axis, so I tried scale_x_continuous(breaks = 1:10) to resolve the issue and provide an integer count for the header in the gif.
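For reference, anim_save() also accepts an explicit file name and directory via its path argument (the file name here is illustrative):

```r
library(gganimate)

# anim_save() writes the most recent animation by default (last_animation())
anim_save("wicket_partnerships.gif",
          animation = last_animation(),
          path = "~/Documents")
```

Naming the file explicitly avoids relying on gganimate’s default output name when saving several gifs in one session.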