I have been following up on some leads shared by Mara Averick. Two recent suggestions caught my attention as I try to improve the ways I share and connect.
The first was a post by Joris Muller about reproducible computational research for R users. In it he explores ideas shared in a 2013 paper written by Geir Sandve and colleagues. In that paper, the authors propose ten rules for reproducible computational research. These rules are very pertinent to anyone seeking to share and explore performance in sport using insights from analytics.
The ten rules are:
- Keep track of how every result was produced
- Avoid manual data manipulation steps
- Archive the exact versions of all external programs used
- Version control all custom scripts
- Record all intermediate results in standardised formats when possible
- For analyses that include randomness, note underlying random seeds
- Always store raw data behind plots
- Generate hierarchical analysis output, allowing layers of increasing detail to be inspected
- Connect textual statements to underlying results
- Provide public access to scripts, runs and results
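Two of these rules, noting random seeds and storing the raw data behind plots, are easy to picture in a few lines of code. What follows is only a minimal sketch of my own, written in Python rather than R, and it is not taken from Joris's post or the Sandve paper. The seed value, the function names and the made-up match scores are all hypothetical.

```python
import csv
import io
import random

SEED = 2017  # hypothetical seed, recorded so the run can be repeated


def simulate_scores(seed, n=5):
    """Draw n made-up match scores from a generator fixed by the seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 40) for _ in range(n)]


def to_csv(seed, scores):
    """Return CSV text holding the seed and the raw values behind a plot."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["seed", "match", "score"])
    for match, score in enumerate(scores, start=1):
        writer.writerow([seed, match, score])
    return buffer.getvalue()


scores = simulate_scores(SEED)
raw_csv = to_csv(SEED, scores)
print(raw_csv)
```

Because the seed travels with the data, anyone who reruns `simulate_scores(SEED)` should get the same values, and the CSV text could be saved next to the plot it supports.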
Joris concludes his post:
All the 10 rules proposed in the Sandve paper are reachable for a R user. Just by using R itself, the rmarkdown workflow and some organisational rules cover most of these rules. My basic reproductible workflow meet almost all the criterias with the notable exceptions of the software archive (but it’s work in progress with packrat) and the lack of public access (but I can’t share everything).
For an introduction to Joris’s workflow, you might find this post of interest.
The second lead from Mara focussed on reproducible behaviour too. Jenny Bryan shared her ideas back in 2015 about Naming Things. This is one of the many resources Jenny has shared. I have found her GitHub repositories immensely helpful. In Naming Things, Jenny notes three principles for file names: machine readable, human readable and ‘plays well with default ordering’.
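To make the three principles concrete for myself, here is a small sketch in Python (not R). It assumes one convention that seems to fit them: an ISO 8601 date first, underscores between fields and hyphens within a field. The convention, the helper names `make_name` and `parse_name`, and the netball file names are my own illustrative assumptions, not Jenny's code.

```python
import re
from datetime import date


def slug(text):
    """Lower-case the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def make_name(day, project, description, ext):
    """Build a name of the form YYYY-MM-DD_project_description.ext."""
    return f"{day.isoformat()}_{slug(project)}_{slug(description)}.{ext}"


def parse_name(name):
    """Split a name back into its fields (the machine readable part)."""
    stem, ext = name.rsplit(".", 1)
    day, project, description = stem.split("_")
    return {"date": day, "project": project,
            "description": description, "ext": ext}


# Hypothetical files from a netball analysis
names = [
    make_name(date(2017, 3, 9), "Netball", "Shot locations", "csv"),
    make_name(date(2016, 11, 21), "Netball", "Centre passes", "csv"),
]

# Plain alphabetical sorting is also chronological, because the date leads
for name in sorted(names):
    print(name, parse_name(name))
```

The names stay readable to a person, split cleanly on the underscore for a script, and fall into date order in any file browser without extra effort.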
The two leads sent me off thinking about how I might improve my practice. I am fascinated by Joris’s transparency with his workflow and I see this approach as essential for sport analytics as we start to extend cumulative rather than ‘ab initio’ research. I admire Jenny’s work immensely. I have tried to use some robust file naming conventions for the past fifteen years as I have sought to use cloud-based storage for all my resources. I realise I am a long way from meeting Jenny’s three principles at the moment but this will be a work in progress.
Mara Averick’s Twitter recommendations are becoming a very important way for me to connect with a community of practice. The two leads discussed here are a way for me to make this process explicit … and to initiate a conversation about reproducible behaviours in sport analytics research and practice.
Photo Credits
Tree on campus (Keith Lyons, CC BY 4.0)