David Glance has a fascinating post about garage biotech in The Conversation today.
In the post, David reports on a talk given by Atul Butte in which he suggested:
like other Silicon Valley startups, almost anyone can bring a drug to market from their garage with just a computer, the internet, and freely available data.
David points to the amount of genetic data online to support this innovation. He concludes that “none of this would be possible without the sharing of data” and that “the growth of availability of open research data will be able to fuel a range of uses that would not have been foreseen when the individual experiments were being carried out”.
Secondary Data Analysis
Twenty-two years before Larry Page and Sergey Brin moved into the garage in Menlo Park, Gene Glass wrote a short paper on primary, secondary and meta-analysis of data.
He proposes that secondary analysis:
is the re-analysis of data for the purpose of answering the original research question with better statistical techniques, or answering new questions with old data. (1976:3)
He adds “some of our best methodologists have pursued secondary analysis in such grand style that its importance has eclipsed that of the primary analysis” (1976:3).
Gene was writing in a pre-internet age where the archiving of primary data was a local task subject to the curation habits of individual researchers. His discussion of secondary and meta-analysis of data argued for “more scholarly effort concentrated on the problem of finding the knowledge that lies untapped in completed research studies”.
An indication of how times have changed since his paper can be found in Nathan Bean and colleagues’ (2016) use of an extant longitudinal high school study. They note some of the advantages of accessing these data but point to limitations as well. These include:
- the need to apply weights to ensure statistical analyses were generalizable to the national population and that standard errors were adjusted for the complex sampling design
- the items included in the original study
- missing data
- anonymised data and the use of redacted data codes
They do not mention data formats explicitly but this is an issue too as we start to share data files as part of research conversations.
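The first limitation in the list, the need to apply weights, can be illustrated with a minimal sketch. The scores and sampling weights below are invented for illustration and are not taken from HSLS:09; the function names are my own. The sketch shows only how a weighted mean differs from an unweighted one. It does not attempt the adjustment of standard errors for a complex sampling design, which needs survey-specific methods.

```python
# Hypothetical records: (score, sampling_weight).
# A weight says how many people in the population one respondent stands for.
# These values are invented for illustration; they are not HSLS:09 data.
records = [
    (70.0, 100.0),
    (80.0, 300.0),
    (90.0, 600.0),
]


def unweighted_mean(rows):
    """Treat every respondent as counting equally."""
    return sum(score for score, _ in rows) / len(rows)


def weighted_mean(rows):
    """Let each respondent count in proportion to their sampling weight."""
    total_weight = sum(weight for _, weight in rows)
    return sum(score * weight for score, weight in rows) / total_weight


print("Unweighted mean:", unweighted_mean(records))
print("Weighted mean:", weighted_mean(records))
```

With these made-up numbers the two estimates differ, because the respondent with the highest score also represents the largest share of the population. A secondary analyst who ignored the weights would be describing the sample rather than the population.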
The authors conclude:
Large data sets like HSLS:09 provide a veritable wealth of data that can answer a broad range of questions – were the combined analytical powers of the nations’ education graduate student population applied to these data sets, the pace and significance of research, both original and conformational, could be vastly increased. Given the slow pace of educational research (limited by resources, researchers, and the growth and learning rates of participants), this kind of crowd-sourced research effort offers the opportunity to accelerate research efforts through parallel inquiries.
Furthermore, this study also prepared four graduate student researchers to conduct future quantitative research efforts, gave them hands-on experience in statistical analysis, and helped them to see the challenges and limitations of such studies. The benefits from this extra depth are two-fold: For those students going into future research, this experience was clearly valuably preparatory and helped establish them as published researchers. For those who intend to return to educational practice, it helped them to understand the role of research in their field, as well as how to evaluate and understand research findings so as to better apply them within their area of practice.
I see the growing availability of performance data and the sharing of these data in sport contexts as opportunities to induct students into performance analysis and analytics.
I think this process requires detailed accounts of methodologies from those who collect the primary data. (See, for example, Natalie Koziol and Ann Arthur’s (2011) recommendations.)
Just as I was concluding my reading of David Glance’s post about garage biotech, I received an alert to Jeffrey Pomerantz and Robin Peek’s (2016) discussion of the different uses of ‘open’ in the literature. They note:
it has been the twenty-first century that has seen the most dramatic increase in the number of terms that use “open.” The story of this explosion in the use of the word “open” begins, however, with a different word entirely: the word “free.”
They propose that the use of the word ‘open’ means:
- enabling openness
- philosophically aligned with open principles
Jeffrey and Robin make this point in their conclusion:
The user of an open resource is free to do with it what they like, which may include creating a new resource, which another user may be free to do with what they like, etc. Openness creates a virtuous cycle.
In a digital age of open sharing, each of us has an opportunity to become a produser within this virtuous cycle.
As I have explored some of the open opportunities for performance analysis and analytics, I have found GitHub to be a shining example of how transparent, open sharing works.
The availability of data for secondary analysis is increasing at a remarkable pace. We have more opportunities than ever to explore data at two removes. This does raise very important questions about the validity and reliability of primary source data. Openness does not mean we abandon the scrutiny we give to data. Perhaps it can increase our methodological sensitivity.
I am attracted to Nathan Bean and colleagues’ approach to inducting young researchers into the processes of secondary data analysis. I hope that, as we reap the benefits of open sharing in the senses identified by Jeffrey and Robin, we can also contribute stories about how this produsing exists as practice.
1998 Garage days (Fortune, Google Turns 10)