We Need a GitHub for Academic Research



The internet has profoundly affected every aspect of our lives—how we shop, how we bank, how we get our news, how we learn to samba.

One striking exception to this pattern is the way that academic scientists report the results of new research. As they have for centuries, scientists continue to write papers that summarize the results of their work and then submit them to scholarly journals for potential publication. Readers of these journals, for the most part, are other working scientists. The more prestigious the journal is, the better that is for the scientist’s career advancement prospects. The paper serves as the official and complete account of a given research effort, which researchers note in their curricula vitae as their chief credentials for advancement. No papers, no employment. Communicating the results of scientific studies remains rooted in printing presses and elegant typography.

This is a shame because the academic paper has some inherent limitations—chief among them that it can provide only a summary of a given research project. Even an outstanding paper cannot provide direct access to all of the research data collected or to the record of discussions among scientists that is reflected in lab notes. These windows into the messy and halting process of science, which can be extremely valuable learning objects, are not yet part of the official record of a research study.

But it doesn’t have to be this way. If we take advantage of the unique capabilities of the web to tell the full story of a research project, rather than merely using it as a faster printing press as we do today, we can build greater transparency into our approach to reporting science. Besides improving information-sharing among scientists, a push toward transparency could improve public trust in science and scientists. Now, when the very concepts of fact and truth are under assault and many scientists feel compelled to march in response, is the perfect time to rethink our approach to scientific communication altogether.

Not so fast, skeptics might say. It’s true that scientific publishing as we know it today has a long and storied history; the first scholarly journal in the West, the Philosophical Transactions of the Royal Society of London, appeared in 1665. And it is thanks to scientific journals that we have gained some of humanity’s most important knowledge.

On the other hand, scholarly journals have also reported frauds. One case in point is Andrew Wakefield’s false assertion of a link between vaccines and autism in children. Wakefield published this report in 1998, based on fabricated and unethically obtained data. The journal the Lancet fully retracted it in 2010, but even today this false claim lives on.

The Wakefield case is a blockbuster, an example of blatant and protracted malfeasance. Other researchers cut corners in more subtle, less overtly malicious ways. The website Retraction Watch tracks scientific claims that end up being retracted due to falsified data or methodological errors. Retractions have been on the rise in recent decades. This is true in long-established and prestigious journals, as well as in fly-by-night “journals” that have cropped up online.

False reporting means that reproducing the claimed results—that is, another researcher attempting to replicate the study’s methods to see if he or she obtains the same results—becomes much harder. Such careful validation is a hallmark of the scientific method, and it depends on having access to accurate and complete data.

Concern with reproducibility has increased in recent years. In 2015, an international research team led by Brian Nosek of the Center for Open Science argued that reproducibility is a challenge for research in cognitive and social psychology. They drew this conclusion after making a concerted effort to reproduce the findings of 100 prior studies and only being able to reproduce 39 percent of the original results. This effort is known as the Reproducibility Project.

Nosek does not claim that all researchers commit blatant fraud in the Wakefield style. Most people have a stronger moral compass than that. The trouble, according to Nosek, is the pressure to generate “novel results” that increase chances of publication. This is easier to do if only the members of the research team, and nobody else, have full access to the underlying, unfiltered data that drive their conclusions. Nosek’s work aims to bring the entire record of a scientific project into public view.

Researchers at the Reproducibility Project provide for maximum transparency in their own work. The team provides complete and open access to every product it created: all of its statistical scripts, a detailed protocol for completing study reproduction attempts, and the full record of every study reproduction attempted.

It would have been much easier, and in keeping with current norms, to publish a paper that provided just a fraction of the material the Reproducibility Project has produced. Instead, they offered it all. Their fully transparent approach provided critics—who challenged the team’s data, methods, and conclusions—with all of the evidence needed to make their case. Nosek responded gamely to such critiques, and the debate continues apace.

The Reproducibility Project’s conclusions may or may not be correct. The critical point is that all of the evidence is available, which is the best way to facilitate comprehensive understanding of any topic at hand. Given this philosophical commitment, it is not surprising that Nosek contributed to the “Manifesto for Reproducible Science” that appeared earlier this year.

Enter “GitHub of Science,” an idea proposed in 2014 by biological engineer Marcio von Muhlen. Launched in 2008, GitHub has become the world’s leading repository of open-source computer code. Open-source code can be freely accessed and developed by any software developer, spurring continuous iterative development. GitHub is built around Git, the version-control system originally created to manage development of the Linux operating system, which is also open source. Leading technology companies such as Facebook and Twitter now host their code for open-source projects on GitHub, which is one of the 100 most visited sites on the web.

Von Muhlen’s proposal focused on using the social web to quickly reward innovative scientists, using GitHub as a model. A full GitHub for science could go even further, focusing on increasing transparency to improve reproducibility. In a GitHub for science, each “paper” that researchers produce would reflect the complete and full record of an experiment—every lab note, every statistical script, every audio file, and every bit of computer code. To the greatest extent possible, this evidence would be shared in real time. The research process is rife with trial and error, and it’s not as linear as the version of events recorded in a paper. A GitHub for science would emphasize the preliminary and evolving nature of the data, and of scientific understanding itself.

This complete record of research would also facilitate new work much more seamlessly than occurs today. As it stands, most new research is built entirely on the summary of earlier work that is contained in a published paper. As the experience of the researchers associated with the Reproducibility Project shows, authors do sometimes provide access to their data files upon request. Even so, having to make such a request at all is an unnecessary, archaic barrier. The “paper” (other more modern terms are welcome) can and should evolve into a guide to the evidence accumulated and no longer serve as a complete statement of work.

The trouble is that there is currently no incentive for researchers to share their data widely. Indeed, the opposite is quite often true. Fear of being scooped, whether or not justified, causes researchers to guard their data closely.

It will take a while for this cultural shift to catch on—if it ever does. Technological possibility does not inevitably prevail over systemic inertia. After all, universal, ubiquitous electronic medical records that are linked across different health care systems have been in discussion since the 1960s and are still far from reality. There is no doubt that scientific research as a whole would benefit from greater transparency and openness. The trouble is that individual researchers perceive that they stand to lose from becoming more open, given the prevailing incentives, at least in the short term. This is a variation on the now familiar (albeit still depressing) problem of the “tragedy of the commons.”

The only people with the power to change this situation are scientists themselves. They set the terms for what counts as appropriate evidence for their fields and enforce the norms regarding how research is shared. Here we find some glimmers of hope. In addition to the pioneering efforts of the Reproducibility Project, many scientists are acutely concerned about the perverse incentive to publish novel results regardless of the actual evidence. Often these scientists are early in their careers and still climbing the academic ladder. Once they are in leadership roles themselves, they will be able to change incentives as a way to change behavior.

It will be worth the wait. A GitHub for science would facilitate the entire purpose of scientific discourse, which is to engage in dialogue based on a complete appreciation of the topic under discussion. If scientists choose to build it, we will all benefit from an improved and evolving understanding of the world.

This article is part of Future Tense, a collaboration among Arizona State University, New America, and Slate. Future Tense explores the ways emerging technologies affect society, policy, and culture.
