Victoria Stodden
Associate Professor of Information Sciences, University of Illinois at Urbana-Champaign

I'm not talking about retiring the abstract idea, or its place in scientific discourse and discovery, but instead I'm suggesting redefining specifically what is meant by that word and using more appropriate terminology for the different research environments scientists work within today.

When the concept of reproducibility was brought into scientific discourse by Robert Boyle in the 1660s, scientific experimentation and discovery comprised two things: deductive reasoning, such as mathematics and logic, and Francis Bacon's relatively new machinery of induction. How to verify correctness was well established in logical deductive systems at that point, but verifying experimentation was much harder.

Through his attempts with Robert Hooke to establish a vacuum chamber, Boyle made a case that inductive, or empirical, findings—those that arose from observing nature and then drawing conclusions—must be verified by independent replication. It was at this time that empirical research came to be published with sufficient detail regarding procedure, protocols, equipment, and observations such that other researchers would be able to repeat the procedure, and presumably therefore repeat the results.

This conversation is complicated by today's pervasive use of computational methods. Computers are unlike any previous scientific apparatus because they act as a platform for the implementation of a method, rather than directly as an instrument. This creates additional instructions to be communicated as part of Boyle's vision of replicable research—the code, and digital data.

This communication gap has not gone unnoticed in the computational science community. In a way reminiscent of Boyle's day, many voices are now calling for new standards of scientific communication, this time ones that include digital scholarly objects such as data and code. Irreproducible computational results from genomics research at Duke University in recent years crystallized attention on this issue and led to a report by the Institute of Medicine of the National Academies recommending new standards for clinical trials approval for computational tests arising from computational research.

The report recommended for the first time that the software associated with a computational test be fixed at the beginning of the approval process, and thereafter made "sustainably available." A subsequent workshop at Brown University on "Reproducibility in Computational and Experimental Mathematics" (of which I was a co-organizer) produced recommendations regarding the appropriate information to include when publishing computational findings, including access to code, data, and implementation details. Reproducibility in this context should be relabeled computational reproducibility.
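To make "access to code, data, and implementation details" concrete, here is a minimal sketch in Python of what recording that information alongside a result might look like. The function names (`run_analysis`, `provenance_record`), the toy dataset, and the seed value are all hypothetical and chosen only for illustration; this is not a format prescribed by the Institute of Medicine report or the Brown workshop.

```python
import hashlib
import json
import platform
import random
import sys


def run_analysis(data, seed):
    """Toy stand-in for a computational method: mean of a seeded random subsample."""
    rng = random.Random(seed)  # fixing the seed makes the subsample repeatable
    sample = rng.sample(data, k=5)
    return sum(sample) / len(sample)


def provenance_record(data, seed):
    """Collect details a reader would need in order to rerun the computation."""
    data_bytes = json.dumps(data).encode("utf-8")
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        # A fingerprint of the exact input data (hypothetical toy data here).
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "result": run_analysis(data, seed),
    }


data = list(range(100))  # illustrative data, not from any real study
record = provenance_record(data, seed=2014)
print(json.dumps(record, indent=2))
```

Rerunning `run_analysis` with the same data and seed returns the same number, and the recorded hash lets a reader check that they hold the same input. A record like this is only one small piece of computational reproducibility; it does not by itself capture library versions or the code of the method.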

Computational reproducibility can then be distinguished from empirical reproducibility, or Boyle's version of the appropriate communication for non-computational empirical scientific experiments. Making this distinction is important because traditional empirical research is running into a credibility crisis of its own with regard to replication. As Nobel Laureate (and Edgie) Daniel Kahneman has noted in reference to the irreproducibility of certain psychological experiments, "I see a train wreck looming."  

What is becoming clear is that science can no longer be relied upon to generate "verifiable facts." In these cases, the discussion concerns empirical reproducibility, rather than computational reproducibility. But calling both types "reproducibility" muddies the waters and confuses discussion aimed at establishing reproducibility as a standard. I believe there is (at least) one more distinct type, statistical reproducibility, whose absence is a separate source of irreproducibility. Addressing issues of reproducibility through improvements to the research dissemination process is important, but insufficient.

We also need to consider new measures to assess the reliability and stability of statistical inferences, including developing new validation measures and expanding the field of uncertainty quantification to develop measures of statistical confidence and a better understanding of sources of error, especially when large multi-source datasets or massive simulations are involved. We can also do a better job of detecting biases arising from statistical reporting conventions that were established in a data-scarce, pre-computational age.
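One familiar way to assess the stability of a statistical inference is to resample the data and see how much the estimate moves. The sketch below uses a percentile bootstrap for a sample mean, written in Python. It is an illustration of the general idea only, not one of the new validation measures called for above; the function name `bootstrap_interval`, its parameters, and the synthetic measurements are assumptions made for the example.

```python
import random
import statistics


def bootstrap_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower_index = int(tail * n_resamples)
    upper_index = int((1 - tail) * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Synthetic measurements, generated here purely for illustration.
data_rng = random.Random(42)
measurements = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

lower, upper = bootstrap_interval(measurements)
print(f"sample mean: {statistics.fmean(measurements):.3f}")
print(f"95% bootstrap interval: ({lower:.3f}, {upper:.3f})")
```

A wide interval signals that a repeated study could plausibly report a noticeably different estimate even if every step were communicated perfectly, which is the sense in which statistical reproducibility is separate from the empirical and computational kinds.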

A problem with any one of these three types of reproducibility (empirical, computational, and statistical) can be enough to derail the process of establishing scientific facts. Each type calls for different remedies, each with different implementations: improving existing communication standards and reporting (empirical reproducibility), making computational environments available for replication purposes (computational reproducibility), and statistically assessing repeated results for validation purposes (statistical reproducibility). Of course these are broad suggestions, and each type of reproducibility can demand different actions depending on the details of the scientific research context, but confusing these very different aspects of the scientific method will slow our resolution of Boyle's old discussion that started with the vacuum chamber.