My title quotes Richard Feynman, and I am using his words to express how the Internet is not only providing information about our world but also making available the means to understand it in a deep sense. The increased use of the computer in scientific research, from simple data analysis to simulations, means the ability to recreate and verify facts for oneself is very real, as scientists can release on the Internet the complete software environment and data required to reproduce their results. The Internet is opening this possibility to society at large for the first time. If our home computing power or disk space is insufficient, the Internet connects us to massive computing resources such as the TeraGrid or the cloud. We are poised to empower our own decision making through Internet-based verification of what we believe, important not only for self-determination but also for the validity of the computational results themselves. The result is a change in how I expect to understand the world.
Data analysis has risen as an intellectual force of its own, with implications for how we accept new knowledge as fact. In 1962 John Tukey first proposed data analysis as a field in its own right, splitting the field of statistics in two. At that time, statistics was synonymous with mathematical analysis and the Information Age was only just beginning. Tukey foresaw the coming data deluge and recognized that the traditional machinery of mathematical statistics, such as hypothesis testing and confidence statements, had relatively little to offer for these new problems. There was an enormous amount of analysis to be done on vast amounts of data, and insisting on mathematics ran the risk of missing important findings. Now data analysis is presenting challenging mathematical questions, and we are running that same risk in reverse.
When awash in data it is common to use the following three-step investigative method: a new phenomenon is found in the data, an analysis strategy is justified on heuristic grounds, and some computational examples of apparent success are provided. This approach makes it nearly impossible to derive the deeper intellectual understanding that the mathematical framework is geared to uncover. Our basic tools of modern data analysis, from regression to principal components, were developed by scientists working squarely in the mathematical tradition, and are based on theorems and analysis. As the Internet facilitates a national hobby of data analysis, our thinking about scientific discovery is no longer typically in the intellectual tradition of mathematics. This tradition, the area of my training, defines a meaningful investigation as one involving a formal definition of the phenomenon of interest, stated carefully in a mathematical model, and a strategy for analysis that follows logically from the model. It is accompanied at every step by efforts to show how the opportunity for error has been minimized. As data analysts we must hold ourselves to the same high standards of transparency in our findings, and consequently I am pushing my thinking toward deeper intellectual rigor, more in line with the mathematical tradition and less in line with the data analysis tradition so readily facilitated by the Internet.
Mathematics has been developing responses to the ubiquity of error for hundreds of years, resulting in formal logic and the mathematical proof. Computation is similarly highly error-prone, but recent enough that it is still developing equivalent standards of openness and collective verification. An essential response is reproducibility of results: the release of the code and data that generated the computational findings we'd like to consider a contribution to society's stock of knowledge. This subjects computational research to the same standards of openness that the proof provides in mathematics.
The Internet has changed how I think about science, and how to identify it. Today most computational results aren't accompanied by their underlying code and data, and my opening description of being able to recreate results for oneself is not commonplace. But I believe this will become typical: the draw of verifying what we know for ourselves, and of being less reliant on the conclusions of others, has remained evident throughout our long search for truth about our world. This seems a natural evolution from a state of knowledge derived from mystical sources, with little ability to question and verify, through a science-facing society that still maintains an epistemological gulf between scientist and non-scientist. Now the Internet allows more of our understanding to seep from the ivory tower, closing that gulf, empowering us to know things for ourselves, and changing our expectations about what it means to live in an open, data-driven society.