2014 : WHAT SCIENTIFIC IDEA IS READY FOR RETIREMENT?

Charles Seife
Professor of Journalism, New York University; Former Journalist, Science Magazine; Author, Hawking Hawking
Statistical Significance

It's a boon for the mediocre and for the credulous, for the dishonest and for the merely incompetent. It turns a meaningless result into something publishable, transforms a waste of time and effort into the raw fuel of scientific careers. It was designed to help researchers distinguish a real effect from a statistical fluke, but it has become a quantitative justification for dressing nonsense up in the mantle of respectability. And it's the single biggest reason that most of the scientific and medical literature isn't worth the paper it's written on.

When used correctly, the concept of statistical significance is a measure to rule out the vagaries of chance, nothing more, nothing less. Say, for example, you are testing the effectiveness of a drug. Even if the compound is completely inert, there's a very good chance (roughly 50%, in fact) that patients will respond better to your drug than to a placebo. Randomness alone might imbue your drug with seeming efficacy. But the more marked the difference between the drug and the placebo, the less likely it is that randomness alone is responsible. A "statistically significant" result is one that has passed an arbitrary threshold. In most social science journals and in the medical literature, an observation is typically considered statistically significant if pure randomness would produce an effect at least that large less than five percent of the time. In physics, the threshold is usually lower, often 0.3% (three sigma) or even 0.00003% (five sigma). But the essential dictum is the same: if your result is striking enough that it passes that threshold, it is given a weighty label: statistically significant.
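
To make the threshold concrete, here is a minimal simulation sketch (not part of the essay itself; it assumes Python with numpy and scipy, and the group sizes and distributions are purely illustrative). An inert "drug" beats a placebo roughly half the time by luck alone, but clears the 5% significance bar only about once in twenty trials.

```python
# Minimal sketch: what the 5% threshold means for a completely inert drug.
# Assumes numpy and scipy are available; all numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_patients = 10_000, 50
wins = 0          # times the drug group's average outcome beats the placebo group's
significant = 0   # times a two-sample t-test returns p < 0.05

for _ in range(n_trials):
    drug = rng.normal(0.0, 1.0, n_patients)     # inert drug: drawn from the...
    placebo = rng.normal(0.0, 1.0, n_patients)  # ...same distribution as the placebo
    wins += drug.mean() > placebo.mean()
    _, p = stats.ttest_ind(drug, placebo)
    significant += p < 0.05

print(f"drug 'beat' placebo in {wins / n_trials:.0%} of trials")    # roughly 50%
print(f"crossed the 5% threshold in {significant / n_trials:.1%}")  # roughly 5%
```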

Most of the time, though, it isn't used correctly. If you look at a typical paper published in the peer-reviewed literature, you'll see that it's never just a single observation being tested for statistical significance, but handfuls, or dozens, or even a hundred or more. A researcher looking at a painkiller for arthritis sufferers will comb the data to answer question after question: does the drug help a patient's pain? With knee pain? With back pain? With elbow pain? With severe pain? With moderate pain? With moderate to severe pain? Does it help a patient's range of motion? Quality of life? Each one of these questions is tested for statistical significance and, typically, gauged against the industry-standard 5% rule. That is, for any one question there's a five percent chance—one in twenty—that randomness will make a worthless drug seem like it has an effect. But test ten questions, and there's roughly a 40% chance that randomness will, indeed, deceive you on at least one of them. And the typical paper asks more than ten questions, often many more. It's possible to correct for this "multiple comparisons" problem mathematically (though it's not the norm to do so). It's also possible to fight the effect by committing in advance to answer just one main question (though, in practice, such "primary outcomes" are surprisingly malleable). But even these corrections often can't account for the many other things that can undermine a researcher's calculations, such as the way subtle changes in data classification can affect outcomes: is "severe" pain a 7 or above on a 10-point scale, or an 8 or above? Sometimes these issues are overlooked; sometimes they're deliberately ignored or even manipulated.
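
The 40% figure follows from simple arithmetic: if each of ten independent questions has a 95% chance of not producing a fluke, the chance that none of them does is 0.95^10 ≈ 0.60, leaving about a 40% chance of at least one false positive. Below is a short sketch of that calculation, alongside the Bonferroni adjustment, one standard version of the "multiple comparisons" correction mentioned above (illustrative only; it assumes the questions are statistically independent).

```python
# Family-wise false-positive risk when every question is tested at the 5% level,
# plus the Bonferroni cutoff, one standard multiple-comparisons correction.
# Illustrative only; assumes the questions are statistically independent.
alpha = 0.05

for k in (1, 5, 10, 20, 100):
    chance_fooled = 1 - (1 - alpha) ** k   # P(at least one question flukes)
    bonferroni_cutoff = alpha / k          # per-question threshold keeping overall risk near 5%
    print(f"{k:3d} questions: {chance_fooled:5.1%} chance of at least one fluke, "
          f"Bonferroni cutoff {bonferroni_cutoff:.4f}")
```

For ten questions this prints a 40.1% chance, which is where the "roughly 40%" in the paragraph above comes from.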

In the best-case scenario, when statistical significance is calculated correctly, it doesn't tell you much. Sure, chance alone is (relatively) unlikely to be responsible for your observation. But it doesn't reveal anything about whether the protocol was set up correctly, whether a machine's calibration was off, whether the computer code was buggy, whether the experimenter properly blinded the data to prevent bias, whether the scientists truly understood all the possible sources of false signals, whether the glassware was properly sterilized, and so forth and so on. When an experiment goes wrong, it's more than likely that the blame rests not on randomness—on statistical flukes—but on a good old-fashioned screwup somewhere.

When physicists timing neutrinos beamed from CERN claimed to have spotted them moving faster than light, a six-sigma level of statistical significance (and an exhaustive check for errors) wasn't enough to convince smart physicists that the result was anything other than a screwup somewhere. The claim clashed not only with physical law, but with observations of neutrinos coming from supernova explosions. Sure enough, a few months later, the flaw (a subtle one) finally emerged, negating the team's conclusion.

Screwups are surprisingly common in science. Consider, for example, the fact that the FDA inspects a few hundred clinical laboratories each year. Roughly 5% of inspections come back with findings of "significant objectionable conditions and practices" so egregious that the laboratory's data are considered unreliable. Often these practices include outright fraud. And those are just the blindingly obvious problems visible to an inspector; it would be hard to imagine that the real rate of lab screwups isn't double or triple or quintuple that. What is the value of calling something statistically significant at the 5% or 0.3% or even 0.00003% level if there's a 10% or 25% (or greater) chance that the data are gravely undermined by a laboratory error? In this context, even the most iron-clad findings of statistical validity lose their meaning when dwarfed by the specter of error or, worse yet, fraud.

Nevertheless, even though statisticians warn against the practice, it's all too common for a one-size-fits-all finding of statistical significance to be taken as a shortcut to determine if an observation is credible—whether a finding is "publishable." As a consequence, the peer-reviewed literature is littered with statistically significant findings that are irreproducible and implausible, absurd observations with effect sizes orders of magnitude beyond what is even marginally believable.

The concept of "statistical significance" has become a quantitative crutch for the essentially qualitative process of deciding whether or not to take a study seriously. Science would be much better off without it.