2014 : WHAT SCIENTIFIC IDEA IS READY FOR RETIREMENT?

Gary Marcus
Professor of Psychology, Director NYU Center for Language and Music; Author, Guitar Zero
Big Data

No, I don't literally mean that we should stop believing in, or collecting, Big Data. But we should stop pretending that Big Data is magic. There are few fields that wouldn't benefit from large, carefully collected data sets. But lots of people, even scientists, put more stock in Big Data than they really should. Sometimes it seems like half the talk about understanding science these days, from physics to neuroscience, is about Big Data, and associated tools like "dimensionality reduction", "neural networks", "machine learning algorithms" and "information visualization".

Big Data is, without a doubt, the idea of the moment. 39 minutes ago (according to the Big Data that drive Google News), Gordon Moore (for whom Moore's law is named) "Gave Big to Big Data", MIT debuted an online course for Big Data (44 minutes ago), and Big Data was voted strategy+business's Strategy of the Year. Forbes had an article about Big Data a few hours before that. There were 163,000 hits for a search for big+data+science.

But science still revolves, most fundamentally, around a search for the laws that describe our universe. And the one thing that Big Data isn't particularly good at is, well, identifying laws. Big Data is brilliant at detecting correlation; the more robust your data set, the better chance you have of identifying correlations, even complex ones involving multiple variables. But correlation never was causation, and never will be. All the big data in the world by itself won’t tell you whether smoking causes lung cancer. To really understand the relation between smoking and cancer, you need to run experiments, and develop mechanistic understandings of things like carcinogens, oncogenes, and DNA replication. Merely tabulating a massive database of every smoker and nonsmoker in every city in the world, with every detail about when they smoked, where they smoked, how long they lived, and how they died would not, no matter how many terabytes it occupied, be enough to induce all the complex underlying biological machinery.

If it makes me nervous when people in the business world put too much faith in Big Data, it makes me even more nervous to see scientists do the same. Certain corners of neuroscience have taken on an "if we build it, they will come" attitude, presuming that neuroscience will sort itself out as soon as we have enough data.

It won't. If we have good hypotheses, we can test them with Big Data, but Big Data shouldn't be our first port of call; it should be where we go once we know what we are looking for.