Big Data

In an adaption from his new book “Antifragile” Nassim Taleb describes the problem with “big data” for Wired

“Big data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal).”

So more data provides more signal. That’s the whole point of doing large experiments. However, Taleb is warning us that as data size increases, the rate at which noise increases is faster than the rate at which signal increases. Thus, bigger datasets contain a worse signal-to-noise ratio than smaller datasets. This means researchers can easily find completely bogus (but statistically significant) results in big datasets. Given the ubiquity of large experiments in modern science Taleb believes: "Researchers have brought cherry-picking to an industrial level.”

Taleb could be correct here — if it were not for one huge presumption: That researchers don’t follow up big experiments with smaller ones. 

Large biological experiments (e.g. anything ending in “omics”) produce massive datasets impenetrable to the isolated human mind. Unfortunately, no human can look at thousands of unprocessed data points and see the pattern. We can however, apply mathematical modelling to help explain data. 

These models are not the endpoint. They are not the answer. If they were, Taleb would have a point. These models are, in the very classical sense, hypotheses. Just as a human can look at several data points from a small experiment and derive a testable hypothesis, computational interpretation of big datasets can be used to produce a testable hypothesis from thousands of data points. 

And where there is a hypothesis, there is an experiment to test it.  

The needle may come “in an increasingly larger haystack." We just use a robot with a magnet to help sift it — then check the needle isn’t hay. It’s science.

Maybe Taleb’s cynicism comes from a career in the financial sector. It’s not a domain famous for reproducible, predictive or testable empiricism.