Tim Harford has a good piece in the FT about the promise and perils of “Big Data” (Big Data: are we making a big mistake?). The article is a balanced view of the value that the latest data-driven analyses can provide, as well as the easy ways in which the same data can mislead us into drawing incorrect conclusions.
As the author notes:
Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.
Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge university, they can be “complete bollocks. Absolute nonsense.”
Mr. Spiegelhalter is correct. In fact, my guess is that many of today’s efforts to work with big data sets raise more questions than they answer. Why? Because statistical understanding is not advancing at the same pace as the technology, which means that in many cases the data are ahead of our ability to understand them.
To illustrate this point, Harford notes the following example:
In 2005, John Ioannidis, an epidemiologist, published a research paper with the self-explanatory title, “Why Most Published Research Findings Are False”. The paper became famous as a provocative diagnosis of a serious issue. One of the key ideas behind Ioannidis’s work is what statisticians call the “multiple-comparisons problem”.
It is routine, when examining a pattern in data, to ask whether such a pattern might have emerged by chance. If it is unlikely that the observed pattern could have emerged at random, we call that pattern “statistically significant”.
The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.
There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.
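The multiple-comparisons problem is easy to demonstrate with a small simulation. The sketch below (a hypothetical illustration, not taken from the article) mirrors the vitamin trial above: it generates many outcome measures for two groups drawn from the *same* distribution, so the “vitamins” have no effect whatsoever, and then counts how many outcomes nevertheless look statistically significant at the conventional 5% level.

```python
import random
import statistics

random.seed(42)

def t_stat(a, b):
    """Two-sample t statistic for equal-sized groups (pooled variance)."""
    n = len(a)
    mean_diff = statistics.mean(a) - statistics.mean(b)
    se = ((statistics.variance(a) + statistics.variance(b)) / n) ** 0.5
    return mean_diff / se

N_CHILDREN = 50    # children per group (hypothetical)
N_OUTCOMES = 1000  # height, weight, tooth decay, test scores, ...

significant = 0
for _ in range(N_OUTCOMES):
    # Both groups are drawn from the SAME distribution:
    # by construction, the treatment does nothing.
    vitamins = [random.gauss(0, 1) for _ in range(N_CHILDREN)]
    placebo = [random.gauss(0, 1) for _ in range(N_CHILDREN)]
    if abs(t_stat(vitamins, placebo)) > 1.98:  # roughly p < 0.05
        significant += 1

print(f"{significant} of {N_OUTCOMES} null outcomes look 'significant'")
```

By construction every “discovery” here is a fluke, yet roughly 5% of the comparisons clear the significance bar anyway. With a large data set offering thousands of possible comparisons, that steady trickle of false positives is exactly how noise drowns out signal.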
I often tell clients that there are three challenges in any analytical effort:
- Acquisition: getting the right data
- Calculation: executing the right operations
- Interpretation: reaching the right conclusions
The first two are often hard, but the third, which is where the real value is, can be the hardest of all. Harford’s solid piece is a good reminder that this third challenge will only get harder as wave after wave of data arrives to challenge our ability to understand whatever message it may carry.