Billions of data points won’t necessarily save us from the easy pitfalls of misinterpreting our results, or from the dangers of hidden biases that lurk in the construction of our original massive datasets, warned technology reporter and Ph.D. student Jesse Dunietz in a startling essay last week, “How Big Data Creates False Confidence.”
“It’s tempting to think that with such an incredible volume of data behind them, studies relying on big data couldn’t be wrong,” he wrote. “But the bigness of the data can imbue the results with a false sense of certainty.
“Many of them are probably bogus — and the reasons why should give us pause about any research that blindly trusts big data.”
Earlier this month, MIT released an enthusiastic article about “data capital” (calling the actionable intelligence “the single biggest asset at most organizations”). But now Dunietz weighs in with a word of caution about “big data hubris.” And he comes up with some telling examples:
Google’s “Ngram Viewer” tool allows searches through massive troves of digitized books and plots the frequency of phrases across the centuries. The tool’s blue button tempts visitors to “Search Lots of Books” — Google reportedly has scanned about 4 percent of all the books ever published in the world. But Dunietz cites the work of researchers at the University of Vermont, who last fall noted that Google’s data set now includes a significant chunk of scientific texts. “The result is a surge of phrases typical to academic articles but less common in general,” they wrote, adding that the corpus of books is still “important,” but more as a reference lexicon than a library of popular books.
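To make the mechanics concrete, here is a minimal Python sketch of the kind of computation an Ngram-style plot rests on: count a phrase’s occurrences per year and divide by all the word sequences of that length published that year. The tiny in-memory corpus is invented purely for illustration; it is not Google’s data or any official API.

```python
# A minimal sketch of an Ngram-style calculation: for each year, count how often
# a phrase appears and normalize by the total number of same-length word
# sequences published that year. The toy corpus below is invented.

corpus = {  # year -> list of book texts (made-up data, not Google's corpus)
    1900: ["the blue line ran along the coast", "hope springs eternal"],
    1950: ["a blue line on the chart", "the blue line rose sharply"],
    2000: ["statistically significant results", "p values and confidence intervals"],
}

def phrase_frequency(corpus, phrase):
    """Relative frequency of `phrase` per year: matches / total n-grams of that length."""
    n = len(phrase.split())
    freq = {}
    for year, books in corpus.items():
        matches, total = 0, 0
        for text in books:
            words = text.split()
            ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
            matches += sum(1 for g in ngrams if g == phrase)
            total += len(ngrams)
        freq[year] = matches / total if total else 0.0
    return freq

print(phrase_frequency(corpus, "blue line"))
# A corpus that quietly fills up with scientific papers shifts these ratios,
# even if popular usage of a phrase never changed.
```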
While Google’s smaller collection of English-language fiction avoids the pitfall of too much scientific literature, “Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus,” the Vermont researchers wrote, “and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.”
And that’s not the only issue. With just one copy of every book, the most obscure books enjoy equal authority with books that were much more popular, widely read, and influential in their time. “From the data on English fiction, for example, you might conclude that for 20 years in the 1900s, every character and his brother was named Lanny,” Dunietz writes. But the uptick is apparently just a weird data artifact created when Upton Sinclair published 11 different novels about Lanny Budd between 1940 and 1953.
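The artifact is easy to reproduce in miniature. The Python sketch below, built on entirely made-up book texts, shows how raw mention counts let one prolific author dominate a decade, while counting the distinct books that use a name tells a different story.

```python
# Toy illustration (invented data) of the "Lanny" artifact: raw mention counts
# let a single prolific author dominate, while the number of distinct books
# that use a name at all paints a different picture.

books_1940s = {
    "Lanny Budd novel #1": "Lanny " * 500,   # one author, many sequels
    "Lanny Budd novel #2": "Lanny " * 450,
    "Unrelated novel A": "George " * 5 + "Mary " * 5,
    "Unrelated novel B": "George " * 3,
}

def raw_mentions(books, name):
    """Total occurrences of `name` across all books (what a frequency plot sees)."""
    return sum(text.split().count(name) for text in books.values())

def books_containing(books, name):
    """Number of distinct books that mention `name` at least once."""
    return sum(1 for text in books.values() if name in text.split())

for name in ("Lanny", "George"):
    print(f"{name}: {raw_mentions(books_1940s, name)} mentions "
          f"in {books_containing(books_1940s, name)} books")
# By raw frequency, "Lanny" looks like the name of the era; by book count,
# it is clearly one author's recurring hero.
```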
At what point does it stop being science? XKCD cartoonist Randall Munroe once shared a chart from the tool in which a single blue line showed the frequency of the phrase “blue line” in Google’s books over the last 45 years. He created a webpage with dozens of charts, showing the frequency of everything from a four-letter word to uses of the word “hope.” The phrase “upward trend” has apparently been trending downwards over the last 45 years, while “explosion in popularity” has, in fact, seen an explosion in popularity since 1975. But Munroe seems to be hinting at a similar satirical point: there’s something almost arbitrarily authoritative in the ability to generate century-long graphs from unusually large data sets.
“A few decades ago, evidence on such a scale was a pipe dream,” Dunietz wrote. “Today, though, 150 billion data points is practically passé…” For another cautionary example, he cites the surprisingly mixed results from Google’s short-lived “Flu Trends” tool, which initially claimed a 97 percent accuracy rate but later turned out to be wildly off. “It turned out that Google Flu Trends was largely predicting winter,” writes Dunietz.
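To see why “predicting winter” can look like success, here is a purely synthetic Python sketch (all numbers invented): flu-like activity is so strongly seasonal that a “model” that knows only the week of the year still correlates impressively with it.

```python
# Synthetic illustration of "predicting winter": flu-like activity is strongly
# seasonal, so a model that has only learned the calendar can still look
# impressively accurate. All numbers here are made up.

import math, random

random.seed(0)
weeks = list(range(104))  # two years of weekly data
# Fake "flu activity": a winter peak plus noise.
flu = [5 + 4 * math.cos(2 * math.pi * (w % 52) / 52) + random.gauss(0, 0.8) for w in weeks]

# A "model" that knows nothing about flu, only the week of the year.
seasonal_baseline = [5 + 4 * math.cos(2 * math.pi * (w % 52) / 52) for w in weeks]

def correlation(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print("correlation with a calendar-only model:", round(correlation(flu, seasonal_baseline), 3))
# A high score here says little about whether search queries track actual flu cases.
```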
But this anecdote leads him to a final note of optimism about our pursuit of patterns, since another researcher, at Columbia University, was able to outperform both the CDC and Google algorithms simply by combining both sets of data and then using one to fine-tune the predictions of the other. “When big data isn’t seen as a panacea, it can be transformative…” Dunietz concluded.
“All it takes is for teams to critically assess their assumptions about their data.”
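The Columbia study’s actual method isn’t spelled out here, but the general idea of letting one data source correct the other can be sketched in a few lines of Python with made-up numbers: treat the timely search signal and the lagged official counts as inputs to a simple regression and let it learn how much weight each deserves.

```python
# Synthetic sketch (not the Columbia study's actual method or data): blend a
# timely-but-noisy "search" signal with accurate-but-lagged official counts,
# fitting the blend weights on an early stretch of weeks and testing on the rest.

import numpy as np

rng = np.random.default_rng(0)
weeks = 104
truth = 5 + 4 * np.cos(2 * np.pi * np.arange(weeks) / 52) + rng.normal(0, 0.5, weeks)

search_nowcast = truth + rng.normal(0, 1.5, weeks)               # current week, but noisy
official_lagged = np.roll(truth, 2) + rng.normal(0, 0.3, weeks)  # accurate, two weeks late

X = np.column_stack([search_nowcast, official_lagged, np.ones(weeks)])
train, test = slice(0, 80), slice(80, weeks)
coef, *_ = np.linalg.lstsq(X[train], truth[train], rcond=None)   # ordinary least squares
blend = X @ coef

for name, pred in [("search only", search_nowcast),
                   ("lagged official only", official_lagged),
                   ("blend", blend)]:
    print(f"{name}: mean abs error {np.mean(np.abs(pred[test] - truth[test])):.2f}")
# With weights fitted this way, the blend typically tracks the truth more closely
# than either source alone, which is the spirit of using one data set to
# fine-tune the other.
```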
WebReduce
- Austin’s new disruptive micro-housing startup
- Parts of Wikipedia may get delivered to the moon by an X-Prize team (J.J. Abrams released a documentary about Google’s civilian “Moon Shot” contest in March)
- Fired Reddit staffers launch a “warmer, fuzzier” competing service
- The rise of pirate libraries
- Using artificial intelligence to fight pancreatic cancer
- “Watch me control my Tesla with Amazon Echo.”