Big data evangelists love the story of Google Flu Trends. In 2008 the search engine company created a feature that geographically tracks searches for flu-related words over time—presumably words like cough or fever (Google doesn’t release the exact words). The thinking was that people who were developing flu symptoms would turn to Google to self-diagnose, and looking at where people were doing that would create a real-time map of flu epidemics. By contrast, the Centers for Disease Control and Prevention and other public health agencies take up to two weeks to come out with their numbers. It was a perfect example of the way the enormous mass of digital footprints people leave on the Web could be sifted for helpful—in this case potentially life-saving—information. The fact that the dataheads at Google (GOOG) were showing up the eminent epidemiologists at the CDC only added to the story’s allure.
It turns out, however, that Google Flu Trends isn’t actually that good at tracking flu trends. A news article in the journal Nature a year ago found that Google’s flu tracker was reporting roughly twice as many cases as labs were reporting to the CDC—the CDC numbers aren’t perfect, either, but since Google Flu Trends is set up to predict the CDC numbers in advance, that divergence is, by definition, a failure. A paper just out in the journal Science finds that the overestimation has been persistent. “GFT also missed by a very large margin in the 2011-2012 flu season and has missed high for 100 out of 108 weeks starting with August 2011,” the authors of the new paper write. The program also entirely missed the nonseasonal 2009 A-H1N1 outbreak. The Science paper is an attempt to figure out why that happened.
Part of the problem, the Science researchers argue, is what’s known in statistics as “overfitting”—in essence, spurious correlations. The Google flu tracker was created by comparing the frequency of 50 million different search terms to the known incidence of the flu and seeing which ones matched. If searches for certain terms peaked when flu cases peaked, then going forward, when Google spotted a rise in those search terms, it assumed flu cases were also rising. But with 50 million candidate terms, there were bound to be some that matched even though they had nothing to do with the flu—it’s the million-monkeys-on-a-million-typewriters scenario.
Among the examples Google’s own developers reported were search terms related to high school basketball. Does high school basketball cause the flu? No. But they both take place mostly in the winter, so the curves representing the frequency and timing of basketball- and flu-related searches lined up nicely.
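The basketball problem is easy to reproduce with purely synthetic data. The sketch below (a simplified illustration, not Google’s actual method—the seasonal curve and the noise series are made up) generates a fake flu-incidence curve plus 50,000 candidate “search term” series of pure random noise, then correlates each candidate with the flu curve. Even though none of the candidates has anything to do with the flu, the best one matches fairly well just by chance—and with 50 million candidates instead of 50,000, the best chance match would look better still.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two years of weekly data: a synthetic seasonal "flu incidence" curve.
n_weeks = 104
weeks = np.arange(n_weeks)
flu = np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 0.1, n_weeks)

# 50,000 candidate "search term" series of pure noise --
# none of them has any real relationship to the flu curve.
n_terms = 50_000
candidates = rng.normal(size=(n_terms, n_weeks))

# Pearson correlation of each candidate with the flu curve,
# computed by standardizing both sides and taking dot products.
flu_c = (flu - flu.mean()) / flu.std()
cand_c = (candidates - candidates.mean(axis=1, keepdims=True)) \
    / candidates.std(axis=1, keepdims=True)
corrs = cand_c @ flu_c / n_weeks

best = np.abs(corrs).max()
print(f"best |correlation| among {n_terms} noise terms: {best:.2f}")
```

With only 104 weekly observations, the winner among 50,000 noise series typically correlates with the flu curve at around 0.4—high enough to get picked up by a fitting procedure that screens millions of terms, which is exactly the overfitting trap the Science authors describe.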
Realizing this, Google’s developers excluded high school basketball terms and some others from the flu tracker’s setup, but over time the program’s predictions continued to run high. The Science paper argues that this is because of a second problem, one that springs from the nature of Google search. Google is constantly adjusting and refining its search algorithm. A search for “fever” or “flu symptoms” in 2008 would return a very different results page from one today. In particular, Google today is more likely to suggest additional related search terms when a search is made—public health considerations aside, Google has an incentive to get people to do as many searches (and see as many ads) as possible. This could increase the number of flu-related Google searches and make it look to the flu tracker as if there are more actual cases.
According to the Science researchers, the flu tracker’s programmers failed to adapt its algorithm to the changes in the Google search engine algorithm that was feeding it information. David Lazer, a Northeastern University political scientist and one of the authors of the new paper, refers to this as “model drift.”
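The drift mechanism can also be sketched with toy numbers. In the hypothetical setup below (the multipliers are invented for illustration; they are not Google’s real figures), a model learns the old relationship between search volume and flu cases, then the search engine changes so that the same number of cases now generates 50 percent more flu-related searches. The frozen model duly overpredicts by about 50 percent:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training period: true flu cases drive search volume
# at a fixed rate (here, 2 searches per case, plus noise).
true_cases = rng.uniform(100, 1000, size=208)  # four years of weeks
searches_old = 2.0 * true_cases + rng.normal(0, 20, 208)

# Fit a linear model mapping searches back to cases.
slope, intercept = np.polyfit(searches_old, true_cases, 1)

# Later, suppose the engine starts suggesting flu-related queries,
# so the same case counts now generate 3 searches per case.
searches_new = 3.0 * true_cases + rng.normal(0, 20, 208)
predicted = slope * searches_new + intercept

overestimate = predicted.mean() / true_cases.mean()
print(f"mean predicted / actual cases: {overestimate:.2f}")
```

The model itself never broke—the world it was trained on did, which is why the fix has to be ongoing recalibration rather than a one-time cleanup of bad search terms.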
“The Google flu team sort of neglected some of the basic lessons that we know about data, like that the relationship between flu-related searches and the prevalence of the flu might change over time,” he says. Lazer and his co-authors were not able to reach the Google programmers whose work they are critiquing, he says. When I contacted Google, a spokesperson sent this response: “We review the Flu Trends model each year to determine how we can improve—our last update was made in October 2013 in advance of the 2013-2014 flu season. We welcome feedback on how we can continue to refine Flu Trends to help estimate flu levels.”
There are broader lessons, Lazer argues, in the case of the faulty flu tracker. One is that there are risks that come from doing scientific research using things like Google or Twitter or Facebook that decidedly weren’t designed to be research tools, that are constantly evolving, and whose inner workings are far from transparent. The other is that having lots and lots of data is not the same thing as having good data.
(Updated with Google comment in seventh paragraph.)