NEWS 22 April 2014

Wikipedia: a better guide to flu trends than Google


US — Wikipedia traffic might be a more reliable guide to the prevalence of flu-like illnesses in the US than Google searches, according to a new study.


In a paper published in PLOS Computational Biology, David McIver and John Brownstein explain how they monitored daily page views for particular Wikipedia articles and compared these figures with official flu-like activity levels provided by the Centers for Disease Control and Prevention (CDC).

McIver developed a Poisson model that enabled him to estimate the level of influenza-like illness (ILI) activity up to two weeks ahead of the CDC, with an absolute average difference between the two estimates of just 0.27% over 294 weeks of data.
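For readers unfamiliar with the technique, a Poisson model relates a count (here, ILI activity) to one or more predictors through a log link. The sketch below is a minimal illustration only, not the authors' model: it uses a single made-up predictor, invented numbers, and a hypothetical helper named `fit_poisson` that fits the two coefficients by Newton's method.

```python
import math

def fit_poisson(x, y, iterations=25):
    """Fit log(mean) = b0 + b1 * x by Newton's method (illustrative only)."""
    b0 = math.log(sum(y) / len(y))  # start from the overall mean count
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score: gradient of the Poisson log-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information: negative Hessian of the log-likelihood
        i00 = sum(mu)
        i01 = sum(mi * xi for xi, mi in zip(x, mu))
        i11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system explicitly
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    return b0, b1

# Invented data: weekly page-view index and ILI case counts (assumptions)
page_view_index = [0.1, 0.2, 0.3, 0.4, 0.5]
ili_cases = [2, 3, 4, 6, 9]

b0, b1 = fit_poisson(page_view_index, ili_cases)
predicted = [math.exp(b0 + b1 * xi) for xi in page_view_index]

# Average absolute gap between fitted and observed values
mean_abs_diff = sum(abs(p - y) for p, y in zip(predicted, ili_cases)) / len(ili_cases)
```

A real analysis would use many predictors and a statistics library rather than a hand-written solver; the point here is only the shape of the calculation.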

“Wikipedia-derived ILI models performed well through both abnormally high media coverage events (such as during the 2009 H1N1 pandemic) as well as unusually severe influenza seasons (such as the 2012-2013 influenza season),” wrote McIver and Brownstein.

They added that Wikipedia usage “accurately estimated the week of peak ILI activity 17% more often than Google Flu Trends data and was often more accurate in its measure of ILI intensity”.

Google Flu Trends bases its estimates of flu activity on aggregated Google search data. As McIver and Brownstein explain: “Google Flu Trends was initially quite successful in its estimation of ILI activity, but was shown to falter in the face of the 2009 H1N1 swine influenza pandemic due to much-increased levels of media attention surrounding the pandemic.” It also over-estimated ILI activity in the 2012-2013 influenza season for the same reason.

Increased media coverage of flu muddied Google’s estimates as searches by healthy people increased – making it difficult to compare search volume to actual flu prevalence. McIver sought to get round this with his Wikipedia system by factoring in views for irrelevant articles, which would act as markers for the general background-level activity of normal usage.
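One simple way to picture the idea of a background marker is to express flu-related views as a share of views on unrelated pages. This is a simplified sketch under stated assumptions, not the method in the paper: the function name `flu_share` and all of the figures are invented for illustration.

```python
def flu_share(flu_views, background_views):
    """Flu-page views as a fraction of background views, day by day."""
    return [f / b for f, b in zip(flu_views, background_views)]

# Invented daily counts (assumptions): overall traffic more than doubles
flu_views = [1200, 1500, 3000]
background_views = [100000, 125000, 250000]

shares = flu_share(flu_views, background_views)
```

In this made-up example the raw flu-page views rise from 1,200 to 3,000, but the share stays the same on all three days, so the rise reflects heavier site usage overall rather than more interest in flu.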

The full paper is available online. It also contains references to other papers that show how Wikipedia page view data can be used to track “trending” topics and to monitor the emergence of breaking news stories.

Meanwhile, Microsoft’s search engine Bing is trying its hand at prediction – using search activity to forecast the results of voting shows such as The Voice and Dancing With The Stars. How does it work? According to a blog post: “The central idea behind the direct approach is that winners and losers correspond to popularity. In broad strokes, we define popularity as the frequency and sentiment of searches combined with social signals and keywords. Placing these signals into our model, we can predict the outcome of an event with high confidence.”