When big data doesn’t equal big knowledge

Big data analytics has been touted for its ability to find signals in a sea of noise, but it cannot tell what those signals mean. Without a solid grasp of what data is being mined, how accurate it is, and why and how it is being mined, big data can end up causing more problems than it solves.

This problem is most acute in the public health arena, where the amount of data is increasing exponentially.

“Paradoxically, the proportion of false alarms among all proposed ‘findings’ may increase when one can measure more things,” Muin Khoury and John Ioannidis wrote in a recent report, “Big Data Meets Public Health.” That’s what happened when Google dramatically overestimated peak flu levels, basing the analysis on flu-related Internet searches.
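The arithmetic behind the false-alarm point can be sketched in a few lines of Python. This is an illustration only, not a calculation from the Khoury and Ioannidis report: the function name and every number in it (10 real effects, a 5 percent false-positive rate per test, an 80 percent chance of detecting a real effect) are assumptions chosen to show the trend.

```python
def expected_false_alarm_share(n_measured, n_real=10, alpha=0.05, power=0.8):
    """Expected share of flagged 'findings' that are false alarms.

    All parameters are illustrative assumptions:
      n_measured: how many things are measured and tested
      n_real:     how many of them reflect a real effect
      alpha:      chance that a null measurement is flagged anyway
      power:      chance that a real effect is flagged
    """
    if n_real <= 0 or power <= 0:
        raise ValueError("need at least one detectable real effect")
    if n_measured < n_real:
        raise ValueError("n_measured must be at least n_real")
    false_alarms = alpha * (n_measured - n_real)  # expected false positives
    true_hits = power * n_real                    # expected true positives
    return false_alarms / (false_alarms + true_hits)


# Hold the number of real effects fixed and measure more and more things.
for n in (100, 1_000, 10_000):
    print(n, round(expected_false_alarm_share(n), 2))
# 100 0.36
# 1000 0.86
# 10000 0.98
```

Under these assumptions, the number of real effects stays the same while the pool of chance hits grows with every added measurement, so false alarms make up a larger and larger share of what gets flagged.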

Analytics, in other words, is only as good as its data foundation -- which, in some cases, is shaky. “Research accuracy is dictated by the weakest link,” the authors said, with current analytics often based on “convenient samples of people or information available on the Internet.”

Information gleaned from the Internet needs to be integrated with other data and interpreted with “knowledge management, knowledge synthesis and knowledge translation,” Khoury and Ioannidis stated. Machine learning algorithms can help -- although, again, as Microsoft learned when its Twitter bot Tay went off the rails, parameters must be set to avoid havoc when data is collected.

Put another way, big data is a collection of “raw observations that have limited value by themselves. What gives a raw observation value…is placing it in an interpretive context to yield information,” wrote Dr. Ida Sim in an article in the Annals of Internal Medicine. “An algorithm may detect a pattern in a database but have no way of recognizing whether the result is true, spurious or affected by bias.”

Getting solid results from big data analytics takes more than the data itself. Using big data for true precision medicine requires not only “clean, complete and standardized datasets” but also cooperation and collaboration among those involved, such as the federal government, research organizations and health IT developers, Jennifer Bresnick wrote in an article in Health IT Analytics.

As the amount of data continues to grow, so will the problem of incorrect analysis.

However, Khoury and Ioannidis said, “the combination of a strong epidemiologic foundation, robust knowledge integration, principles of evidence-based medicine and an expanded translation research agenda can put big data on the right course.”