Energy lab team explores new ways of analyzing social media

A team at the Energy Department’s Pacific Northwest National Laboratory is creating an analysis tool that makes spotting trends in social media faster and easier.

The tool, called SALSA, for SociAL Sensor Analytics, takes an automated approach that can sort through billions of tweets or other posts in seconds, identifying patterns and finding key bits of information that could be useful in emergency response, public health and other areas, according to PNNL.

SALSA draws on the computing power of PNNL’s 600-node, 162-Teraflop Olympus supercomputing cluster, so the processing power is there. The challenge was finding a way to dig useful information out of all the banter on platforms that also have a non-traditional lexicon.

“The task we all face is separating out the trivia, the useless information we all are blasted with every day, from the really good stuff that helps us live better lives,” Court Corley, a PNNL data scientist who’s leading the research, said in PNNL’s report. “There’s a lot of noise, but there’s some very valuable information too.”

Corley said the reach of social media creates value in analyzing posts to various platforms. “The world is equipped with human sensors — more than 7 billion and counting,” he said. “It’s by far the most extensive sensor network on the planet. What can we learn by paying attention?”

Agencies have employed systems to monitor social media, such as the U.S. Geological Survey’s Twitter Earthquake Detector (TED). The system was created in 2009 because Twitter often spreads the word about earthquakes and other disasters faster than traditional sensor methods. But one reason TED works is that USGS taps into Twitter’s API to search for a known keyword — “earthquake.” Taking a deeper dive into analyzing social media is more of a challenge.

The Library of Congress has run into that difficulty with its Twitter archive. LOC has collected hundreds of billions of tweets for use in research, but found that available tools for sifting through the archive are inadequate. Searches take too much time (just searching the 20 billion tweets in its initial collection would take 24 hours), and as a result LOC was turning down requests from researchers to explore the archive.

And the amount of social media-generated data is only going to grow. PNNL points out that, as of mid-2012, information posted to social media outlets each hour included an average of 30 million comments, 25 million search queries, 7.1 million photos and 453 years of video footage.

Corley is making progress on that front with his automated approach. Using Olympus’ processing power, PNNL can go through 20 billion entries from a two-year period in less than 10 seconds, PNNL said.
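To put those figures side by side, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted above (20 billion entries, 24 hours for the LOC search, 10 seconds as the upper bound PNNL cites). The function and variable names are illustrative, and the comparison ignores differences in hardware and in what each search actually does.

```python
# Figures quoted in the article; everything else here is illustrative.
ENTRIES = 20_000_000_000          # 20 billion tweets/entries
LOC_SECONDS = 24 * 60 * 60        # 24 hours for the LOC search
PNNL_SECONDS = 10                 # upper bound: "less than 10 seconds"


def throughput(entries: int, seconds: float) -> float:
    """Average entries processed per second."""
    return entries / seconds


loc_rate = throughput(ENTRIES, LOC_SECONDS)
pnnl_rate = throughput(ENTRIES, PNNL_SECONDS)
speedup = LOC_SECONDS / PNNL_SECONDS

print(f"LOC:  about {loc_rate:,.0f} entries per second")
print(f"PNNL: at least {pnnl_rate:,.0f} entries per second")
print(f"Speedup: at least {speedup:,.0f}x")
```

Because PNNL reports "less than 10 seconds," the PNNL rate and the speedup are lower bounds.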

Meanwhile, he and his team are developing ways to analyze the data, establishing what constitutes baseline activity and routine patterns in order to identify anything out of the ordinary, PNNL said. They’re also accounting for languages, analyzing data in more than 60 languages as well as the unconventional argot of social media.
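PNNL has not published the details of how SALSA establishes a baseline, so the following is only a minimal sketch of the general idea, not the lab's method. It assumes hourly counts of posts mentioning some topic, treats the trailing window as "routine" activity, and flags any hour that strays more than a chosen number of standard deviations from it. The function name, window size, threshold and sample data are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(counts, window=24, threshold=3.0):
    """Return indices whose count deviates sharply from the trailing baseline.

    Hypothetical illustration: the baseline is the mean of the previous
    `window` counts, and "out of the ordinary" means more than `threshold`
    standard deviations away from that mean.
    """
    anomalies = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all stands out.
            if counts[i] != mu:
                anomalies.append(i)
            continue
        z_score = (counts[i] - mu) / sigma
        if abs(z_score) > threshold:
            anomalies.append(i)
    return anomalies


# Made-up hourly mention counts with one obvious spike at index 12.
hourly_mentions = [100, 104, 98, 101, 97, 103, 99, 102, 100, 96, 105, 99, 400, 101]
print(find_anomalies(hourly_mentions, window=12))
```

A production system would also have to handle daily and weekly cycles, multiple languages and slang, which this sketch does not attempt.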

Although work on the program is continuing, PNNL said Corley’s program has more than 90 percent accuracy at predicting patterns. So far, most of the work has focused on public health, such as following the spread of the H7N9 flu virus in China. Other potential uses include feeding real-time information to emergency crews during disasters or detecting patterns to predict social unrest or other events.

And if the lab’s search methods continue to improve, perhaps even to the point that they don’t require a supercomputer, they also could become useful research tools for a growing number of social media archives.