FEATURE

Fortune-telling

and other uses of big data at Virginia Tech

A group of scientists at the Virginia Bioinformatics Institute is using network dynamics to analyze the patterns of people's movements to help decision-making in the face of a natural disaster or epidemic outbreak.

So long, crystal balls.

Through the use of big data, Naren Ramakrishnan and his team from the computer science department's Discovery Analytics Center (DAC) may make forecasting the future as commonplace as forecasting the weather.

The term "big data" refers to the use of algorithms and other tools to train computers to spot trends in collections of information that are too massive and complex to analyze with traditional methods. The proliferation of data has accelerated with the integration of computers into our daily lives, from social media on our phones to the tracking of buying habits at the grocery store.

Virginia Tech's efforts stand at the forefront of the big data movement, with labs and professors across the commonwealth conducting increasingly data-driven research as the university looks to build additional capacity for future initiatives. Maintaining a strong presence in Blacksburg as well as in the National Capital Region allows for significant collaborations in the domains of intelligence analysis, national security, and health informatics.

"To Virginia Tech's researchers, big data represents an important opportunity to create knowledge and provide insight by leveraging large, potentially unstructured data sets," said Scott Midkiff, the university's vice president for information technology and chief information officer and a professor in the Bradley Department of Electrical and Computer Engineering.

Projects like DAC's EMBERS and the Virginia Bioinformatics Institute's (VBI) Network Dynamics and Simulation Science Laboratory (NDSSL), which simulates disasters to evaluate emergency response and disaster preparedness policies, are telling examples of big data's potential.

Forecasting the future

EMBERS, the acronym for "early model-based event recognition using surrogates," provides a continual, automated analysis of open-source data—everything from Facebook posts and website searches to satellite images and restaurant reservations made online—to forecast significant societal events such as disease outbreaks, domestic crises, and elections in countries around the globe.

Once a trend or pattern is recognized, EMBERS applies thresholds learned by the algorithms that process past data and events. If the threshold is met, an alert is sent to a third party for evaluation. Training the computers to recognize trends is not very different from teaching an email system to recognize spam, said Ramakrishnan, the Thomas L. Phillips Professor of Engineering and DAC's director.

"The science of big data is about designing algorithms that can transform raw data into actionable knowledge or intelligence," Ramakrishnan said. "There isn't one specific, magic algorithm or threshold in EMBERS. There are a variety of data filters and distinct models trained to identify different patterns. All these models' outputs are then fused into the final model that forecasts the event and produces the alert."
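The fusion step Ramakrishnan describes can be pictured with a small sketch. This is not the EMBERS implementation: the model names, the weights, the weighted-average fusion rule, and the threshold values below are all hypothetical, chosen only to show how several models' outputs might be combined into one score that either does or does not trigger an alert.

```python
from dataclasses import dataclass


@dataclass
class ModelOutput:
    name: str           # hypothetical model label, not an EMBERS component
    probability: float  # this model's estimated chance of the event, 0 to 1
    weight: float       # hypothetical trust weight for this model


def fuse(outputs):
    """Combine model outputs with a weighted average (an assumed fusion rule)."""
    total_weight = sum(o.weight for o in outputs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(o.probability * o.weight for o in outputs) / total_weight


def should_alert(outputs, threshold):
    """Return True when the fused score meets the (assumed) learned threshold."""
    return fuse(outputs) >= threshold


# Illustrative numbers only.
outputs = [
    ModelOutput("social_media_volume", 0.80, 2.0),
    ModelOutput("news_keywords", 0.60, 1.0),
    ModelOutput("planned_event_mentions", 0.40, 1.0),
]
fused_score = fuse(outputs)
print(f"fused score: {fused_score:.2f}, alert: {should_alert(outputs, 0.60)}")
```

In a real system the weights and threshold would be learned from past data and events, as the article describes, rather than set by hand.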

EMBERS now sends 40 to 50 alerts per day to its clients.

EMBERS: Student protests in Venezuela

EMBERS successfully forecast student-led protests in Venezuela that began after the attempted rape of a student but grew into broader protests against police brutality and other issues. EMBERS also forecast that the protests would turn violent and spread to multiple cities.

Spread of protests in Venezuela, January and February 2014


"In EMBERS, when we say forecasting, we really are forecasting," Ramakrishnan said. "A lot of projects have the benefit of hindsight, and [people] look back and say, 'Oh, we could have predicted that,' but we send forecasts before the event happens."

Rather than filtering just a few hundred emails, though, EMBERS has sorted through more than 21 terabytes of data since its inception, while looking at only a small portion of the world. For perspective, 1 terabyte of storage can hold 1,000 to 5,000 movies.

EMBERS processes between 200 and 2,000 messages (each a tweet, news item, blog post, or stock value) per second. With such a breadth of information, inaccuracies such as rumors, spam, and news stories that are later retracted are inevitable. However, EMBERS' algorithms are designed to weed out misinformation, Ramakrishnan said.
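To put the per-second figures in daily terms, a quick back-of-the-envelope calculation helps. It assumes the quoted rates are sustained around the clock, which the article does not state, so the results are rough bounds rather than reported figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Rates quoted in the article, in messages per second.
low_rate, high_rate = 200, 2_000

# Assumption: the rate holds steady for a full day.
low_per_day = low_rate * SECONDS_PER_DAY
high_per_day = high_rate * SECONDS_PER_DAY

print(f"{low_per_day:,} to {high_per_day:,} messages per day")
```

Under that assumption, the stream amounts to somewhere between about 17 million and 173 million messages a day.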

Not surprisingly, EMBERS is getting attention from the federal government; the project is funded by the Intelligence Advanced Research Projects Activity (IARPA), which is part of the Office of the Director of National Intelligence. DAC was one of three teams chosen to compete in IARPA's Open Source Indicators (OSI) program. Starting in April 2012, DAC's team vied for full funding from IARPA, alongside industry competitors Massachusetts-based Raytheon BBN Technologies and California-based HRL Laboratories.

For two years, the three teams focused their forecasts on about 20 countries in Latin America. EMBERS accurately forecast several events there, including riots following the impeachment of Paraguay's president in 2012, Hantavirus outbreaks in Chile and Argentina in 2013, and elections in Panama and Colombia in 2014.

IARPA monitored the three teams' progress while an independent government contractor assessed the quality of forecasts. Each month, EMBERS and the other teams would receive a scorecard evaluating their forecasts based on five criteria: lead time, mean probability score, quality score, recall, and precision.
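Three of the five scorecard criteria have widely used textbook definitions, sketched below. The article does not give the contractor's exact formulas, so these are the standard versions and may differ from the ones actually used; the dates and counts are made up for illustration. Mean probability score and quality score are omitted because their definitions are not described here.

```python
from datetime import date


def precision(matched_forecasts, total_forecasts):
    """Share of issued forecasts that matched a real event."""
    return matched_forecasts / total_forecasts if total_forecasts else 0.0


def recall(matched_events, total_events):
    """Share of real events that were forecast in advance."""
    return matched_events / total_events if total_events else 0.0


def mean_lead_time_days(pairs):
    """Average days between a forecast's issue date and the event date.

    pairs: list of (issued, occurred) date tuples for matched forecasts.
    """
    leads = [(occurred - issued).days for issued, occurred in pairs]
    return sum(leads) / len(leads)


# Hypothetical month: 4 forecasts issued, 5 events occurred, 3 matches.
matched = [
    (date(2014, 2, 1), date(2014, 2, 8)),    # 7 days of warning
    (date(2014, 2, 3), date(2014, 2, 12)),   # 9 days of warning
    (date(2014, 2, 10), date(2014, 2, 15)),  # 5 days of warning
]
month_precision = precision(len(matched), 4)
month_recall = recall(len(matched), 5)
month_lead = mean_lead_time_days(matched)
print(month_precision, month_recall, month_lead)
```

Precision and recall pull in opposite directions: issuing more forecasts tends to catch more events but also produces more false alarms, which is why a scorecard would track both.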

EMBERS scored at or above target in most of the categories, forecasting events with a mean lead time of 7.54 days. Of the three teams awarded an initial contract, DAC was the only team to secure a contract for the third and final year of funding. (DAC expects to secure funding to continue its forecasting work.)

Jason Matheny, OSI program manager at IARPA, said DAC's team has "been able to accurately forecast hundreds of societal events, days to weeks before they occur, with a low false-alarm rate."

DAC has widened its focus from Latin America to the Middle East and North Africa. Since June 2014, EMBERS has been sifting through information gathered from seven countries in the region: Bahrain, Egypt, Iraq, Jordan, Libya, Saudi Arabia, and Syria.

Because of the geographic change, Ramakrishnan and his team have had to adapt several models to the new region. DAC now has a Middle East expert on its team to help understand the complex linguistics, which vary between dialects and between written and spoken word, and the myriad cultural differences from country to country.

"In the Middle East, expression of discontent happens differently than in Latin America. You have to have a much better local understanding of how people voice concerns and how they communicate [their discontent], for instance," Ramakrishnan said.

While forecasting the future may sound fanciful, it holds a number of practical applications.

"Forecasting civil unrest is useful for people and groups as they make travel plans," Ramakrishnan said. "It also helps governments understand what people are frustrated about, know what the hot-ticket items are, and [decide] what they can do about it. It helps them understand what the citizens' priorities are. What are the most important grievances?"

Simulating disasters

Big data initiatives are not only leading the way in predicting the future; they are also being used to determine how to deal with it.

VBI's NDSSL created a simulation environment using big data methods to evaluate disaster preparedness policies and interventions.

Madhav Marathe, a VBI professor and NDSSL director; Christopher Barrett, a professor and VBI's executive director; and Stephen Eubank, a professor and NDSSL deputy director, led a large team that modeled human behavior using a combination of many data sources to simulate a nuclear detonation in Washington, D.C., depict the behavior of more than 730,000 simulated D.C. residents, and evaluate the emergency response.

NDSSL disaster resilience study

A simulated nuclear blast in Washington, D.C.


The light gray gradient indicates the radiation dosage from fallout. The bars indicate aggregate counts of individuals in different health states at the various locations.

1) NDSSL collected open-source information (census and infrastructure data, etc.) to create more than 730,000 synthetic individuals in a simulated infrastructure.

2) The model tracks behavior and how individuals interact with infrastructure. For instance, the availability of power affects a person's ability to communicate, the route traveled exposes a person to radiation and to the risk of injury, and health state determines a person's likely behaviors.

3) Decision-makers in public safety and other areas can use simulations to improve disaster resilience by taking proactive measures.
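The three steps above can be illustrated with a toy agent-based sketch. It is far simpler than NDSSL's simulation and is not based on its code: the `Person` fields, the zone names, the dose values, the injury threshold, and the behavior labels are all hypothetical, and the dose numbers are unitless placeholders.

```python
from dataclasses import dataclass

INJURY_DOSE = 1.0  # hypothetical, unitless threshold


@dataclass
class Person:
    location: str
    health: str = "healthy"       # "healthy" or "injured"
    dose: float = 0.0             # accumulated exposure (unitless placeholder)
    can_communicate: bool = True


def step(person, power_on, fallout_dose):
    """Advance one synthetic individual by one time step; return the behavior."""
    # Availability of power affects the ability to communicate.
    person.can_communicate = power_on.get(person.location, False)
    # The person's location determines radiation exposure.
    person.dose += fallout_dose.get(person.location, 0.0)
    if person.dose >= INJURY_DOSE:
        person.health = "injured"
    # Health state determines likely behavior.
    if person.health == "injured":
        return "seek_care"
    if person.can_communicate:
        return "coordinate_with_family"
    return "search_on_foot"


def simulate(people, power_on, fallout_dose, steps):
    """Return one list of behaviors per time step."""
    return [[step(p, power_on, fallout_dose) for p in people] for _ in range(steps)]


people = [Person("zone_a"), Person("zone_b"), Person("zone_c")]
power_on = {"zone_a": True, "zone_b": False, "zone_c": True}
fallout_dose = {"zone_a": 0.0, "zone_b": 0.1, "zone_c": 0.6}
history = simulate(people, power_on, fallout_dose, steps=2)
for t, behaviors in enumerate(history, start=1):
    print(t, behaviors)
```

Even in this toy version, switching the power on in `zone_b` changes that person's behavior from searching on foot to coordinating by phone, which is the kind of what-if question such simulations are built to explore.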

Using massive amounts of data, including the American Community Survey, tourism reports, transportation routes, cell-tower communication data, hospital registries, power-network data, and surveys of human behavior in disasters, the team generated synthetic individuals to gauge their likely motivations and reactions in the midst of the disaster.

"The event … allows us to collect information from varied sources and build a synthetic, but realistic representation of the event, as well as what I would call a physical world, the infrastructure world, and a social world," Marathe said. "All three worlds have to come together and be represented meaningfully to do the analysis because otherwise you're missing one of the three things."

Encompassing a 48-hour span in the midst of a nuclear disaster, the simulation produced several terabytes of data, the result of the unimaginably complex algorithms and computer modeling the team had created. Millions of simulated individuals were incorporated into a single, mineable dataset based on real-world information, Barrett said.

The team found that even a small increase in the ability to provide functional communication systems would allow people to do a substantially better job coordinating activities such as finding family members. Because people's first instinct in the wake of a disaster is to use their phones, communication systems tend to falter under the volume of texts and calls. Such findings allow the lab to provide decision-makers with better information.

"This is a really important finding, and this could not have been done in this particular form had we not put all the data together, filled in, made a consistent representation, taken the things forward, and then mined for nuggets within this," Marathe said.

Said Barrett, "Even though human behavior is a black box in a black box in a black box, we still can come very close to getting very rational, reasonably stated ways that you would expect people to move."

With the rapid pace of technological advances, information from big data simulations can be generated more quickly than ever. Marathe said the time it takes to run a simulation has decreased from a couple of days to mere minutes.

In addition to improved technology, Eubank attributes the growth of big data to the changes in the way society collects information.

"We had no idea that 20 years from when we started a transportation project that it would be commonplace for people to report their location on a minute-by-minute basis to the world," Eubank said.

Living in a data-driven world

Scientists and researchers working with big data foresee even more innovation on the horizon.

In fact, those like Ramakrishnan, Marathe, Barrett, and Eubank—who have made a habit of dealing with the future—see the future of big data happening at Tech.

"I think that Virginia Tech has provided us with an environment and ecosystem to carry out this research over the [past] 10 years which has been very, very conducive to do this and I certainly value this. Tech has been very supportive of our work," Marathe said. "It is very cool to have an institute that allows us to do things in a very novel and aggressive way."

Barrett sees their big data research as world-leading, explaining that Virginia Tech's approach to computationally enabled social science and the development of a synthetic information platform are conceptually different from anything else in the field.

Ramakrishnan also echoes the sentiment that Tech is at the forefront of big data research. "By creating DAC, we have brought together an interdisciplinary group of researchers from computer science, statistics, electrical and computer engineering, and mathematics. We have initiated graduate and undergraduate courses in this topic and hope to be a one-stop shop for the university and beyond in leading research and educational efforts in big data. The IARPA EMBERS project is an example of how DAC has led an interdisciplinary effort in this space, and we have just begun," he said.

As Virginia Tech's researchers continue to develop new uses for big data, the university has upgraded its computer systems to keep pace and ensure the capacity to house the collected information. Midkiff, the university's vice president for information technology, sees collections of big data as a chance to re-evaluate Virginia Tech's missions and operations.

The investment in big data initiatives in Blacksburg and in the National Capital Region allows for greater connections with industry partners while also making use of data to better serve society. "By improving the lives of people who actually produce social data, big data is more than just a passing trend," said Christopher Walker, DAC program manager.

The wave of research also is moving into classrooms as the university presents students with more opportunities to innovate. Many degree programs, including computer science, electrical engineering, and statistics, already include big data elements, and two interdisciplinary undergraduate degrees have been introduced.

"Virginia Tech is working to ensure that all of our graduates are prepared to thrive in a society that is data-driven and networked," Midkiff said.

No matter what the future holds, big data research has found a home at Tech.

Madeleine Gordon, a senior English and communication major, was an intern with Virginia Tech Magazine. Emily K. Alberts, formerly the Discovery Analytics Center's public relations and marketing specialist and now the Department of Engineering Education's office manager, contributed to this article.