Yahoo Releases Massive Data Set To Academic Institutions

The clicks and searches of 20 million anonymous Yahoo users could help researchers at academic institutions expand the boundaries of machine learning and deep learning.



Yahoo is releasing to the academic research community a massive machine learning dataset containing the surfing and search habits of 20 million anonymous users.

The dataset, which will only be made available to academic institutions, can be used by researchers for context-aware learning, large-scale learning algorithms, user-behavior modeling, and content enrichment. It can also validate recommender systems.

The collection is based on a sample of user interactions on the news feeds of several Yahoo properties. It comprises 110 billion lines of data charting users' interactions with news items.

The dataset includes information gathered from the Yahoo home page, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. It was collected by recording the user-news item interaction of about 20 million users from February 2015 to May 2015.

"Many academic researchers and data scientists don't have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies," Suju Rajan, director of research at Yahoo Labs, wrote in a Jan. 14 statement announcing the release of the data set. "We are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state of the art in machine learning and recommender systems."

The dataset is available as part of the Yahoo Labs Webscope data-sharing program, a reference library of datasets composed of anonymous user data for non-commercial use.

Yahoo is also releasing the title, summary, and key phrases of the pertinent news articles included in the dataset, and providing user demographic information such as age segment and gender.

The dataset also includes other information, such as the location in which the user is based. The interaction data is time-stamped with the user's local time and contains partial information about the device on which the user accessed the news feeds.

"The release of this large Yahoo News Feed dataset will be a tremendous asset for the academic research community, and for us at UMass particularly, given our major research activities in natural language processing, information retrieval, databases and computational social science," wrote Andrew McCallum, director of the UMass Amherst Center for Data Science and a professor in the College of Information and Computer Sciences, in Yahoo's statement.

Yahoo's announcement reflects a recent push to advance machine learning and deep learning, in which computers use massive amounts of data to make predictions or better understand population sets.

In December, Facebook announced that it would open source its latest artificial intelligence (AI) server designs. Codenamed Big Sur, the server is designed specifically to train the newest class of AI algorithms that mimic the neural pathways found in the human brain. These algorithms are collectively called deep learning.

"No matter how much talent you have, there is always more on a manager's bucket list," Andrew Moore, Dean of the School of Computer Science at Carnegie Mellon University, told The Wall Street Journal. "No one in these big technology companies feels like they have enough people to do the things they want to do."

Nathan Eddy is a freelance writer for InformationWeek. He has written for Popular Mechanics, Sales & Marketing Management Magazine, FierceMarkets, and CRN, among others. In 2012 he made his first documentary film, The Absent Column. He currently lives in Berlin.
