Online music developers and Pandora and Last.fm lovers, take note. The next phase in research on smart music recommendation systems is underway, facilitated by a Million Song Dataset just released by music application company The Echo Nest.

The dataset is a "freely-available collection of audio features and metadata for a million contemporary popular music tracks," being analyzed by Columbia University's Lab ROSA, aka the Laboratory for the Recognition and Organization of Speech and Audio. And the Holy Grail is to use this treasure trove to develop a new generation of Music Information Retrieval services—venues that pay attention to what you are listening to, analyze the components, and offer up new songs and compositions that you'll like.
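The core idea behind such a service can be sketched in a few lines: represent each track as a vector of audio features, then rank the catalog by similarity to what the listener just played. The feature names and values below are hypothetical stand-ins, not fields from the actual dataset; this is a minimal illustration of content-based recommendation, not any particular service's method.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(listened, catalog, top_n=2):
    """Rank catalog tracks by similarity to the track just played."""
    ranked = sorted(catalog.items(),
                    key=lambda item: cosine_similarity(listened, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]

# Hypothetical per-track features (tempo, loudness, energy), pre-scaled to [0, 1].
catalog = {
    "Track A": (0.90, 0.80, 0.85),
    "Track B": (0.30, 0.20, 0.25),
    "Track C": (0.88, 0.75, 0.90),
}
print(recommend((0.92, 0.81, 0.88), catalog))  # → ['Track A', 'Track C']
```

Real systems replace the hand-picked vectors with features extracted by machine analysis of the audio itself, which is exactly the kind of work the new dataset is meant to support.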

And that "freely available" bit means that you can download the dataset here.

Novel applications

Why is this windfall suddenly at everyone's disposal? A critical mass of Internet music industry folk and academic researchers say it's needed to get to the next level in online audio services.

Right now, a big service like Pandora relies on trained musicologists to build and expand its vast music genome—a database of characteristics that the company taps into to decide what to suggest for you next. But for years Lab ROSA has been experimenting with ways to analyze music tracks via computer systems, or "machine learning."

The National Science Foundation-funded undertaking is called the Listening Machine Project. Launched eight years ago, the Project's main obsession was analyzing "the individual sources present in a real-world sound recording"—something at which humans are skilled, machines less adept.

"Almost without exception, sounds of interest are embedded in a context of competing sounds, and it is rare to be given an unobstructed view of an ideal, isolated target," noted LMP's Dan Ellis, Associate Professor of Electrical Engineering at Columbia, in the project's original proposal to the NSF.

"Human listeners, in common with other auditorily-equipped animals, are adept at handling such mixed signals, but our best computational audition systems—for instance automatic speech recognizers—are highly vulnerable to added interference, even at levels that listeners barely notice."

The ability to analyze sounds on an unprecedented scale could facilitate better perception systems for robots, Ellis argued, new prosthetic devices for people with hearing difficulties, and "a wide range of novel applications in content-based multimedia indexing."

Robots need more music

The impediment to the last goal has always been the relatively small scale of most music characteristic databases.

"One of the long-standing criticisms of academic music information research from our colleagues in the commercial sphere is that the ideas and techniques we develop simply aren't practical for real services," Ellis observed in a recent blog post, "which must offer hundreds of thousands of tracks at a minimum."

A variety of factors have made a bigger database difficult to obtain, Ellis added, most notably the "well-known antagonistic stance of the recording industry to the digital sharing of their data," which "seems to doom any effort to share large music data collections." And of course the expense of such a collection has always held researchers back.

But the need for a mega-sized dataset has become increasingly obvious. It will help reveal problems with algorithms that don't turn up in small sets, but which often manifest themselves in much larger scale applications. And a single, open dataset will allow many researchers to compare their results.

And so the million-song database is welcome news. The Echo Nest makes applications for a wide variety of music services. These include the BBC Music Showcase, the iTunes-scanning Pocket Hipster, SXSW picks ("an application that recommends bands to see at SXSW, based on users' tastes"), and over 160 other apps.

"We're hoping that this will not only give MIR researchers plenty of data to work with, but also strengthen the connection between academic research and commercial development," Echo Nest says in its post on the dataset.

Getting data and code

As already noted, the Million Song Dataset is open to the public, but be forewarned—it's 300GB in size. What you are downloading is not audio, but the derived features of a million tracks in database table form. To test out your project against actual audio, sample clips can be fetched from 7digital using code that Lab ROSA provides.

So you might want to take a look at a subset of all that data first—around 10,000 songs "for a quick taste," Lab ROSA suggests.
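The dataset ships as one file per track, nested in subdirectories, so a first step with either the full collection or the 10,000-song subset is simply walking the tree to enumerate tracks. The sketch below assumes files carry the dataset's `.h5` (HDF5) extension and uses a hypothetical root path; actually reading the features inside each file requires an HDF5 library such as h5py, or the getter code Lab ROSA distributes.

```python
import os

def iter_track_files(root):
    """Yield paths to per-track .h5 files found anywhere under the dataset root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".h5"):
                yield os.path.join(dirpath, name)

# Hypothetical usage; the path is an assumption, and parsing each file's
# contents would need an HDF5 reader on top of this:
# for path in iter_track_files("/data/MillionSongSubset"):
#     print(path)
```

Counting the yielded paths is a quick sanity check that a download unpacked completely before any analysis begins.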

"Our hope is that the Million Song Dataset becomes the natural choice for researchers wanting to try out ideas and algorithms on data that is standardized, easily obtained, and relevant to both academia and industry," Ellis' blog post concludes.

"But for all this to come true, we need lots of people to start using the data."