Valentin Kuznetsov just presented a paper at the International Conference on Computational Science on CERN’s use of MongoDB for Large Hadron Collider data. The paper, “The CMS Data Aggregation System”, is available as a PDF at ScienceDirect.

A summary

“CMS” stands for Compact Muon Solenoid, a general-purpose particle physics detector built on the Large Hadron Collider. The CMS project posted a few comics which provide a nice, simple (if somewhat cheesy) explanation of what the CMS/LHC does.

The LHC generates massive amounts of data of many different kinds, which are distributed across a worldwide grid. Status messages go to some of the computers, job monitoring info to others, bookkeeping info to still others, and so on.

This means that each location can run specialized queries on the data it holds, but up until now it’s been very difficult to query across the whole grid. Enter the Data Aggregation System, designed to allow queries across all of the machines.

How it works

The aggregation system uses MongoDB as a cache. It checks whether Mongo already has the aggregation the user is asking for. If it does, the system returns it; otherwise, the system performs the aggregation and saves the result to Mongo.
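The check-then-fill pattern described above can be sketched in a few lines of Python. This is only an illustration, not the CMS code: `get_aggregation`, `cache_key`, and `InMemoryCache` are hypothetical names, and the in-memory class stands in for a MongoDB collection by mimicking the `find_one` and `insert_one` methods that a pymongo collection provides.

```python
import json


class InMemoryCache:
    """Hypothetical stand-in for a MongoDB collection (find_one / insert_one only)."""

    def __init__(self):
        self._docs = {}

    def find_one(self, filter):
        # Look up a single document by its _id, or return None if absent.
        return self._docs.get(filter["_id"])

    def insert_one(self, document):
        self._docs[document["_id"]] = document


def cache_key(query):
    # Assumption: a query is identified by its canonical JSON form.
    return json.dumps(query, sort_keys=True)


def get_aggregation(cache, query, aggregate):
    """Return the cached result for query, computing and storing it on a miss."""
    key = cache_key(query)
    cached = cache.find_one({"_id": key})
    if cached is not None:
        return cached["result"]
    result = aggregate(query)  # the expensive step: ask the grid
    cache.insert_one({"_id": key, "result": result})
    return result
```

With a real deployment, a pymongo collection could presumably be passed in place of `InMemoryCache`, since only those two methods are used.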

They query the system using a simple, SQL-like language, which the system transforms into a MongoDB query. So, something like file="abc", run>10 becomes {"file" : "abc", "run" : {"$gt" : 10}}. (It’s not the same as SQL, but the code for this might be interesting to people who want to use SQL queries with MongoDB.)
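To make the translation concrete, here is a minimal sketch of how such a comma-separated expression could be turned into a MongoDB query document. It is not the actual DAS parser: the function names and the supported operator set are assumptions, and the sketch does not handle commas inside quoted strings or a field that mixes equality with a comparison.

```python
import re

# Comparison operators mapped to MongoDB query operators.
_OPERATORS = {">=": "$gte", "<=": "$lte", "!=": "$ne", ">": "$gt", "<": "$lt"}

# field, operator, value; two-character operators are tried first.
_CONDITION = re.compile(r'^\s*(\w+)\s*(>=|<=|!=|=|>|<)\s*(.+?)\s*$')


def parse_value(text):
    """Turn a quoted string into str, anything else into int or float."""
    if len(text) >= 2 and text[0] == text[-1] == '"':
        return text[1:-1]
    try:
        return int(text)
    except ValueError:
        return float(text)


def to_mongo_query(expression):
    """Translate e.g. 'file="abc", run>10' into a MongoDB query dict."""
    query = {}
    for condition in expression.split(","):
        match = _CONDITION.match(condition)
        if match is None:
            raise ValueError("cannot parse condition: %r" % condition)
        field, op, raw = match.groups()
        value = parse_value(raw)
        if op == "=":
            query[field] = value
        else:
            # Several comparisons on one field share a sub-document.
            query.setdefault(field, {})[_OPERATORS[op]] = value
    return query


print(to_mongo_query('file="abc", run>10'))
```

The resulting dict is in the form that the Python driver accepts as a query filter.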

If the cache does not contain results for the requested query, the system iterates over all of the places in the world that could have this information and queries them, gathering their results. It then merges all of the results, doing a sort of “group by” operation based on a predefined identifying key, and inserts the aggregated information into the cache.
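The fan-out and merge step might look roughly like the sketch below. It assumes each data service can be modelled as a callable that takes the query and returns a list of record dicts; `aggregate_from_services` and the sample services are hypothetical, and the merge rule (later services overwrite earlier ones on conflicting fields) is a simplification chosen for the example.

```python
from collections import defaultdict


def aggregate_from_services(query, services, key):
    """Query every service and merge records that share the same key value."""
    merged = defaultdict(dict)
    for service in services:
        for record in service(query):
            # "Group by": records with the same key collapse into one document.
            merged[record[key]].update(record)
    return list(merged.values())


# Hypothetical services, each holding a different slice of the data.
def bookkeeping(query):
    return [{"file": "abc", "run": 11}]


def monitoring(query):
    return [{"file": "abc", "status": "ok"}, {"file": "xyz", "status": "failed"}]


for doc in aggregate_from_services({}, [bookkeeping, monitoring], key="file"):
    print(doc)
```

The merged documents are what would then be written to the cache, so the next identical query can be answered without contacting the services again.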

It was built using the Python driver.

Goals

They’re looking forward to field testing it and horizontally scaling the system with sharding. As this is a general grid aggregation/querying tool, they’re also interested in applying it to problems outside of the LHC and CERN.

We wish them luck and hope they’ll keep us informed on future progress!

Edit: the slides from Valentin’s presentation are available at http://www.slideshare.net/vkuznet/das-iccs-2010.

Kristina Chodorow maintains the MongoDB PHP and Perl drivers. She blogs at kchodorow.com/blog and tweets as @kchodorow.