A little while back I wrote a short post about some research that some colleagues and I did using “open data” from the Large Hadron Collider [LHC]. We used data made public by the CMS experimental collaboration, about 1% of their current data, to search for a new particle, using a couple of twists (as proposed over 10 years ago) on a standard technique. (CMS is one of the two general-purpose particle detectors at the LHC; the other is called ATLAS.) We had two motivations: (1) even if we didn’t find a new particle, we wanted to prove that our search method was effective; and (2) we wanted to stress-test the CMS Open Data framework, to ensure that it really does provide all the information needed for a search for something unknown.

Recently I discussed (1), and today I want to address (2): to convey why open data from the LHC is useful but controversial, and why we felt it was important, as theoretical physicists (i.e. people who perform particle physics calculations, but do not build and run the actual experiments), to do something with it that is usually the purview of experimenters.

The Importance of Archiving Data

In many subfields of physics and astronomy, data from experiments is made public as a matter of routine. Usually this occurs after a substantial delay, to allow the experimenters who collected the data to analyze it first for major discoveries. That’s as it should be: the experimenters spent years of their lives proposing, building and testing the experiment, and they deserve an uninterrupted opportunity to investigate its data. To force them to release data immediately would create a terrible disincentive for anyone to do all the hard work!

Data from particle physics colliders, however, has not historically been made public. More worrying, it has rarely been archived in a form that is easy for others to use at a later date. I’m not the right person to tell you the history of this situation, but I can give you a sense for why this still happens today.

The fundamental issue is the complexity of data sets from colliders, especially from hadron colliders such as the Tevatron and the LHC. (Archiving was partly done for LEP, a simpler collider, and was used in later studies including this search for unusual Higgs decays and this controversial observation, also discussed here.) What “complexity” are we talking about? Collisions of protons and/or anti-protons are intrinsically complicated; particles of all sorts go flying in all directions. The general-purpose particle detectors ATLAS and CMS have a complex shape and aren’t uniform. (Here’s a cutaway image showing CMS as a set of nested almost-cylinders. Note also that there are inherent weak points: places where cooling tubes have to run, where bundles of wires have to bring signals in and out of the machine, and where segments of the detector join together.) Meanwhile the interactions of the particles with the detector’s material are messy and often subtle (here’s a significantly oversimplified view). Not every particle is detected, and the probability of missing one depends on where it passes through the detector and what type of particle it is.

Even more important, 99.999% of ATLAS and CMS data is discarded as it comes in; only data that passes a set of filters, collectively called the “trigger,” will even be stored. The trigger is adjusted regularly as experimental conditions change. If you don’t understand these filters in detail, you can’t understand the data. Meanwhile the strategies for processing the raw data change over time, becoming more sophisticated, and introducing their own issues that must be known and managed.
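To make the point about filters concrete, here is a toy sketch in Python. It is not the CMS trigger, which combines many conditions in hardware and software; the `Event` class, the single transverse-momentum threshold, and all of the numbers are invented for illustration. It only shows that the same collisions leave a different stored data set when the threshold is adjusted.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A toy collision event (hypothetical structure, not a CMS format)."""
    event_id: int
    leading_muon_pt: float  # transverse momentum in GeV, invented values


def passes_trigger(event: Event, pt_threshold: float) -> bool:
    """Keep an event only if its leading muon is above the threshold."""
    return event.leading_muon_pt >= pt_threshold


def apply_trigger(events, pt_threshold):
    """Return the events that would be stored; the rest are lost for good."""
    return [e for e in events if passes_trigger(e, pt_threshold)]


# Invented transverse momenta for six collisions.
events = [Event(i, pt) for i, pt in enumerate([3.0, 8.5, 12.0, 26.0, 31.5, 47.0])]

# The same collisions, recorded under two hypothetical trigger settings.
early_run = apply_trigger(events, pt_threshold=10.0)
later_run = apply_trigger(events, pt_threshold=30.0)

print(len(early_run), len(later_run))
```

Anyone comparing the two stored samples without knowing that the threshold had moved might read the missing low-momentum events as a physical effect rather than a change in the filter.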

I could easily go on (did I mention that at the LHC dozens of collisions occur simultaneously?). If, when you explore the data, you fail to account for all these issues, you can mistake a glitch for a new physical effect, or fail to observe a new physical effect because a glitch obscured it. Any experimentalist inside the collaborations is aware of most of these subtleties, and is surrounded by other experts who will be quick to complain if he or she forgets to account for one of them. That’s why it’s rare for the experimenters to report a result that has this type of error embedded in it.

Now, imagine writing a handbook that would encapsulate all of that combined knowledge, for use by people who will someday analyze the data without having access to that collective human library. This handbook would accompany an enormous data set taken in changing conditions, and would need to contain everything a person could possibly need to know in order to properly analyze data from an LHC experiment without making a technical error.

Not easy! But this is what the Open Data project at CERN, in which CMS is one of the participating experiments, aims to do. Because it’s extremely difficult, and therefore expensive in personnel and time, its value has to be clear.

I personally do think the value is clear, especially at the LHC. Until the last couple of decades, one could argue that data from an old particle physics experiment would go out of date so quickly, superseded by better experiments, that it really wasn’t needed. But this argument has broken down as experiments have become more expensive, with new ones less frequent. There is no guarantee, for instance, that any machine superseding the LHC will be built during my lifetime; it is a minimum of 20 and perhaps 40 years away. In all that time, the LHC’s data will be the state of the art in proton-proton collider physics, so it ought to be stored so that experts can use it 25 years from now. The price for making that possible has to be paid.

[This price was not paid for the Tevatron, whose data, which will remain the gold standard for proton-antiproton collisions for perhaps a century or more, is not well-archived.]

Was Using Open Data Necessary For Our Project?

Even if we all agree that it’s important to archive LHC data so that it can be used by future experimental physicists, it’s not obvious that today’s theorists should use it. There’s an alternative: a theorist with a particular idea can temporarily join one of the experimental collaborations, and carry out the research with like-minded experimental colleagues. In principle, this is a much better way to do things: it permits access to the full data set, it allows the expert experimentalists to handle and manage the data instead of amateurs like us, and it should lead to state-of-the-art results.

I haven’t found this approach to work. I’ve been recommending the use of our technique [selecting events where the transverse momentum of the muon and antimuon pair is large, and often dropping isolation requirements] for over ten years, along with several related techniques. These remarks appear in papers; I’ve mentioned these issues in many talks, discussed them in detail with at least two dozen experimentalists at ATLAS and CMS (including many colleagues at Harvard), and even started a preliminary project with an experimenter to study them. But everyone had a reason not to proceed. I was told, over and over again, “Don’t worry, we’ll get to this next year.” After a decade of this, I came to feel that perhaps it would be best if we carried out the analysis ourselves.
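For readers who want to see what the selection in brackets involves, here is a minimal sketch in Python. It is not our analysis code: the function names and the threshold `min_pair_pt` are hypothetical, and each muon is assumed to be described by its transverse momentum (in GeV), pseudorapidity and azimuthal angle. The sketch builds the four-momentum of each muon, adds the two, and keeps the event if the transverse momentum of the pair is large.

```python
import math

MUON_MASS = 0.105658  # GeV


def four_momentum(pt, eta, phi, mass=MUON_MASS):
    """Convert (pt, eta, phi) into Cartesian components (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz


def pair_kinematics(muon, antimuon):
    """Return (transverse momentum, invariant mass) of the dimuon pair."""
    e1, px1, py1, pz1 = four_momentum(*muon)
    e2, px2, py2, pz2 = four_momentum(*antimuon)
    e, px, py, pz = e1 + e2, px1 + px2, py1 + py2, pz1 + pz2
    pair_pt = math.hypot(px, py)
    mass_sq = e * e - (px * px + py * py + pz * pz)
    return pair_pt, math.sqrt(max(mass_sq, 0.0))


def select_event(muon, antimuon, min_pair_pt=25.0):
    """Keep the event if the pair's transverse momentum exceeds the cut.

    The default of 25 GeV is a placeholder, not the value used in our paper.
    """
    pair_pt, _ = pair_kinematics(muon, antimuon)
    return pair_pt > min_pair_pt


# Two invented events: a back-to-back pair and a nearly collinear pair.
back_to_back = ((45.0, 0.0, 0.0), (45.0, 0.0, math.pi))
collimated = ((30.0, 0.0, 0.0), (30.0, 0.0, 0.2))

print(select_event(*back_to_back), select_event(*collimated))
```

A pair produced nearly at rest in the transverse plane fails the cut, while a pair recoiling against something else, so that the two muons travel in similar directions, passes it.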

Even then, there was an alternative: we could have just done a study of our method using simulated data, and this would have proved the value of our technique. Why spend the huge amount of time and effort to do a detailed analysis, on a fraction of the real data?

First, I worried that a study on simulated data would be no more effective than all of the talks I gave and all the personal encouragement I offered over the previous ten years. I think seeing the study done for real has a lot more impact, because it shows explicitly how effective the technique is and how easily it is implemented. [Gosh, if even theorists can do it…]

Second, one of the things we did in our study is include “non-isolated muons” — muons that have other particles close by — which are normally not included in theorists’ studies. Dropping the usual isolation criteria may be essential for discoveries of hidden particles, as Kathryn Zurek and I have emphasized since 2006 (and studied in a paper with Han and Si, in 2007). I felt it was important to show this explicitly in our study. But we needed the real data to do this; simulation of the background sources for non-isolated muons would not have been accurate. [The experimenters rarely use non-isolated muons in the type of analysis we carried out, but notably have been doing so here; my impression is that they were unaware of our work from 2007 and came to this approach independently.]
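As an illustration of what an isolation requirement looks like in practice, here is a minimal Python sketch of one common definition, "relative isolation": add up the transverse momenta of the other particles inside a cone around the muon and divide by the muon's own transverse momentum. The cone size of 0.4, the cut of 0.15, the dictionary layout and the function names are assumptions chosen for illustration, not the definitions used by CMS or in our study.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance in the (eta, phi) plane, with phi wrapped to [-pi, pi]."""
    dphi = math.remainder(phi1 - phi2, 2.0 * math.pi)
    return math.hypot(eta1 - eta2, dphi)


def relative_isolation(muon, others, cone=0.4):
    """Sum of nearby particles' pt inside the cone, divided by the muon's pt."""
    nearby_pt = sum(
        p["pt"]
        for p in others
        if delta_r(muon["eta"], muon["phi"], p["eta"], p["phi"]) < cone
    )
    return nearby_pt / muon["pt"]


def is_isolated(muon, others, max_rel_iso=0.15, cone=0.4):
    """A muon counts as isolated if little activity surrounds it (assumed cut)."""
    return relative_isolation(muon, others, cone) < max_rel_iso


# Invented example: a 40 GeV muon with two nearby particles and one far away.
muon = {"pt": 40.0, "eta": 0.0, "phi": 0.0}
others = [
    {"pt": 2.0, "eta": 0.1, "phi": 0.1},    # inside the cone
    {"pt": 6.0, "eta": 0.0, "phi": -0.2},   # inside the cone
    {"pt": 10.0, "eta": 1.0, "phi": 2.0},   # well outside the cone
]

print(relative_isolation(muon, others), is_isolated(muon, others))
```

A standard analysis would discard this muon; dropping the isolation requirement means keeping it, which is why the backgrounds from ordinary processes become harder to model.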

Stress Testing the Archive

A further benefit to using the real data was that we stress-tested the archiving procedure in the Open Data project, and to do this fully, we had to carry out our analysis to the very end. The success or failure of our analysis was a test of whether the CMS Open Data framework truly provides all the information needed to do a complete search for something unknown.

The test received a passing grade, with qualifications. Not only did we complete the project, but we were also able to repeat a rather precise measurement of the (well-known) cross-section for Z boson production, which would have failed if the archive and the accompanying information had been unusable. That said, there is room for improvement: small things were missing, including some calibration information and some simulated data. The biggest issue is perhaps the format for the data storage (difficult to use and unpack for a typical user).
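To see why a cross-section measurement tests the archive so thoroughly, recall the textbook counting formula: the cross-section is the number of observed events minus the estimated background, divided by the product of the selection efficiency and the integrated luminosity. Every factor depends on information the archive must supply. The Python sketch below is a simplified version of that formula, not our actual procedure; the function name and all the numbers are invented, and it includes only the Poisson uncertainty on the observed count.

```python
import math


def cross_section(n_observed, n_background, efficiency, luminosity):
    """Counting-experiment estimate of a cross-section.

    sigma = (N_obs - N_bkg) / (efficiency * luminosity)

    With luminosity in inverse picobarns the result is in picobarns.
    Returns (sigma, statistical uncertainty from sqrt(N_obs) only).
    """
    if efficiency <= 0 or luminosity <= 0:
        raise ValueError("efficiency and luminosity must be positive")
    scale = efficiency * luminosity
    sigma = (n_observed - n_background) / scale
    stat_error = math.sqrt(n_observed) / scale
    return sigma, stat_error


# Invented inputs, chosen only to make the arithmetic easy to follow.
sigma, stat_error = cross_section(
    n_observed=1100, n_background=100, efficiency=0.5, luminosity=2.0
)
print(sigma, stat_error)
```

If the trigger information, the calibrations or the luminosity records were missing or wrong, the efficiency or the luminosity in the denominator would be off, and the result would disagree with the well-known value.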

It’s important to recognize that the persons in charge of Open Data at CMS have a huge and difficult job; they have to figure out how to write the nearly impossible handbook I referred to above. It’s therefore crucial that people like our group of theorists actually use the open data sets now, not just after the LHC is over. Now, when the open data sets are still small, is the time to figure out what information is missing, to determine how to improve the data storage, to fill out the documentation and make sure it has no gaps. We hope we’ve contributed something to that process.

The Future

Should others follow in our footsteps? Yes, I think, though not lightly. In our case, five experts required two years to do the simplest possible study; we could have done it in one year if we’d been more efficient, but probably not much less. Do not underestimate what this takes, both in terms of understanding the data and learning how to do statistical analysis that most people rarely undertake.

But there are studies that simply cannot be done without real data, and unless you can convince an experimentalist to work with you, your only choice may be to dive in and do your best. And if you are already somewhat informed, but want to learn more about how real experimental analysis is done, so you can appreciate more fully what is typically hidden from view, you will not find a better self-training ground. If you want to take it on, I suggest, as an initial test, that you try to replicate our measurement of the Z boson cross-section. If you can’t, you’re not ready for anything else.

I should emphasize that Open Data is a resource that can be used in other ways, and several groups have already done this. In addition to detailed studies of jets carried out on the real data by my collaborators, led by Professor Jesse Thaler, there have been studies that have relied solely on the archive of simulated data also provided by the CMS Open Data project. These have value too, in that they offer proving grounds for techniques to be applied later to real data. Since exploratory studies of simulated data don’t require the extreme care that analysis of real data demands, there may be a lot of potential in this aspect of the Open Data project.

In the end, our research study, like most analyses, is just a drop in the huge bucket of information learned from the LHC. Its details should not obscure the larger question: how shall we, as a community, maintain the LHC data set so that it can continue to provide information across the decades? Maybe the Open Data project is the best approach. If so, how can we best support it? And if not, what is the alternative?