The LHC isn't simply the most powerful particle accelerator ever created. Handling the huge amounts of data it produces has required the creation of one of the biggest computer grids on the planet. The planning and testing of the compute facilities has been taking place for years, but it's only recently that the grid has had to deal with the output from actual collisions. How did it do? "From the IT perspective, we didn't notice when the beams came on," said CERN's Wolfgang von Rueden. "We had tested it with much higher throughput conditions."

Still, not everything is working quite according to plan. Von Rueden said that the initial expectations for the LHC's computing grid had anticipated lower network performance and a reliance on tape; instead, the network has made it easier to shuffle large data sets between compute centers, and the price and performance of hard drives have turned out better than expected. Von Rueden gave us a brief overview of the computing setup at CERN, what they've learned from putting everything in place for the LHC, and how some major companies are relying on CERN's experience to improve their products.

Moving data

Although CERN has some significant computing resources, they contribute only about 10-15 percent of the total CPU power dedicated to the data coming out of the LHC, and most of the work on-site is dedicated to cutting down on the data that has to be stored. The LHC produces far more data than we can possibly store, and most of the collisions produce a spray of mundane particles we're already aware of. A significant amount of computing power—about 2,000 CPUs at each of the four detectors—is dedicated to filtering the interesting collisions out of the background.

(One physicist compared this process to identifying traffic accidents when given footage of an intersection where the traffic light changes colors millions of times between incidents.)

The filtering reduces the flow of data from a petabyte (PB) a second to a gigabyte per second, which is then transferred from the detectors to the main compute facility via a dedicated 10Gbps connection. Once there, it needs to be stored, and von Rueden told us that the initial plan had been to use tape for that; right now, they have about 50PB of tape storage, handled by a set of robotic storage hardware. Still, they've been finding that disk storage is working well, and have scaled that up to 20PB worth of storage.
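The scale of that filtering step is easier to appreciate with a quick back-of-the-envelope calculation. The figures below are the round numbers quoted in the article, not official CERN specifications:

```python
# Rough arithmetic on the LHC trigger chain, using the round numbers
# from the article (illustrative, not official CERN figures).

RAW_RATE_BYTES = 1e15      # ~1 PB/s produced at the detectors
STORED_RATE_BYTES = 1e9    # ~1 GB/s kept after filtering
LINK_CAPACITY_BPS = 10e9   # dedicated 10 Gbps link to the compute facility

# How aggressively the trigger discards data
reduction_factor = RAW_RATE_BYTES / STORED_RATE_BYTES

# How full the dedicated link runs at the filtered rate (bytes -> bits)
link_utilization = (STORED_RATE_BYTES * 8) / LINK_CAPACITY_BPS

print(f"Trigger reduction factor: {reduction_factor:,.0f}x")
print(f"10 Gbps link utilization: {link_utilization:.0%}")
# Trigger reduction factor: 1,000,000x
# 10 Gbps link utilization: 80%
```

In other words, the filtering farms throw away roughly a million bytes for every byte that's kept, and even the surviving gigabyte per second nearly saturates a 10Gbps link.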

The disks are managed as a cloud service, with arrays of drives hooked up to Linux boxes in JBOD (Just a Bunch Of Disks) mode. Each of these boxes gets a 1Gbps connection, and any CPU in the storage cluster can read data from any of the disks.

One of the reasons for the increased reliance on disks is the network that connects the global grid to CERN. "Because the networking is going so well, filling the pipes can outrun tapes," von Rueden told Ars. Right now, that network is operating at 10 times its planned capacity, with 11 dedicated connections operating at 10Gbps, and another two held in reserve. Each connection goes to one of a series of what are called Tier 1 sites, where the data is replicated and distributed to Tier 2 sites for analysis. Von Rueden said that the fiber that powers this setup has been "faster, cheaper, and more reliable than in planning."
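The aggregate capacity of those Tier 1 links is what makes shipping whole datasets practical. A quick sketch, using the link counts from the article and a hypothetical 1PB dataset:

```python
# Aggregate capacity of the Tier 1 links described in the article
# (11 active 10 Gbps connections; the 1 PB dataset is hypothetical).

ACTIVE_LINKS = 11
LINK_BPS = 10e9           # 10 Gbps per dedicated connection
DATASET_BYTES = 1e15      # a hypothetical 1 PB dataset

aggregate_bps = ACTIVE_LINKS * LINK_BPS
transfer_seconds = DATASET_BYTES * 8 / aggregate_bps  # bytes -> bits

print(f"Aggregate capacity: {aggregate_bps / 1e9:.0f} Gbps")
print(f"Time to ship 1 PB across all links: {transfer_seconds / 3600:.1f} hours")
# Aggregate capacity: 110 Gbps
# Time to ship 1 PB across all links: 20.2 hours
```

At 110Gbps in aggregate, a petabyte can cross the network in under a day, which helps explain why filling the pipes can outrun the tape robots.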

The original plan had been that each of the Tier 2 sites would keep a specific subset of the LHC data, and analysis jobs (which, being code, should be relatively compact) would be sent across the network to wherever the data resides. Instead, it's turned out that the network performs so well that the data can be streamed anywhere on the grid in real time, which has made things significantly more flexible.

Supporting users and companies

Although the LHC's computing needs are fairly unique, CERN's IT staff faces issues that would sound familiar to anyone in corporate IT. Its datacenter is 35 years old, and is now hosting clusters in a space that was intended to support supercomputers. Von Rueden said that density is becoming a problem, and the building isn't designed to take advantage of environmental services or do anything with its waste heat. Although there are some fixes that can be applied—several racks of critical equipment in the datacenter are now water cooled—most of the focus for CERN has been on getting the most performance per watt used for computing.

In recent years, improvements in that area have come fast enough that CERN no longer bothers with hardware support contracts longer than three years. Machines just get run until they're dead; the cost of the replacement, and the power savings replacement hardware brings, means it doesn't make economic sense to keep the hardware going.
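The logic behind that policy can be sketched with a toy performance-per-watt comparison. All of the figures below are made-up assumptions for illustration; they aren't CERN's actual numbers:

```python
# Toy model of the "run it until it dies" policy: newer hardware does
# enough more work per watt that extended support contracts for old boxes
# stop making sense. All figures here are invented for illustration.

OLD = {"watts": 400, "jobs_per_hour": 100}   # assumed aging server
NEW = {"watts": 350, "jobs_per_hour": 250}   # assumed replacement

def jobs_per_watt(box):
    """Throughput delivered per watt of power drawn."""
    return box["jobs_per_hour"] / box["watts"]

improvement = jobs_per_watt(NEW) / jobs_per_watt(OLD)
print(f"Old box: {jobs_per_watt(OLD):.2f} jobs/hour/W")
print(f"New box: {jobs_per_watt(NEW):.2f} jobs/hour/W")
print(f"Replacement delivers {improvement:.1f}x the work per watt")
```

Under these (hypothetical) assumptions the replacement does nearly three times the work per watt, which is the kind of gap that makes nursing old hardware along uneconomical.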

Beyond the hardware, CERN has to provide support to a very diverse user community, as physicists generally come to the site with their own laptop, and many of them run what von Rueden called "non-mainstream platforms." It runs a huge wireless network, and has to provide services for visitors, from members of the press to dignitaries. Right now, CERN servers support over 20,000 e-mail users, and provide remote access to files and the internal network for many users.

Combined with the LHC's unique computing needs, all of this makes CERN an excellent testbed for many technology companies. CERN runs a number of informal collaborations with industry, but for its IT problems, companies are encouraged to take part in the openlab program, which von Rueden runs. Right now, HP, Intel, Oracle, and Siemens all have openlab projects, in which they pay for dedicated staff at CERN and coordinate their work with their own employees.

Siemens, for example, is trying to learn how to better deploy and manage computerized control systems, while HP's ProCurve group is trying to learn from CERN's ability to manage a heterogeneous and relatively open computer network. At the level of user and guest networking, CERN is fairly liberal about the hardware it lets onto the system, which means it has to have software that can detect unusual behavior and disconnect any devices suffering from it; HP is looking at building some of these network management features into its products.
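The kind of behavior-based policing described above can be sketched very simply: compare each device's traffic against its peers and quarantine the outliers. The rule and the data below are invented for illustration; real network-management gear is considerably more sophisticated:

```python
# Minimal sketch of behavior-based network policing: flag any device
# whose traffic is wildly out of line with its peers. The threshold rule
# and the sample data are invented for illustration only.

from statistics import median

def flag_anomalous(traffic_mb, factor=10):
    """Return devices moving more than `factor` times the median traffic."""
    cutoff = factor * median(traffic_mb.values())
    return [dev for dev, mb in traffic_mb.items() if mb > cutoff]

# Hourly traffic per device, in MB (hypothetical guest-network snapshot)
observed = {
    "laptop-01": 120,
    "laptop-02": 95,
    "laptop-03": 110,
    "laptop-04": 130,
    "laptop-05": 9500,  # a misbehaving host
}
print(flag_anomalous(observed))  # ['laptop-05']
```

A median-based cutoff is used here rather than a mean, since a single runaway host would drag the mean (and any standard-deviation threshold) up along with it.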

CERN stores metadata for the collision events in an Oracle database, and a major requirement here—rapid synchronization across a global network—has caused Oracle to launch an openlab project in which it tests various streaming and compression schemes to improve performance over the network. Finally, Intel tests compiler technology using some of the high-performance computing tasks required at CERN, tests its processors under real-world loads, and sponsors people who are converting some physics workloads and code libraries to support the multithreading needed to take advantage of the current generation of processors.
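The appeal of compressing metadata before replication is easy to demonstrate: structured event records tend to be highly repetitive, so they compress well and spend correspondingly less time on the wire. The sample payload below is invented; real event metadata is of course more varied, and the article doesn't say which schemes Oracle is testing:

```python
# Illustrative look at the compression-vs-bandwidth trade-off behind
# replicating event metadata: repetitive structured records compress well,
# cutting wire time. The sample records are invented for illustration.

import json
import zlib

# Fake, repetitive "event metadata" records
records = [
    {"run": 1234, "event": i, "detector": "ATLAS", "quality": "good"}
    for i in range(10_000)
]
payload = json.dumps(records).encode()

compressed = zlib.compress(payload, level=6)
ratio = len(payload) / len(compressed)

print(f"Raw: {len(payload):,} bytes, compressed: {len(compressed):,} bytes")
print(f"Compression ratio: {ratio:.1f}x")
assert zlib.decompress(compressed) == payload  # lossless round trip
```

The actual ratio depends entirely on how repetitive the records are, but for structured metadata like this the payoff is typically large—and every byte saved is a byte that doesn't compete for the transatlantic links.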

Most companies won't face issues on the same scale that CERN does (or be lucky enough to have their plans outperform expectations). But there are a lot of companies that will likely benefit from some of the technology being developed there. CERN's own users will as well, but they tend to be a picky bunch. "Whatever you can give a physicist," von Rueden said, "he wants two times as much."

Listing image: A robot manages some of CERN's 50PB of tape storage.