BP Fires Up 2.2 Petaflops Cluster for Oil Exploration

There is an arms race in the oil and gas industry, and the weapon of choice is a server cluster.

Energy industry giant BP has opened the doors on a new datacenter in Houston that it says houses the "world's largest supercomputer for commercial research," weighing in at more than 2.2 petaflops. The system, which has not been given a nickname, is part of a five-year, $100 million cluster investment program at BP.

This is a boast that French oil and gas rival Total Group will no doubt dispute, having started up its 2.3 petaflops "Pangea" ICE-X cluster, built by SGI, back in March at its Scientific and Technical Centre in Pau. That Pangea machine has 110,400 cores and 7 PB of storage capacity, and the plan calls for the performance of the machine to be doubled again in 2015, for a total contract value of $77.3 million.

BP announced that it would be building a new datacenter to house the machine back in December 2012, but its full feeds and speeds were not revealed at the time. What BP said back then was that the current machine on its Houston campus had a peak theoretical performance of 1.227 petaflops, and that the new machine would crest over 2 petaflops and would be equipped with 536 TB of memory and 23.5 PB of disk storage. The BP announcement from last December also said that the newer cluster would have "more than 67,000 CPUs," which would be a truly astounding number of processors.

The new system is primarily based on HP's Scalable System SL6500 server enclosures, which are vanity-free machines aimed at hyperscale cloud operators and HPC customers alike. The BP supercomputer has 2,912 HP ProLiant SL230s Gen8 server nodes, each with two eight-core "Sandy Bridge" Xeon E5-2600 v1 processors and 128 GB of memory. The cluster also has 50 DL580 rack server nodes, which are equipped with Xeon E7 "Westmere-EX" processors running at 2.3 GHz. These two sets of nodes appear to have been moved over from the old machine to the new one. Some 1,920 Dell nodes using older "Westmere-EP" Xeon 5600 processors that were part of the old machine have been retired, and BP has brought in 2,520 more ProLiant SL230s Gen8 nodes, only this time they are configured with ten-core "Ivy Bridge-EP" Xeon E5-2600 v2 processors running at 3 GHz; each node has 128 GB of memory. The entire cluster has over 1 PB of main memory, which is considerably more than the 536 TB planned last year. With the exception of the rack servers, which are presumably head compute or storage nodes, the SL nodes are all half-width tray servers that slide into an SL6500 enclosure, which holds eight nodes in 4U of rack space. The compute nodes have a total of 96,992 cores.
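The core count can be checked against the node figures quoted above. What follows is a minimal Python sketch, not anything BP or HP has published. It assumes that only the SL230s nodes count as compute nodes (the article gives no socket or core counts for the 50 DL580 rack servers) and, for the enclosure estimate, that every SL6500 enclosure is fully populated. The variable and function names are illustrative.

```python
# Node counts and cores per socket come from the article text.
# Assumption: the 50 DL580 rack servers are excluded, since their
# socket and core counts are not given.
SOCKETS_PER_NODE = 2      # each SL230s node has two processors
NODES_PER_ENCLOSURE = 8   # half-width trays per 4U SL6500 enclosure

partitions = {
    "sandy_bridge": {"nodes": 2912, "cores_per_socket": 8},
    "ivy_bridge": {"nodes": 2520, "cores_per_socket": 10},
}


def partition_cores(partition):
    """Cores in one partition: nodes x sockets per node x cores per socket."""
    return partition["nodes"] * SOCKETS_PER_NODE * partition["cores_per_socket"]


cores_by_partition = {name: partition_cores(p) for name, p in partitions.items()}
total_nodes = sum(p["nodes"] for p in partitions.values())
total_cores = sum(cores_by_partition.values())

# Ceiling division; assumes every enclosure is filled with eight nodes.
enclosures = -(-total_nodes // NODES_PER_ENCLOSURE)

print(cores_by_partition)   # {'sandy_bridge': 46592, 'ivy_bridge': 50400}
print(total_nodes)          # 5432
print(total_cores)          # 96992
print(enclosures)           # 679
```

The total of 96,992 matches the figure quoted for the compute nodes, which suggests the rack servers are indeed not included in that count.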

It is noteworthy that neither Total nor BP is using Nvidia Tesla GPU coprocessors or Intel Xeon Phi X86 coprocessors to speed up its simulations and models. Their application code does not, as yet, lend itself well to offloading to other kinds of processors.

BP was using Ethernet switches from Arista Networks to cluster the old machine together. Keith Gray, BP's manager of high performance computing, said in an email exchange that the new machine has Arista 7508E core switches with 40 Gb/sec pipes linking the racks together and 10 Gb/sec Ethernet top-of-rack switches linking the nodes together within each rack. The machines are fed data from Lustre-based clustered file systems from DataDirect Networks, with more than 11 PB of capacity and with Intel providing support for Lustre.

The new machine has SUSE Linux Enterprise Server 11 on its compute nodes, just like its predecessor.

The BP server cluster is installed at the Center for High Performance Computing, which is a three-story, 110,000 square foot building at BP's Westlake campus outside of Houston that has room to expand the cluster in the future. This facility has been designed to use 30 percent less power to cool the machines as they run, and is also built to withstand the strong storms that sometimes assault the Gulf Coast. The existing datacenter had peaked out on its power and cooling, BP said, which is why it decided to pour concrete on a new facility late last year.

Bootnote: EnterpriseTech caught up with Gray after the ribbon cutting at the new datacenter, and he gave us a little more information about the cluster and the applications that will run on it.

The cluster runs homegrown seismic imaging applications, mostly around migrations and noise attenuation, Gray explains. You look for oil underneath the ground by making the earth vibrate and then recording the vibrations and reflections of sound over a wide area. There are distortions as sound bounces around underneath the surface, and migration algorithms try to undo these distortions to get a better picture of the rock formations in the crust. Seismic noise attenuation similarly tries to scrub out noise from the surface, which interferes with the seismic signals created purposefully to probe the crust with sound, and so yields better resolution in the images of the underground rock formations. Together, these refined images make it more likely that oil and gas deposits will be seen.

When BP announced it was building a new datacenter and a larger cluster to go into it last year, Gray said that the seismic applications in use at BP could scale up to around 30,000 cores. It has been nearly a year, and now Gray says that BP has been able to push some of its key applications up as far as 40,000 cores. Not every application scales that far, by the way, but many of them scale quite well, he says.

The cluster uses Grid Engine as its workload scheduler and has a bunch of homegrown systems management tools that have been cobbled together from various open source projects.

Those two Arista 7508E aggregation switches are new, and they provide two 40 Gb/sec links out to each rack. These two switches are using multi-chassis link aggregation to make them look like one big virtual switch. There is still oversubscription on that pair of 40 Gb/sec links out to the rack, says Gray, but he adds that "it is quite adequate and it balances well with the storage systems we are able to deploy right now." The 7000 series top-of-rack switches from Arista have four 40 Gb/sec uplinks and 48 10 Gb/sec ports for linking to servers.
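To put a rough number on that oversubscription, here is a back-of-the-envelope Python sketch. It is an estimate under stated assumptions, not a figure from BP or Arista: it assumes one top-of-rack switch per rack, all 48 server ports populated and driven at line rate, and only the two 40 Gb/sec links Gray mentions in use as uplinks. The article does not say how many server ports are actually populated, and the function name is hypothetical.

```python
def oversubscription_ratio(server_ports, server_gbps, uplinks, uplink_gbps):
    """Ratio of server-facing bandwidth to uplink bandwidth on one switch."""
    downstream = server_ports * server_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


# Assumption: 48 x 10 Gb/sec server ports in use, two 40 Gb/sec uplinks active.
ratio_as_described = oversubscription_ratio(48, 10, 2, 40)

# For comparison: the same switch with all four 40 Gb/sec uplinks in use.
ratio_all_uplinks = oversubscription_ratio(48, 10, 4, 40)

print(ratio_as_described)  # 6.0
print(ratio_all_uplinks)   # 3.0
```

Under these assumptions the rack uplinks are oversubscribed by about 6:1, which would fall to 3:1 if all four uplink ports were lit. Racks with fewer populated server ports would see a lower ratio.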

BP has not adopted InfiniBand for its clustering. "For our codes, 10 Gb/sec Ethernet meets our requirements," says Gray. "Most of our need is around moving large blocks and getting the bandwidth. It is not quite as critical to have low latency for inter-process communication. The algorithms tend to do a fair amount of floating point work in between steps, and 10GE has been very acceptable and scalability has been very good. We will continue to pay attention to new technologies as they come out."

BP has not adopted Tesla GPU or Xeon Phi X86 coprocessors to try to boost the performance of its clusters. And Gray offers an explanation for this.

"We have some really good friends that have GPUs," Gray says of his peers in the energy sector. "At the moment, as far as I know, no one in oil and gas that I know of that is quite prepared for Intel Xeon Phi, but people are paying a lot of attention to it. The camp on GPUs is still interesting. For us, it is all around creating the business case for the creation of new capabilities. Some of our applications can deliver wall clock improvements with GPUs, but we have to look at the breadth of our application suite and look at the complexity of developing codes to use the GPUs. Researchers don't try to focus on every geophysical problem, but rather on what is the problem with this particular reservoir. They are going to be walking back and forth, collaborating with the geophysicists and geologists in the business unit, and they want to solve that problem as fast as they can. If we can give them tools to prototype ideas, that's what we are going to do. CUDA is a specialized language and it is quite appropriate for some things, but for us, we feel like it adds complexity to our research geophysicists' lives. And we have not developed a business case for deploying a large-scale GPU cluster."

That said, BP does have a test cluster with "brand new kit" that combines CPUs and GPUs, and it is being used to test new algorithms as well as parallel visualization projects.

As for future upgrades to the BP cluster, Gray says that the idea is to try to stay ahead of Moore's Law by expanding the cluster's size faster than an individual chip is increasing in raw performance. The upgrade cycle is driven by the CPU upgrade cycle, of course, and the plan is to do an upgrade every 12 to 18 months.