In this article, we are going to explain why Intel’s Skylake-SP PCIe changes have had an enormous impact on deep learning servers. There are several vendors on the market advertising Skylake-SP “single root” servers that are not, in fact, single root. With the Intel Xeon E5 generation, each CPU had a single PCIe root, so saying that all GPUs were attached to a single CPU was an easy way of determining whether a server was single root. That is no longer the case with Intel Xeon Scalable.

The world of AI/deep learning training servers is currently centered on NVIDIA GPU compute architectures. As deep learning training scales beyond a single GPU to many GPUs, PCIe topology becomes more important. This is especially so if you want to use NVIDIA’s P2P tools for GPU-to-GPU communication. Intel Xeon chips are still the standard for deep learning training servers. With the Intel Skylake-SP generation, Intel increased the number of PCIe lanes per CPU by eight. At the same time, the company’s new PCIe controller architecture made single root servers less feasible. Combined with CPU price increases, this means that Intel Xeon E5 V3/V4 systems are still widely sought after by customers building training servers.

We explored Single Root v Dual Root for Deep Learning GPU to GPU Systems previously, and now we are going to show why the new generation of CPUs creates a problem for the newest architectures.

Noticing Something is Different with Skylake-SP Single Root PCIe Servers

The Intel Xeon Scalable generation has been out for over a year, and with the Skylake-SP launch something strange happened: the number of single root deep learning training servers plummeted. In Silicon Valley, the Supermicro SYS-4028GR-TRT2 is perhaps one of the most popular deep learning training platforms we see in local data centers, yet its successor, the SYS-4029GP-TRT2, has been less popular. We covered the former in our DeepLearning11: 10x NVIDIA GTX 1080 Ti Single Root Deep Learning Server (Part 1) article.

When we looked at the next generation, we noticed that the “4029” Skylake-SP version did not advertise that all GPUs sit under a single PCIe root complex. Here are links to both:

Intel Xeon E5 V3/V4: Supermicro SYS-4028GR-TRT2

Intel Xeon Scalable: Supermicro SYS-4029GP-TRT2

Supermicro uses large riser platforms to host the PLX PCIe switches and the PCIe slots the GPUs plug into. Both the Intel Xeon E5-2600 V3/V4 and the Xeon Scalable versions utilize the Supermicro X10DRG-O-PCIE daughter board.

This daughter board takes CPU1 PCIe lanes and pipes them through the PEX PCIe switches. Each switch is a large and expensive 96-port PLX model. Each switch uses 16 of those ports as a backhaul to CPU1, leaving 80 ports for GPUs. With each GPU claiming 16 ports, each of the two PCIe switches can handle five GPUs. That topology means that in both cases all 10 GPUs are attached to the same CPU.
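To make the port budget concrete, here is a minimal sketch of the arithmetic using the figures above (the constants are just the numbers from the text, not a real switch configuration API):

```python
# Port budget for one 96-port PLX PCIe switch, using the figures above.
SWITCH_PORTS = 96    # total ports on the PLX switch
UPLINK_PORTS = 16    # x16 backhaul from the switch to CPU1
PORTS_PER_GPU = 16   # each GPU consumes a x16 link

gpus_per_switch = (SWITCH_PORTS - UPLINK_PORTS) // PORTS_PER_GPU
total_gpus = 2 * gpus_per_switch  # the riser carries two such switches

print(gpus_per_switch, total_gpus)  # 5 10
```

That is how the riser arrives at 10 GPUs behind a single CPU’s PCIe lanes.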

We knew why the SYS-4028GR-TRT2 is advertised as a single root system:

Conversely, the newer Skylake-based SYS-4029GP-TRT2 is not:

To be clear, single root PCIe complexes for 8-10 GPUs are a highly sought-after configuration for P2P, so the fact that the newer model omits this language is intriguing. Nor is this a Supermicro-specific or 10x GPU-specific concern. In our Tyan Thunder HX GA88-B5631 Server Review 4x GPU in 1U we ran into the same limitation due to the new Intel Xeon Scalable architecture.

After speaking to multiple vendors, we found that this is a change in the Skylake-SP generation of Intel Xeon Scalable CPUs, and it has a big impact. It is also one that the AMD EPYC 7001 series cannot remedy.

Getting into this Mesh

One of the biggest changes with this generation is how the various parts of the Intel Xeon Scalable die are connected. We did an in-depth look at this in our piece: New Intel Mesh Interconnect Architecture and Platform Implications. For our purposes, the key change was moving from the company’s ring architecture to a mesh architecture to support higher core counts.

We actually picked up on the nuance of this change last year but missed its implication: what it means for PCIe topology. We focused on this in our first mesh article:

If you look at the Broadwell-EP (Intel Xeon E5-2600 V4) architecture, all of the integrated I/O (IIO) is on a single controller on the ring. We are using a low core count example here so you can see the two PCIe x16 controllers and the single PCIe x8 controller on the top-right ring stop. Single root servers would typically use one of the PCIe 3.0 x16 controllers to connect to a PCIe switch, and through it to half the GPUs. The other PCIe 3.0 x16 controller would connect to another PCIe switch and the other half of the GPUs. Since everything sits behind a single IIO block, all GPUs shared a PCIe root, and for deep learning everything worked well.

Since there is one IIO component, the industry colloquially equated one CPU with one PCIe root. It was this way for four generations of Intel Xeon E5 CPUs.

Now let us take the Intel Skylake-SP 28-core die used in high-end CPUs like the Intel Xeon Platinum 8180 and Xeon Platinum 8176. As you can see, there are three external PCIe 3.0 x16 controllers atop the mesh. There is a fourth, labeled “On Pkg PCIe x16,” which can be used for on-package components such as Omni-Path “F” series parts. For our purposes, the other three are the ones we are interested in. Note that each has its own stop on the mesh.

Digging a bit deeper, we can see why these PCIe x16 controllers are different. Here is what the Processor Integrated I/O looks like on the Skylake-SP mesh, where each PCIe 3.0 x16 controller sits on its own mesh stop:

Instead of a single big IIO block, the I/O is split into multiple smaller IIO blocks, each with its own traffic controller and caches. For deep learning servers, that means each PCIe x16 controller that connects to a 96-port PCIe switch and its downstream GPUs is on a different mesh interconnect stop. For PCIe enumeration and root complex reasons that we are not going into here, this has an implication: the architecture is not single root like the Intel Xeon E5 generation, where all controllers sat in the monolithic IIO ring stop.
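One practical way to check whether all the GPUs in a box actually share a PCIe root on Linux is to look at where each device sits in sysfs: each root complex appears as its own `pciDDDD:BB` directory under `/sys/devices`. The sketch below uses hypothetical device paths, and `pcie_root` and `group_by_root` are our own helper names, not a standard API:

```python
from collections import defaultdict

def pcie_root(sysfs_path):
    """Return the root complex component ('pciDDDD:BB') of a sysfs device path."""
    for part in sysfs_path.split("/"):
        if part.startswith("pci") and ":" in part:
            return part
    raise ValueError("no PCIe root complex found in: " + sysfs_path)

def group_by_root(device_paths):
    """Map each PCIe root complex to the devices sitting beneath it."""
    groups = defaultdict(list)
    for path in device_paths:
        groups[pcie_root(path)].append(path)
    return dict(groups)

# Hypothetical sysfs paths for three GPUs. On a real system you would
# resolve the /sys/bus/pci/devices/<bus:dev.fn> symlinks with os.path.realpath().
gpus = [
    "/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0",
    "/sys/devices/pci0000:17/0000:17:00.0/0000:19:00.0",
    "/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0",
]
roots = group_by_root(gpus)
print("single root" if len(roots) == 1 else "multiple roots: %d" % len(roots))
```

If the GPUs resolve under more than one `pciDDDD:BB` directory, as they will when they hang off different Skylake-SP IIO stacks, the system is not single root regardless of how many GPUs sit behind one CPU.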

Next, we are going to put the pieces together and show what all of this means. We are also going to discuss AMD EPYC Naples and where it fits into the picture. Finally, we are going to give some of our closing thoughts.