A massive lineup

Intel prepares Skylake-SP for the data center.

The number and significance of the product and platform launches occurring today with the Intel Xeon Scalable family is staggering. Intel is launching more than 50 processors and 7 chipsets under the Xeon Scalable product brand, targeting data centers and enterprise customers in a wide range of markets and segments. From SMB users to “Super 7” data center clients, the new Xeon lineup likely has an option targeting each of them.

All of this comes at an important point in time, with AMD fielding its new EPYC family of processors and platforms, becoming competitive in the space for the first time in nearly a decade. That decade of clear dominance in the data center has been good to Intel, giving it the ability to bring in profits and high margins without the direct fear of a strong competitor. Intel did not spend those 10 years flat-footed though; instead it has been developing complementary technologies including new Ethernet controllers, ASICs, Omni-Path, FPGAs, solid state storage tech and much more.

Our story today will give you an overview of the new processors and the changes that Intel’s latest Xeon architecture offers to business customers. The Skylake-SP core has some significant upgrades over the Broadwell design before it, but in other aspects the processors and platforms will be quite similar. What changes can you expect with the new Xeon family?

While on the surface this is a simple upgrade, there is a lot that gets improved under the hood:

- Per-core performance improves with the updated Skylake-SP microarchitecture and a new cache memory hierarchy that we had a preview of with the Skylake-X consumer release last month.
- The memory and PCIe interfaces have been upgraded with more channels and more lanes, giving the platform more flexibility for expansion.
- Socket-level performance goes up with higher core counts available and the improved UPI interface that makes socket-to-socket communication more efficient.
- AVX-512 doubles the peak FLOPS/clock on Skylake over Broadwell, beneficial for HPC and analytics workloads.
- Intel QuickAssist improves cryptography and compression performance to allow for faster connectivity implementation.
- Security and agility get an upgrade as well with Boot Guard, RunSure, and VMD for better NVMe storage management.

We already had a good look at the new mesh architecture used for inter-core communication. This transition away from the ring bus in use since Nehalem gives Skylake-SP a couple of unique traits: slightly longer latencies, but with more consistency and room for expansion to higher core counts.

Intel has changed the naming scheme with the Xeon Scalable release, moving away from “E5/E7” and “v4” to a Platinum, Gold, Silver, Bronze nomenclature. The product differentiation remains much the same, with the Platinum processors offering the highest feature support including 8-socket configurations, the highest core counts, highest memory speeds, connectivity options and more. To be clear: there are a lot of new processors, and trying to create an easy-to-read table of features and clocks is nearly impossible. The highlights of the different families are:

Xeon Platinum (81xx)
- Up to 28 cores
- Up to 8 sockets
- Up to 3 UPI links
- 6-channel DDR4-2666
- Up to 1.5TB of memory
- 48 lanes of PCIe 3.0
- AVX-512 with 2 FMA per core

Xeon Gold (61xx)
- Up to 22 cores
- Up to 4 sockets
- Up to 3 UPI links
- 6-channel DDR4-2666
- AVX-512 with 2 FMA per core

Xeon Gold (51xx)
- Up to 14 cores
- Up to 2 sockets
- 2 UPI links
- 6-channel DDR4-2400
- AVX-512 with 1 FMA per core

Xeon Silver (41xx)
- Up to 12 cores
- Up to 2 sockets
- 2 UPI links
- 6-channel DDR4-2400
- AVX-512 with 1 FMA per core

Xeon Bronze (31xx)
- Up to 8 cores
- Up to 2 sockets
- 2 UPI links
- No Turbo Boost
- 6-channel DDR4-2133
- AVX-512 with 1 FMA per core



That’s…a lot. And it only gets worse when you start to look at the entire SKU lineup with clocks, Turbo Speeds, cache size differences, etc. It’s easy to see why the simplicity argument that AMD made with EPYC is so attractive to an overwhelmed IT department.

Two sub-categories exist with the T or F suffix: the former indicates a 10-year life cycle (a thermally-focused designation), while the F is used to indicate units that integrate the Omni-Path fabric on package. M models can address 1.5TB of system memory. The diagram above shows the scope of the Xeon Scalable launch in a single slide. This release offers buyers flexibility, but at the expense of configuration complexity.

The Intel Xeon Scalable Processor Feature Overview

Though the underlying architecture of the Skylake-SP release we see today shares a lot with the Skylake-X family launched for the HEDT market last month, there are quite a few differences in the platform.

Features like 8-socket support, Omni-Path support, four 10GigE connections built into the chipset, VMD for storage, and QuickAssist Technology stand out.

Xeon Scalable processors will work in 2S, 4S and 8S configurations, with the third UPI link (the replacement for QPI) optional on the dual and quad configurations. The added UPI link lowers latency by requiring one less hop between sockets while increasing bandwidth for improved scalability.
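The hop-count claim is easy to see with a toy topology: four sockets with two UPI links each must form a ring, leaving some socket pairs two hops apart, while a third link allows a fully connected quad with every socket one hop from every other. A breadth-first-search sketch (the topologies below are illustrative, not a specific OEM design):

```python
from collections import deque

def max_hops(links):
    """Worst-case socket-to-socket hop count in an undirected topology."""
    worst = 0
    for start in links:
        dist = {start: 0}
        q = deque([start])
        while q:
            node = q.popleft()
            for peer in links[node]:
                if peer not in dist:
                    dist[peer] = dist[node] + 1
                    q.append(peer)
        worst = max(worst, max(dist.values()))
    return worst

# 4 sockets, 2 UPI links per socket: forced into a ring.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
# 4 sockets, 3 UPI links per socket: fully connected.
full = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
print(max_hops(ring), max_hops(full))  # 2 1
```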

This slide shows the feature differences between the previous generation Xeon and the new Xeon Scalable family. Core counts increase from 22 to 28, there is additional PCIe lane availability, and 50% more memory channels running at a higher frequency, but also a higher TDP range, going all the way up to 205 watts. With the recent discussion swirling around the heat of the Skylake-X processors, you have to wonder how Intel will address this concern with high core count parts for the data center.

Through a combination of architectural tweaks, including an improved branch predictor, higher throughput instruction decoder, better scheduling and larger buffers, Intel estimates that integer performance should see around a 10% improvement at the same clock speed when compared to Broadwell-EP.

AVX-512 is a big part of the improvement in floating point performance with Skylake-SP, offering 512-bit wide vector acceleration and double the peak FLOPS throughput, single and double precision, of the prior generation. Software needs to be coded directly for this new instruction set, of course, and enterprise software often lags due to compatibility concerns, so the benefit of AVX-512 will be judged over a longer period.
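The doubling is straightforward arithmetic: each 512-bit register holds eight doubles, and a fused multiply-add counts as two floating point operations. A quick sketch (assuming the two-FMA-unit configuration of the top SKUs):

```python
def peak_dp_flops_per_clock(vector_bits, fma_units):
    """Peak double-precision FLOPS per core per clock.

    Each FMA counts as 2 floating point ops, applied across
    every 64-bit lane of the vector register.
    """
    lanes = vector_bits // 64     # doubles per register
    return lanes * 2 * fma_units  # 2 ops per FMA

broadwell = peak_dp_flops_per_clock(256, 2)   # AVX2 with 2 FMA units
skylake_sp = peak_dp_flops_per_clock(512, 2)  # AVX-512 with 2 FMA units
print(broadwell, skylake_sp)  # 16 32
```

The 1-FMA Silver and Bronze parts would land at 16 DP FLOPS/clock by the same math, matching Broadwell's 2-FMA AVX2 peak.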

Interestingly, the physical layout of the Skylake core has been extended for the Skylake-SP release, with the diagram above being close to scale based on our conversations with Intel. The extra 768KB of L2 cache sits outside the main core, which does add a clock or two of latency.

We already know that AVX code forces a CPU core to use more power, but with Skylake-SP Intel has implemented the ability for each core to dynamically adjust its Turbo state based on the code being executed on it.
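Conceptually, each core now selects its turbo ceiling from a frequency table keyed by the heaviest vector instructions it is running (Intel documents these as non-AVX, AVX2, and AVX-512 "license" levels); the clock values below are made up for illustration, not real SKU numbers:

```python
# Illustrative per-core turbo table keyed by instruction license level.
# Real parts publish separate non-AVX, AVX2, and AVX-512 turbo clocks;
# these MHz figures are invented for the sketch.
TURBO_MHZ = {"non_avx": 3600, "avx2": 3300, "avx512": 2900}

def core_turbo(license_level):
    """Each core independently selects its ceiling from the table."""
    return TURBO_MHZ[license_level]

# Mixed workload: only the cores executing AVX-512 clock down.
cores = ["non_avx", "non_avx", "avx512", "avx2"]
print([core_turbo(c) for c in cores])  # [3600, 3600, 2900, 3300]
```

The point of the per-core mechanism is visible in the last line: scalar threads no longer pay the frequency penalty of an AVX-512 thread running on a neighboring core.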

Along with the improvements in scalability that the mesh architecture offers Skylake-SP, Intel has distributed the cache and home agents across each of the core nodes. This gives the processor the ability to have more memory requests in flight simultaneously compared to the previous generation, which funneled them through only a few centralized agents.

I already talked about the cache redesign with Skylake-X and SP and even detailed some of the performance differences it accounts for. This new design has a smaller L3 cache but a larger L2 for better thread-memory locality while moving to a non-inclusive design. The benefits for the data center include improved virtualization performance with a larger private L2 cache per core and a reduction in uncore activity due to fewer external cache memory requests.
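To put rough numbers on the trade-off (per-core figures assumed from Intel's published specs: Broadwell-EP paired a 256KB L2 with 2.5MB of inclusive L3 per core, Skylake-SP a 1MB L2 with 1.375MB of non-inclusive L3), a sketch of unique per-core cache capacity:

```python
def unique_capacity_kb(l2_kb, l3_kb, inclusive):
    """Usable per-core cache capacity.

    An inclusive L3 duplicates the L2's contents, so the L2 adds
    no unique capacity; a non-inclusive L3 does not duplicate it.
    """
    return l3_kb if inclusive else l2_kb + l3_kb

broadwell = unique_capacity_kb(256, 2560, inclusive=True)    # 2560 KB
skylake = unique_capacity_kb(1024, 1408, inclusive=False)    # 2432 KB
print(broadwell, skylake)
```

Total capacity is roughly a wash; the win is that far more of it is now private L2 sitting close to the core rather than shared L3 out on the mesh.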

This graph shows the relative change in cache misses from Broadwell-EP to Skylake-SP in various workloads. Intel is upfront that in some cases, the L3 cache misses are in fact worse than Broadwell, but when the L2 cache hits improve, they can improve DRASTICALLY. Look at a workload like POV-Ray that has essentially been reduced to zero cache misses thanks to the larger capacity of the L2.

The memory system on Skylake-SP gets a significant upgrade this generation, moving from a quad-channel to a 6-channel memory controller, with many processor SKUs offering DDR4-2666 support as well. This equates to a 60% improvement in total memory bandwidth availability per socket compared to Broadwell-EP. By utilizing the mesh architecture and breaking the memory controller into two 3-channel segments at opposite sides of the die, Intel could lower the distance and number of hops (on average) between any single core and memory.
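As a back-of-the-envelope check (assuming 64-bit channels and Broadwell-EP's quad-channel DDR4-2400 as the baseline), the theoretical peak per socket works out to:

```python
def peak_mem_bw_gbs(channels, mt_per_s):
    """Theoretical peak bandwidth: channels x transfers/s x 8 bytes
    per 64-bit transfer, expressed in GB/s."""
    return channels * mt_per_s * 8 / 1000

broadwell_ep = peak_mem_bw_gbs(4, 2400)  # ~76.8 GB/s
skylake_sp = peak_mem_bw_gbs(6, 2666)    # ~128 GB/s
print(f"{broadwell_ep:.1f} GB/s -> {skylake_sp:.1f} GB/s "
      f"({skylake_sp / broadwell_ep - 1:.0%} more)")
```

The raw peak comes out closer to 67%; Intel's quoted 60% figure is presumably the more conservative, achievable number rather than the theoretical maximum.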

Replacing QPI is the UPI, Ultra Path Interconnect. With a combination of improved messaging efficiency and higher clock rates, the UPI links can go as high as 10.4 GT/s, up from 9.6 GT/s on QPI. (Though some of the new CPUs will still run at 9.6 GT/s.)
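Each QPI/UPI link carries 16 data bits (2 bytes) per transfer in each direction, so the raw per-link bandwidth (my arithmetic, not an Intel marketing figure) works out to:

```python
def link_bw_gbs(gt_per_s, bytes_per_transfer=2):
    """Per-direction bandwidth of a QPI/UPI link in GB/s,
    assuming 16 data bits moved per transfer."""
    return gt_per_s * bytes_per_transfer

qpi = link_bw_gbs(9.6)    # 19.2 GB/s each way
upi = link_bw_gbs(10.4)   # 20.8 GB/s each way
print(qpi, upi)
```

The clock bump alone is only about 8%; the rest of UPI's gain comes from the improved messaging efficiency the slide describes.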

Intel VMD, Volume Management Device, is a new capability for Skylake-SP that enables a way to aggregate multiple NVMe SSDs into a single volume while enabling other features like RAID at the same time. This occurs outside of the OS, handling all enumeration and events/errors inside the domain. I am working with Allyn to do a deeper dive on what this means for enterprise and data center storage so hopefully we will have more details soon.

Optimized Turbo profiles on Skylake-SP allow the CPU to run at higher frequencies than the previous generation by moving away from the simplistic “one clock bin per active core” algorithm. This basically means that the CPU will be running at higher frequencies more often when partially loaded, but doesn’t change the fundamental behavior under a full load.
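The legacy algorithm being replaced is easy to sketch: the turbo ceiling dropped one bin (typically 100MHz) for each additional active core, regardless of what those cores were doing. A toy illustration with a hypothetical 3700MHz single-core turbo:

```python
def legacy_turbo_mhz(max_turbo_mhz, active_cores, bin_mhz=100):
    """Old-style turbo: lose one bin per core active beyond the first.
    (Real parts also floor at the base clock, omitted here.)"""
    return max_turbo_mhz - bin_mhz * (active_cores - 1)

# Hypothetical part: ceiling falls linearly as cores wake up.
for n in (1, 2, 4, 8):
    print(n, legacy_turbo_mhz(3700, n))  # 3700, 3600, 3400, 3000
```

Skylake-SP's profiles replace that single linear ramp with per-SKU curves that hold higher clocks deeper into partial load.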

This is also the first time that Speed Shift is introduced at the data center level, bringing faster clock speed changes to the server market. While in the consumer space this feature is meant to improve the responsiveness of a notebook or PC, for the server market it allows for better control over the energy/performance metrics.

By using software to moderate the aggressiveness of the clock speed increases, OEMs can offer slower or faster clock and power draw ramps depending on the workload and its requirements.

Available only on the Platinum and the 61xx Gold series of processors, Intel offers a handful of Xeon parts with integrated Omni-Path fabric on package, utilizing 16 dedicated lanes of PCIe 3.0 for connectivity. This is a physical connector on the processor itself and will require specific platform development, though a board can be designed to accept both types of processors.