Following an exclusive report from SemiAccurate, and confirmed by Intel through ServeTheHome, the news on the wire is that Intel is set to cancel widespread general availability of its Cooper Lake line of 14nm Xeon Scalable processors. The company will only make the hardware available to priority scale-out customers who have already designed quad-socket and eight-socket platforms around the hardware. This is a sizeable blow to Intel’s enterprise plans, putting the weight of Intel’s future x86 enterprise CPU business solely on the shoulders of its 10nm Ice Lake Xeon line, which has already seen significant multi-quarter delays from its initial release schedule.

Intel’s Roadmaps Gone Awry

In August 2018, Intel held a Data-Centric Innovation Summit, where the company laid out its plans for the Xeon CPU roadmap. Fresh off the recent hires of Raja Koduri and Jim Keller over the previous months, the company was keen to present itself as more ‘data-centric’ rather than just ‘PC-centric’. This meant going after more than just the CPU market: also IoT, networking, FPGA, AI, and general ‘workload-optimized’ solutions. Even with all the messaging, it was clear that Intel’s high market share in the traditional x86 server business was a key part of its current revenue stream, and the company spent a lot of time talking about the CPU roadmap.

At the time, the event marked the first anniversary of Skylake Xeon, which launched in mid-2017 (the consumer parts in Q3 2015). The roadmap as laid out at this event was to launch Cascade Lake, 2nd Generation Xeon Scalable, by the end of Q4 2018 on 14nm; Cooper Lake, 3rd Gen Xeon Scalable, in 2019 on 14nm; and Ice Lake, 4th Gen Xeon Scalable, in 2020 on 10nm.

Cascade Lake eventually hit the shelves in April 2019, and took a while to filter down to the rest of the market.

Cooper Lake, on the other hand, was added to the roadmap rather late in the product cycle for 14nm. With Intel’s known 10nm delays, the company decided to add another range of products between Cascade Lake and Ice Lake, with the key new feature being support for bfloat16 instructions inside the AVX-512 vector units.

Why is BF16 Important?

The bfloat16 format is a targeted way of representing numbers that gives the dynamic range of a full 32-bit float in the data size of a 16-bit number: it keeps the full eight-bit exponent of a float32, but trims the fraction down, so values keep their range while losing some precision. The format has a lot of uses inside machine learning algorithms, which tolerate the reduced precision well, while fitting double the data into any given dataset (or doubling the throughput in those calculation sections).

A standard float splits its bits into the sign, the exponent, and the fraction. For a normalized value, this is interpreted as:

value = (-1)^<sign> × 1.<fraction> × 2^(<exponent> − bias)

For a standard IEEE 754 half-precision number (float16), there is one bit for the sign, five bits for the exponent, and 10 bits for the fraction, with an exponent bias of 15. The idea is that this gives a good mix of precision for fractional numbers while also offering numbers large enough to work with.
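As an illustration of how those fields combine, here is a minimal half-precision decoder (the function name is my own, and subnormals, infinities, and NaNs are deliberately ignored to keep the sketch short):

```python
def decode_float16(bits):
    """Decode a 16-bit IEEE 754 half-precision bit pattern (normal numbers only)."""
    sign = (bits >> 15) & 0x1        # 1 sign bit
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits, bias 15
    fraction = bits & 0x3FF          # 10 fraction bits, implicit leading 1
    return (-1.0) ** sign * (1 + fraction / 2 ** 10) * 2.0 ** (exponent - 15)
```

For example, decode_float16(0x3C00) gives 1.0, and decode_float16(0x4248) gives 3.140625, the closest half-precision value to pi.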

What bfloat16 does is use one bit for the sign, eight bits for the exponent, and seven bits for the fraction. This data type is meant to give 32-bit style range, but with reduced accuracy in the fraction. As machine learning is resilient to this loss of precision, workloads that would have used a 32-bit float can now use a 16-bit bfloat16.

These can be represented as:

Data Type Representations

Type       Bits   Exponent   Fraction   Precision   Range   Speed
float32      32          8         23        High    High    Slow
float16      16          5         10         Low     Low    2x Fast
bfloat16     16          8          7       Lower    High    2x Fast



[Bit layout diagrams: bfloat16 (BF16) and float32 (FP32). Images from Wikipedia]
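Since bfloat16 is simply the top 16 bits of a float32, the conversion can be sketched with a few lines of bit manipulation (an illustrative sketch, not Intel’s hardware implementation; the function name is my own):

```python
import struct

def to_bfloat16(x):
    """Round a float to bfloat16 precision (round-to-nearest-even), returned as a float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    bias = 0x7FFF + ((bits >> 16) & 1)                   # round half to even
    bf16_bits = (bits + bias) >> 16                      # keep sign + 8 exp + 7 frac
    return struct.unpack(">f", struct.pack(">I", bf16_bits << 16))[0]
```

Note how range is preserved while precision drops: to_bfloat16(1e38) stays finite (a float16 would overflow past ~65504), while to_bfloat16(0.1) comes back as 0.10009765625.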

As far as we understand, there was one big customer for bfloat16: Facebook.

Cooper Lake Set For PrimeTime, Or Not

Aside from bfloat16 support, we were told that Cooper Lake was to share a socket with Ice Lake, which should mean eight-channel memory support based on data provided by Intel’s partners.

In August 2019, Facebook detailed its Zion Unified Training Platform (ZION), in use inside its datacenters. The platform was designed to handle machine learning training algorithms with sparse datasets; in the presentation, the company said that it expects that data to grow 3x year on year, and as such it was building specific training systems to meet these requirements. Part of that angle was software and hardware co-design, such that the hardware could be put to use as quickly as possible and the software could do what was required to achieve the results.

In order to achieve some of this, Facebook stated during its presentation that it was using a unified bfloat16 format across both CPUs and accelerators, in order to take advantage of both high-capacity DDR and high-bandwidth HBM. By keeping all the data in the same format as it is transferred between the CPU fabric and the accelerator fabric, the system saves the power and time that would otherwise go on conversions. It also means the software stack does not have to worry about whether it is working on a CPU or an accelerator, as the number format is the same everywhere.

Note that when this presentation was made, in August 2019, there was no CPU on the market that supported bfloat16. In the Q&A session, I asked a couple of questions: first, to clarify that the processor has to have BF16 support (answer: that is how the specification is defined), and second, which CPU was in use. Misha Smelyanskiy from Facebook, who presented the topic, stated that he could probably mention which CPU they were using, but to be on the safe side he wasn’t going to; he did at least point out that Intel had made an announcement about supporting the standard (that’s code for using Intel, given how closely the two companies work together).


The scope of the Zion platform is built around dual socket blades acting in a glueless-logic fashion as an eight-socket system. A 4U accelerator system would be attached to these sets of blades, creating a ZION in about 8U of server space. Each ZION would be connected in a hypercube mesh to other ZIONs.

It’s important to note that the Facebook presentation was talking about systems they were already working on in their datacenters. They already had Cooper Lake silicon up and running. Given that we expect Cooper to be very similar to Cascade, this wasn’t a surprise – supporting bfloat16 in current hardware should only require a small hardware change and a firmware change at most.

Normally Intel works with its lead partners on upcoming silicon. From our best estimates, there are various timeframes on how these relationships work. If we take the official launch date as the point of ‘general availability’, then priority partners are likely to start getting silicon up to 12 months in advance of that date.

This 12-month-before-launch silicon is typically early engineering sample (ES) silicon, with the potential to have bugs and likely only working at around 1 GHz – or might not have the memory controllers working properly. With this hardware, the priority partner can start to plan their systems and build their software to be optimized for the hardware coming later. Sometimes these CPUs make their way onto eBay and such, which is why we sometimes see big CPUs at low frequencies being sold off cheap.

Around six months before launch, the priority partners are likely to have qualification sample (QS) units, which are essentially near-final silicon. A lot of these QS units actually go into deployed systems with those priority partners, especially on internal projects that aren’t public facing. This means that partners like Google, Facebook, Microsoft, Tencent and others could be using +1 generation hardware on their back-end services and the public (even the tech public) wouldn’t even know about it.

In this instance, Facebook has had Cooper Lake silicon in-house for a long time. They already had it in August 2019, and given that Intel originally said that Cooper Lake was set for a 2019 launch, we expect the silicon in hand at Facebook to be near-final QS silicon, and the launch was only a short way away. Intel had stated as far back as May 2019 that Cooper Lake (and Ice Lake) were already sampling with customers.

2019 ended and Cooper Lake was nowhere in sight. At Intel’s 2019 investor meeting, Cooper Lake had shifted into a 2020 time slot. This is despite Intel promising a cadence speedup between new platforms.

Cooper Lake Canned (Well, It Is For You and Me)

Today’s news is that Intel is pulling the plug on Cooper Lake. That’s despite being a product that was ES sampling as early as 18 months ago, potentially QS sampling 12 months ago, and should be out already. If you wanted a single socket or a dual socket Cooper Lake server, then bad luck – Intel is set to only sample Cooper Lake to key customers (Facebook) who are driving quad-socket and eight-socket systems.

As reported at ServeTheHome, Intel gave the following guidance. We’ve split it into several segments to discuss what is being said.

Given the continued success of our recent expansion of 2nd Gen Xeon Scalable products, in addition to customer demand for our upcoming 10nm Ice Lake processors, we have decided to narrow the delivery of our Cooper Lake products that best meets our market demand.

Intel’s upcoming Cooper Lake processors will be supported on the Cedar Island platform, which supports standard and custom configurations that scale up to 8 sockets.

Customers, including some of the largest AI innovators today, are uniquely interested in Cooper Lake’s enhanced DL Boost technology including the industry’s first inclusion of bfloat16 instruction processing support. We expect strong demand for the technology and processing capability with certain customer segments and AI usages in the marketplace that support deep learning for training and inference use cases.

We continue to expect delivery of Cooper Lake starting in the first half of 2020.

This is the meat of the announcement – it essentially reads that if you’re a key customer for Cooper Lake already, then you’ll get it by the end of Q2 this year. These key customers are obviously big players and will want thousands of systems, which is where Intel sees ‘strong demand’.

Intel constantly evaluates our product roadmaps to ensure we are positioned to deliver the best silicon portfolio for data center platforms.

Intel’s upcoming 10nm Ice Lake processors will be introduced on the upcoming Whitley platform.

Intel remains on track for delivery of 10nm Ice Lake CPUs later this year.

This is Intel’s ‘we have the right to adjust our roadmaps as we see fit’ clause. If I were reading between the lines, there might be an upside to these three statements: Intel might be more confident in Ice Lake than most people expect. Either that, or Intel is set to put all of its enterprise CPU eggs into a single 10nm basket.

With Intel narrowing the scope of Cooper Lake to key customers, I highly doubt that we’re going to get samples for review.

Personal Thoughts on Cooper Lake

Personally, I’m of the opinion that Cooper Lake wasn’t actually meant to exist, at least not as a named part. We already suspect that Cooper Lake was added into the roadmap to ease the transition between Cascade Lake and Ice Lake, and that made it feel like a bit of a knee-jerk reaction to Intel’s woes. But we know that Intel already does a number of custom silicon designs for its partners, or silicon with custom firmware to get access to parts of the die that others don’t. However, I’m not entirely sure that it was intended to be a whole family of products. Intel could have happily supplied Cascade Lake silicon to Facebook (and others) that wanted BF16, and it was just enabled through firmware. No-one outside of those companies would have known about it, and we’d all be on Ice Lake by now (or Facebook would have been on BF16 versions of Ice Lake, which could potentially exist).

But because of the delay, this custom edition CPU became a platform all on its own. Intel had to gear up a design for mass market, including adjusting the memory channels and the socket. It was no longer just a Cascade-Plus part, but a whole new platform. That required Intel’s partners to redesign their stack to compensate (if they didn’t know already).

But now Intel has cut the high-volume parts of the Cooper Lake story away, leaving only the core for its key partners. This brings me back to my original theory: these Cooper Lake parts were only ever meant to be a Cascade-Plus type design for key customers. It is similar to how in-between process node advancements used to go unpublicised (e.g. several improvements on 45nm before 32nm to get better voltage/frequency), except in this case Cooper Lake is heading back into the shadows. Who knows how many other custom variants of Intel’s CPUs exist, with different instructions, different QoS policies, or different RAS options. Intel doesn’t talk about them unless it needs to.

Focusing on Ice Lake 2020 #IceLake2020

If we consider Cooper Lake done and dusted for the mass market, the attention turns to Ice Lake, and Intel’s ability to execute on 10nm. Intel technically showed off what it called an Ice Lake CPU in December 2018 at its Architecture Day, but as we’ve seen in the mobile and desktop space, 10nm is currently having a hard time.



Sailesh Kottapalli of Intel, showing an Ice Lake Xeon CPU, which was set to share a socket with Cooper Lake

There have been repeated reports about Intel’s Ice Lake Xeon delays, some as recent as December 2019, saying that the platform has been delayed again and again, putting into doubt whether Intel can reach general availability for Ice Lake Xeon inside 2020. There are also discussions about core counts, frequencies, power, and whether Intel will have to move to a dual-die strategy for Ice Lake in order to keep pace on core count with other x86 and Arm competitors, who are hitting 64 cores per socket.

Intel’s CFO George Davis has already stated Intel’s position on its 10nm portfolio: the financial return of the process is likely to be lower than that of 14nm and 22nm, and the company is looking to its 7nm process to regain parity with the competition. If there are subsequent delays to Ice Lake, or the platform ends up looking substantially different from the monolithic designs we’ve all been expecting to this point, then Intel is going to take a hit to both adoption of the new platform and its bottom line.

Customers moving from old Broadwell Xeon (Q1 2016) or Skylake Xeon (Q3 2017) systems, who normally have a 3-5 year update cycle, are desperately looking for something to upgrade to.

Sources: SemiAccurate, ServeTheHome