There are lots of stories to tell from the SuperComputing 15 conference here in Austin, but a clear overriding theme – in order to reach ‘Exascale’ (the fancy name given to where a supercomputer hits one ExaFLOP, 1018 FLOPS, in LINPACK), PCIe co-processors and accelerators are going to be a vital aspect in that. Looking at the Top500 supercomputers list that measures supercomputers in pure FLOPS, or the Green/Graph 500 lists that focus on FLOPS/watt and graph compute performance respectively (graph compute such as social networking linked lists of users (nodes) and relationships (edges)), it is clear that focused silicon is key to get peak performance and performance per watt. That focused silicon is often around NVIDIA’s Tesla high performance computing cards, FPGAs (field programmable gate arrays) from Xilinx or Altera, focused architecture designs such as PEZY, or Intel’s Xeon Phi co-processing cards.

We’ve reported on Xeon Phi before, regarding the initial launch of the first generation Knights Corner (KNC) with 6GB, 8GB or 16GB of onboard memory. These parts are listed on Intel's ARK at $1700 to $4150 (most likely less for bulk orders) but KNC forms the backbone of the compute behind the world’s number 1 supercomputer, the Tianhe-2 in China. Over the course of SC15, more details have emerged about the 2nd generation, Knights Landing (KNL) regarding availability and memory configuration, using up to 16GB of onboard high-bandwidth memory using a custom protocol over Micron’s HMC technology and eight onboard HMC memory controllers.

As part of a briefing at SC15, Intel had a KNL wafer on hand to show off. From this image, we can see about 9.4 dies horizontally and 14 dies vertically, suggesting a die size (give or take) of 31.9 mm x 21.4mm, or ~683 mm2. This die size comes in over what we were expecting, but comes in line with other predictions about the route of the first gen, Knights Corner, at least. Relating this to transistor counts, we have a differing story of Charlie Wuischpard (VP of Intel’s Data Center Group) stated 8 billion transistors to us at the briefing but there are reports of Diane Bryant (SVP / GM, Data Center Group) stated 7.1 billion at an Intel Nov ’14 investor briefing, but we can only find one report of the latter. This would come down to the wobbly metric of 10.4-11.7 million transistors per square millimeter.

The interesting element about KNL is that where the 1st generation KNC was only available as a PCIe add-in coprocessor card, KNL can either be the main processor on a compute node, or as a co-processor in a PCIe slot. Typically Xeon Phi has an internal OS to access the hardware, but with this new model it eliminates the need for a host node – placing a KNL in a socket will give it access to both 16GB of high speed memory (the MCDRAM) as well as six memory channels for up to 384GB of DDR4, at the expense of the Intel Omni-Path controller. The KNL will also have 36 PCIe lanes which can host two more KNC co-processor cards and another four for other purposes.

As you might expect, due to the differences we end up with the same die on different packages – one as a processor (top) and one as a co-processor which uses an internal connector for both data and Omnipath. Given the size of the die and the orientation of the cores in the slide above (we can confirm based on the die that it’s a 7x6 arrangement taking into account memory controllers and other IO), the fact that the heatspreader over the package is in a non-rectangular shape is due to the MCDRAM / custom-HMC high speed memory.

If we look into the side, the headspreader clearly does not go down onto the package all the way around, in order to deal with the on-package memory.

The connector end of the co-processor has an additional chip under the heatspreader, which is most likely an Intel Omni-Path connector or a fabric adaptor for adjustable designs. This will go into a PCIe card, cooler applied and then sold.

The rear of the processor is essentially the same for both.

When it comes to actually putting it on a motherboard, along with the six channels of DDR4, we saw Intel’s reference board on the show floor as a representation on how to do it. A couple of companies had it on display, such as Microway, but this is a base board for others such as Supermicro or Tyan to build on or provide different functionality:

The board is longer than ATX, but thinner. I can imagine seeing one of the future variants of this in a half-width 1U blade with all six channels and two add-in boards, pretty easily. This might give six KNL in a 1U implementation, as long as you can get rid of the heat.

The socket was fairly elaborate, and it would come across that Intel has a high specification on specific pressure for heatsinks . But it is worth noting that the board design does not have an Omni-Path connection, and thus for a server rack then either additional KNLs should be added or PCIe Omni-Path cards need to be added. But this setup should help users wanting to exploit Xeon Phi in a single 1P node and MCDRAM, running the OS on the Xeon Phi itself without MPI commands to use extra nodes. There seems to be a lot of the HPC community here at SC15 who are super excited about KNL.

Regarding availability of Knights Landing, we were told that Intel has made sales for installs in Q4, but commercial availability will be in Q1. We were also told to expect parts without MCDRAM as well.

Additional: On a run through the SC15 hall, I caught this gem from SGI. It's a dual socket motherboard, but because KNL processors have no QPI links and can only in 1P systems, this would be equivalent of two KNL nodes in one physical board. Unfortunately they had a plastic screen in front, which distorts the images.