NOTE: there are many contradicting sources out there, so there may be mistakes in this article. Please give me feedback via Twitter, mail or comments, so all the info can be completed.

Yes, another post in the answer-to series. At SC12 Intel tries to steal the show from the Tesla K20 and FirePro S10000.

After two years of waiting, Intel finally comes with an accelerator-card: the Xeon Phi. Compare it to NVIDIA skipping the GTX 200 series and presenting the GTX 500 series right away. Or maybe even the GTX 600 series – we cannot tell yet.

The Phi is not a compute-card as we know it. Just as you cannot do a 1-to-1 comparison between the AMD GCN architecture and NVIDIA Kepler, neither can easily be compared to the Phi. But this article should give an idea of where it is positioned.

The architecture

It contains 60 cores with a vector-width of 512 bits (8 times 64 bits). This means that per clock-tick each of the 60 cores can do one computation on an 8-wide vector of double precision floats (SIMD). Compare this to an AMD card, which has several hundreds of cores with support for 4-wide vectors of single precision floats (VLIW). At 1.053 GHz this gives 1.053 * 60 * 8 * 2 = 1011 GFLOPS.
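The peak figure above is a plain product of clock speed, core count, vector lanes and operations per lane. A minimal sketch of that arithmetic follows; the function and constant names are made up for illustration, and the input numbers are the ones quoted in this article.

```python
# Theoretical peak = clock (GHz) * cores * SIMD lanes * flops per lane per cycle.
# All names are illustrative; the figures are the ones quoted in the article.

def peak_gflops(clock_ghz, cores, simd_lanes, flops_per_lane_per_cycle):
    """Return the theoretical peak in GFLOPS."""
    return clock_ghz * cores * simd_lanes * flops_per_lane_per_cycle

PHI_CLOCK_GHZ = 1.053      # clock speed of the Phi 5110P
PHI_CORES = 60             # number of cores
DP_LANES = 512 // 64       # 512-bit vector / 64-bit double = 8 lanes
FLOPS_PER_LANE = 2         # one multiply + one add per cycle

phi_dp_peak = peak_gflops(PHI_CLOCK_GHZ, PHI_CORES, DP_LANES, FLOPS_PER_LANE)
print(f"Phi double precision peak: {phi_dp_peak:.0f} GFLOPS")
```

This is a theoretical number only: it assumes every core issues a full-width multiply + add on every clock-tick.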

The factor 2 above is there because it is capable of doing MAD-operations: a multiply + an add. This means that if you have a multiply-operation, you get an add for free – if not, then you get 0.5 TFLOPS only. For more information, check “Fused Multiply-Add” on page two of Differences in floating-point arithmetic between Intel® Xeon® processors and the Intel® Xeon Phi™ coprocessor [PDF].

Most interesting would be to know how well the scheduler is implemented. If there is one (full) scheduler per core, then the Phi will be much easier to program than an accelerator of AMD or NVIDIA. Do note that upcoming architectures of the two GPU-vendors are much more advanced in this respect.
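To see how much the multiply + add pairing matters, here is a rough sketch of the peak as a function of how many vector instructions are multiply-adds. The model is an assumption for illustration, not an Intel specification: each core issues one vector instruction per clock-tick, counting 2 operations per lane for a multiply-add and 1 otherwise. The function name is hypothetical.

```python
# Simplified model (an assumption, not taken from Intel documentation):
# one vector instruction per core per cycle, worth 2 flops per lane when it
# is a multiply-add and 1 flop per lane otherwise.

def dp_gflops(fma_fraction, clock_ghz=1.053, cores=60, lanes=8):
    """Peak double precision GFLOPS for a given share of multiply-adds."""
    if not 0.0 <= fma_fraction <= 1.0:
        raise ValueError("fma_fraction must be between 0 and 1")
    flops_per_lane = 1.0 + fma_fraction   # average flops per lane per cycle
    return clock_ghz * cores * lanes * flops_per_lane

for share in (0.0, 0.5, 1.0):
    print(f"{share:.0%} multiply-adds: {dp_gflops(share):.0f} GFLOPS")
```

With no multiply-adds at all the model lands on the 0.5 TFLOPS mentioned above, and only code consisting purely of multiply-adds reaches the full 1 TFLOPS.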

There is no official information that single precision is double the performance of double precision – clear is that they focus on double precision. It has a strong focus on cache-sizes (± 1.8 MB L1 and ± 30 MB L2 cache, most likely summed over all cores and not per core) and a high memory bandwidth (320 GB/s → ± 5.33 GB/s per core) – both will increase programmability of the accelerator. This makes it easier to write code that runs at 70% or better.
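The per-core bandwidth is the total bandwidth divided by the number of cores. A small sketch of that division follows, together with the bytes available per double precision operation at peak; that second ratio is an addition of mine to show how bandwidth and compute relate, and the helper names are hypothetical.

```python
# Figures as quoted in the article; helper names are illustrative only.

PHI_BANDWIDTH_GBS = 320.0                # total memory bandwidth in GB/s
PHI_CORES = 60
PHI_DP_PEAK_GFLOPS = 1.053 * 60 * 8 * 2  # theoretical double precision peak

def bandwidth_per_core(total_gbs, cores):
    """Memory bandwidth in GB/s if it is shared evenly by all cores."""
    return total_gbs / cores

def bytes_per_flop(total_gbs, peak_gflops):
    """Bytes that can be fetched from memory per floating point operation."""
    return total_gbs / peak_gflops

per_core = bandwidth_per_core(PHI_BANDWIDTH_GBS, PHI_CORES)
balance = bytes_per_flop(PHI_BANDWIDTH_GBS, PHI_DP_PEAK_GFLOPS)
print(f"{per_core:.2f} GB/s per core, {balance:.2f} bytes per DP flop")
```

At roughly a third of a byte per operation, a kernel has to reuse data from cache many times to get near the peak, which is why the cache-sizes matter so much.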

The Phi is special in more ways. When the Phi was still called Knights Corner, it was mentioned that it comes pre-loaded with an embedded Linux. This means it is a computer on its own. You can read more about it here.

Knowing this capability of the Phi, it is strange that it is so strongly positioned to be used next to a strong CPU. Also for future releases Intel focuses its system-architecture on combining an Intel Phi with an (Intel) CPU (see image: blue is CPU, yellow is Phi).

This is a different approach than what is popular with other chip-designers, who try to find ways to put the accelerator on the same die as the CPU. But as the interconnect-war is currently heating up, we cannot draw any conclusion from this. Think of the various ways the 386/486 co-processors could be connected to the motherboard/CPU – this time, too, nothing has been decided yet.

Programming models

Intel chose the safest way to attract as many developers as possible: support all models. This list could be shortened later for the sake of vendor lock-in, but for now we can enjoy it. The image below is taken from this PDF. Of course OpenCL is in it too.

Performance comparison

A comparison to both competitors. There are many, many sources all claiming different things, so I will update these tables a lot in the coming time.

Tesla K20:

| Functionality | Xeon Phi 5110P | Tesla K20 |
|---|---|---|
| GPU-Processor count | 1 | 1 |
| Architecture | x86 Knights Corner | Kepler GK110 |
| Memory per GPU-processor | up to 8 GB GDDR5. No ECC! | 6 GB GDDR5 (5 GB w/ ECC) |
| Memory bandwidth per GPU-processor | 320 GB/s | 200 GB/s |
| Performance (single precision, per GPU-proc.) | 2.022 TFLOPS | 3.52 TFLOPS |
| Performance (double precision, per GPU-proc.) | 1.011 TFLOPS | 1.17 TFLOPS |
| Max power usage per GPU-processor | 225 Watt | 225 Watt |
| Greenness (SP) | 8.99 GFLOPS/Watt (?) | 15.6 GFLOPS/Watt |
| Bus Interface | PCIe 3.0 x16 | PCIe 3.0 x16 |
| Price (per GPU-processor) | $2649 | $3199 |
| Price per GFLOPS (SP) | $1.31 | $0.91 |
| Price per GFLOPS (DP) | $2.62 | $2.73 |
| Cooling | Passive (P; A = Active) | Passive |

FirePro S9000 – see next article for the S10000:

| Functionality | Xeon Phi 5110P | FirePro S9000 |
|---|---|---|
| GPU-Processor count | 1 | 1 |
| Architecture | x86 Knights Corner | Graphics Core Next |
| Memory per GPU-processor | up to 8 GB GDDR5. No ECC! | 6 GB GDDR5 (5 GB w/ ECC) |
| Memory bandwidth per GPU-processor | 320 GB/s | 264 GB/s |
| Performance (single precision, per GPU-proc.) | 2.022 TFLOPS | 3.230 TFLOPS |
| Performance (double precision, per GPU-proc.) | 1.011 TFLOPS | 0.806 TFLOPS |
| Max power usage per GPU-processor | 225 Watt | 225 Watt |
| Greenness (SP) | 8.99 GFLOPS/Watt | 14.36 GFLOPS/Watt |
| Bus Interface | PCIe 3.0 x16 | PCIe 3.0 x16 |
| Price (per GPU-processor) | $2649 | $2500 |
| Price per GFLOPS (SP) | $1.31 | $0.77 |
| Price per GFLOPS (DP) | $2.62 | $3.10 |
| Cooling | Passive (P; A = Active) | Passive |
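The greenness and price-per-GFLOPS rows are derived from the raw performance, power and price rows by simple division. Here is a small sketch that recomputes them, so the derived rows can be checked or refreshed when a raw figure changes. The dictionary layout and function name are my own, and the raw numbers are copied from the two tables.

```python
# Raw figures copied from the comparison tables (GFLOPS, Watt, US dollars).
# The data layout and helper name are illustrative only.

CARDS = {
    "Xeon Phi 5110P": {"sp": 2022.0, "dp": 1011.0, "watt": 225.0, "price": 2649.0},
    "Tesla K20":      {"sp": 3520.0, "dp": 1170.0, "watt": 225.0, "price": 3199.0},
    "FirePro S9000":  {"sp": 3230.0, "dp": 806.0,  "watt": 225.0, "price": 2500.0},
}

def derive(card):
    """Compute the derived table rows for one card, rounded to 2 decimals."""
    return {
        "greenness_sp": round(card["sp"] / card["watt"], 2),      # GFLOPS/Watt
        "price_per_gflops_sp": round(card["price"] / card["sp"], 2),
        "price_per_gflops_dp": round(card["price"] / card["dp"], 2),
    }

for name, card in CARDS.items():
    d = derive(card)
    print(f"{name}: {d['greenness_sp']} GFLOPS/Watt, "
          f"${d['price_per_gflops_sp']:.2f} per SP GFLOPS, "
          f"${d['price_per_gflops_dp']:.2f} per DP GFLOPS")
```

Keep in mind that the single precision figure for the Phi is itself an assumption (double the double precision peak), so its SP rows carry the same uncertainty.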

Sources for the Phi-specifications are below.

As not all information is public, no conclusions can be drawn yet. Follow us on Twitter or LinkedIn to be notified of any update of this article and other interesting information.

Sources

Blogs of people at Intel who are into Phi and OpenCL