Over the last couple of years, through funding from NEDO, PEZY has been designing a series of many-core MIMD processors known as the PEZY-SCx family. Last week, the small Japanese firm, in collaboration with ExaScaler, announced that it has once again reached the number one spot on the Green500 list, this time with the ZettaScaler-2.2 supercomputer. The company had previously reached the number one spot in June 2015 and June 2016 with its 1,024-core PEZY-SC and PEZY-SCnp processors.

Powering the ZettaScaler-2.2 is the PEZY-SC2. The SC2 is a second-generation chip featuring twice as many cores: 2,048 cores with 8-way SMT, for a total of 16,384 threads. Operating at 1 GHz with 4 FLOPS per cycle per core, as with the SC, the SC2 has a peak performance of 8.192 TFLOPS (single-precision). Both prior chips were manufactured on TSMC’s 28HPC+ process. To enable the considerably higher core count within reasonable power consumption, PEZY decided to skip a generation and go directly to TSMC’s 16FF+ technology.
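The peak figure follows directly from the numbers quoted above. As a rough sanity check, here is a minimal sketch in Python; the function name `peak_tflops` is ours and purely illustrative, not anything from PEZY's tooling.

```python
def peak_tflops(cores: int, flops_per_cycle: int, clock_ghz: float) -> float:
    """Peak throughput in TFLOPS: cores x FLOPs per cycle per core x clock (GHz) / 1000."""
    return cores * flops_per_cycle * clock_ghz / 1000.0


# Figures as quoted in the article for the PEZY-SC2.
SC2_CORES = 2048
SC2_SMT_WAYS = 8
SC2_FLOPS_PER_CYCLE = 4
SC2_CLOCK_GHZ = 1.0

sc2_peak = peak_tflops(SC2_CORES, SC2_FLOPS_PER_CYCLE, SC2_CLOCK_GHZ)
sc2_threads = SC2_CORES * SC2_SMT_WAYS

print(f"PEZY-SC2 peak (single-precision): {sc2_peak:.3f} TFLOPS")
print(f"PEZY-SC2 hardware threads: {sc2_threads:,}")
```

This reproduces both the 8.192 TFLOPS and the 16,384-thread figures.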

The SC incorporated two ARM926 cores. While that was sufficient for basic management and debugging, their processing power was inadequate for much more. The SC2 uses a hexa-core P-Class P6600 MIPS processor which shares the same memory address space as the PEZY cores, improving performance and reducing data transfer overhead. With the powerful MIPS management cores, it is now also possible to eliminate the Xeon host processor entirely. However, PEZY has not done so yet.

Feeding the Beast

One of the bigger changes PEZY has made is getting rid of the “prefecture” units which were used as synchronization units for preparing the L3. In the SC, the chip was divided into four “prefectures”, each with its own L3 cache and 16 “cities”, for a total of 256 cores per prefecture.

The SC2 eliminates all of this and instead introduces a unified last level cache (LLC) which is shared by all the cores as well as the six MIPS64 cores. Additionally, half of the memory controllers were removed on the SC2 and the remaining four were upgraded to support 64-bit DDR4-3200 for an aggregate memory bandwidth of 102.4 GB/s.
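The 102.4 GB/s figure can be derived from the channel count, bus width, and transfer rate. A minimal sketch, assuming the usual convention that DDR4-3200 means 3,200 mega-transfers per second per channel (the helper name `ddr_bandwidth_gbs` is ours):

```python
def ddr_bandwidth_gbs(channels: int, bus_width_bits: int, mega_transfers_per_s: int) -> float:
    """Aggregate peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8          # 64-bit bus -> 8 bytes per transfer
    mb_per_s = channels * bytes_per_transfer * mega_transfers_per_s
    return mb_per_s / 1000.0                         # MB/s -> GB/s


# Four 64-bit DDR4-3200 controllers, as described for the PEZY-SC2.
sc2_ddr4 = ddr_bandwidth_gbs(channels=4, bus_width_bits=64, mega_transfers_per_s=3200)
print(f"PEZY-SC2 DDR4 bandwidth: {sc2_ddr4:.1f} GB/s")
```

Each channel contributes 25.6 GB/s, so four channels give the quoted aggregate.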

First Use of TCI

In place of the four controllers that were removed, PEZY added four custom TCI ports. ThruChip Interface (TCI), developed at Keio University in Japan, is a 3D packaging interconnect technology that serves as an alternative to through-silicon vias (TSVs). Instead of using vertical interconnect accesses to connect multiple dies, TCI is a wireless near-field inductive coupling technology. That is, TCI uses a magnetic field that penetrates the semiconductor, with no need for a physical medium such as a through-die electrical conductor.

The SC2 has four custom TCI-DRAM interfaces, allowing it to achieve an extremely high bandwidth of 512 GB/s per port, for a total aggregate bandwidth of roughly 2 TB/s. PEZY uses a TCI 3D DRAM chip called the UM-1 that is being developed by an affiliated company, UltraMemory. UltraMemory was founded in November 2013 for the purpose of designing ultra-wide DRAM using TCI. This is the first time a commercial chip has utilized TCI technology and, going purely by the “UM-1” part number, we can speculate that this is also UltraMemory’s first model based on this technology.
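To put the TCI numbers in perspective, the short sketch below compares the aggregate TCI-DRAM bandwidth with the 102.4 GB/s DDR4 figure quoted earlier in the article. The variable names are ours and the comparison is only a back-of-the-envelope check on peak numbers, not a measured result.

```python
# Per-port and port-count figures as quoted in the article.
TCI_PORTS = 4
TCI_GBS_PER_PORT = 512.0
DDR4_AGGREGATE_GBS = 102.4   # four 64-bit DDR4-3200 channels

tci_aggregate_gbs = TCI_PORTS * TCI_GBS_PER_PORT
tci_vs_ddr4 = tci_aggregate_gbs / DDR4_AGGREGATE_GBS

print(f"TCI-DRAM aggregate: {tci_aggregate_gbs:.0f} GB/s "
      f"({tci_aggregate_gbs / 1000:.3f} TB/s)")
print(f"Ratio to DDR4 aggregate: {tci_vs_ddr4:.0f}x")
```

On paper, the four TCI ports offer about twenty times the peak bandwidth of the four DDR4 channels.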

To accommodate 2,048 cores, the number of cities was doubled to 128. The new high-level block diagram of the PEZY-SC2 chip should look similar to this:

Low-precision for Deep Learning

In areas such as deep learning and AI, high-precision calculations are not always necessary. The PEZY-SC did not support 16-bit floating point operations. With the SC2, the processing elements were enhanced with support for 16-bit half-precision floating point arithmetic in an attempt to increase the chip's adaptability in the field of deep learning.

ZettaScaler-2.2

Announced last week, the ZettaScaler-2.2 (Gyoukou) features 7,056 PEZY-SC2 chips operating at a lower frequency of 700 MHz along with 45 W TDP 16-core Intel Xeon D host processors. Earlier this year, PEZY reported the die size to be roughly 620 mm², meaning yield problems would be a definite concern. Although those chips integrate 2,048 cores, only 1,984 cores are active in order to improve yield. This puts the peak performance of the currently installed chips at 5.555 TFLOPS for single-precision, roughly 32% less computational power than the theoretical maximum. The ZS-2.2 has a Linpack performance of 14.13 PFLOPS with a theoretical peak performance of 19.89 PFLOPS. The system consumed 962.3 kW, putting its performance per watt at 14.69 GFLOPS/W, surpassing the 14.11 GFLOPS/W of the TSUBAME 3.0 and placing it at rank 1 on the Green500 list. The next Top500 list is scheduled to be announced on November 12 at the 2017 SuperComputing Conference (SC17), which will be held in Denver, Colorado.
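The installed-chip and efficiency figures can be cross-checked from the numbers in the paragraph above. The sketch below is only a sanity check using the rounded values quoted in this article; the names are ours, and small differences from the official figures are to be expected because the inputs are rounded.

```python
def peak_tflops(active_cores: int, flops_per_cycle: int, clock_ghz: float) -> float:
    """Peak throughput in TFLOPS for a given core count and clock."""
    return active_cores * flops_per_cycle * clock_ghz / 1000.0


# Full-spec chip versus the chips as installed (1,984 active cores at 700 MHz).
full_chip = peak_tflops(2048, 4, 1.0)
installed = peak_tflops(1984, 4, 0.7)
shortfall = 1.0 - installed / full_chip

# System-level figures as quoted in the article.
linpack_pflops = 14.13
peak_pflops = 19.89
power_kw = 962.3

# PFLOPS -> GFLOPS is a factor of 1e6; kW -> W is a factor of 1e3.
gflops_per_watt = (linpack_pflops * 1e6) / (power_kw * 1e3)
linpack_efficiency = linpack_pflops / peak_pflops

print(f"Installed chip peak: {installed:.3f} TFLOPS ({shortfall:.1%} below full spec)")
print(f"Performance per watt: {gflops_per_watt:.2f} GFLOPS/W")
print(f"Linpack efficiency: {linpack_efficiency:.1%}")
```

With these rounded inputs the per-chip shortfall comes out at about a third and the performance per watt at about 14.68 GFLOPS/W, in line with the quoted 14.69 GFLOPS/W.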

For every 8 PEZY-SC2 chips, there is a single 16-core Xeon D processor. With 7,056 SC2 chips, we’re looking at 882 Xeon D chips for a total of 14,013,216 cores. It’s worth noting that with just over fourteen million cores, the ZettaScaler-2.2 will have the highest core count of any supercomputer in the Top500, surpassing the Chinese supercomputer Sunway TaihuLight by over 3 million cores. Keep in mind that the performance submission deadline for the Top500 was November 1 at 23:59 Pacific Time, so it is entirely possible that PEZY managed to tune the system's performance further since the original October 26 announcement.
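The total core count above combines the active PEZY cores with the Xeon D host cores. A minimal sketch of that tally follows; the Sunway TaihuLight core count of 10,649,600 is taken from its Top500 listing rather than from this article, and the variable names are ours.

```python
SC2_CHIPS = 7056
ACTIVE_CORES_PER_SC2 = 1984      # 2,048 on die, 1,984 enabled
SC2_CHIPS_PER_HOST = 8
CORES_PER_XEON_D = 16
TAIHULIGHT_CORES = 10_649_600    # assumption: figure from the Top500 listing

xeon_hosts = SC2_CHIPS // SC2_CHIPS_PER_HOST
pezy_cores = SC2_CHIPS * ACTIVE_CORES_PER_SC2
host_cores = xeon_hosts * CORES_PER_XEON_D
total_cores = pezy_cores + host_cores
lead_over_taihulight = total_cores - TAIHULIGHT_CORES

print(f"Xeon D hosts: {xeon_hosts}")
print(f"Total cores: {total_cores:,}")
print(f"Lead over Sunway TaihuLight: {lead_over_taihulight:,} cores")
```

The tally only matches the quoted total when the 1,984 active cores per chip are used, which suggests the article's count excludes the disabled cores.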

The high efficiency can be attributed to a number of things, including the move to the more energy-efficient 16 nm FinFET process as well as the novel use of liquid immersion cooling, which reduces the chip temperature and consequently the leakage current. The ZettaScaler is an incredibly dense system. In the photo above there are 26 liquid immersion cooling tanks. However, at this time, with 7,056 PEZY-SC2 chips, only 13.8 are filled, meaning the current system is operating at roughly half of its capacity.

Future Roadmap

PEZY is not planning on stopping any time soon. Earlier this year the company laid out their future roadmap which extends into the 2020s.

PEZY-SCx Roadmap

| Chip | PEZY-SC | PEZY-SC2 | PEZY-SC3 | PEZY-SC4 |
|---|---|---|---|---|
| Process | 28 nm | 16 nm | 7 nm | 5 nm |
| Die | 412 mm² | 620 mm² | 700 mm² | 740 mm² |
| Cores | 1,024 | 2,048 | 8,192 | 16,384 |
| Voltage | 0.9 V | 0.8 V | 0.65 V | 0.55 V |
| Clock | 733 MHz | 1 GHz | 1.33 GHz | 1.6 GHz |
| Wide-IO | N/A | 4 x 1,024 bit | 8 x 2,048 bit | 8 x 4,096 bit |
| Peak Wide-IO | N/A | 2.1 TB/s | 12.2 TB/s | 24.4 TB/s |
| Efficiency | 6.7 GFLOPS/W | 15 GFLOPS/W | 40 GFLOPS/W | 60 GFLOPS/W |

The PEZY-SC3 will be introduced with the ZettaScaler-3.0 supercomputer in late 2019. PEZY expects the system to exceed 1 EFLOPS. With the help of ExaScaler, PEZY hopes to expand the system to about 100 cooling tanks, which should give you an idea of just how many of those chips it intends to use. Even with existing PEZY-SC2 processors, that’s sufficient to support over 100 million cores. PEZY also hopes to widen its TCI-DRAM interfaces and double the number of ports in order to increase memory bandwidth roughly tenfold by the time the PEZY-SC4 is introduced. Both the SC3 and SC4 are expected to replace the standard PCIe controllers with silicon photonics (likely optical PCIe). In addition to those features, PEZY has been considering the use of multi-die chips in order to further increase the number of cores.
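The "over 100 million cores" estimate can be roughly reproduced by scaling the current installation. The sketch below assumes that a 100-tank system would keep the same chip density per tank as today's 13.8 filled tanks; that density assumption is ours, and the result is only an order-of-magnitude check.

```python
CHIPS_INSTALLED = 7056
TANKS_FILLED = 13.8
PLANNED_TANKS = 100              # "about 100" per the article

# Assumption: future tanks hold as many chips as the currently filled ones.
chips_per_tank = CHIPS_INSTALLED / TANKS_FILLED
projected_chips = chips_per_tank * PLANNED_TANKS

cores_if_all_enabled = projected_chips * 2048
cores_if_yield_limited = projected_chips * 1984

print(f"Chips per tank: {chips_per_tank:.0f}")
print(f"Projected chips in {PLANNED_TANKS} tanks: {projected_chips:,.0f}")
print(f"Projected cores (2,048 per chip): {cores_if_all_enabled:,.0f}")
print(f"Projected cores (1,984 per chip): {cores_if_yield_limited:,.0f}")
```

Under that assumption, both the full 2,048-core and the yield-limited 1,984-core cases land above 100 million cores.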

Whether PEZY will succeed with their highly aggressive roadmap remains to be seen. Nonetheless, this is one company really worth keeping an eye on!