
POWER9 µarch

General Info
Arch Type: CPU
Designer: IBM
Manufacturer: GlobalFoundries
Introduction: August, 2017
Phase-out: 2020
Process: 14 nm
Core Configs: 4, 8, 12, 16, 20, 24

Pipeline
Type: Superscalar
OoOE: Yes
Speculative: Yes
Reg Renaming: Yes
Stages: 12-16

Instructions
ISA: Power ISA v3.0B

Cache
L1I Cache: 32 KiB/core, 8-way set associative
L1D Cache: 32 KiB/core, 8-way set associative
L2 Cache: 512 KiB/core duplex, 8-way set associative
L3 Cache: 10 MiB/core duplex, 20-way set associative

Cores
Core Names: Sforza, Monza, LaGrange

Succession: POWER8+ (predecessor), POWER10 (successor)

POWER9 is IBM's successor to POWER8: a 14 nm microarchitecture for Power-based server microprocessors, first introduced in the second half of 2017. POWER9-based processors are branded under the POWER family.

Code names

IBM introduced three flavors of POWER9.

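| SoC Codename | SoC Description | Module | Memory Channels | PCIe | XBUS | OpenCAPI |
|---|---|---|---|---|---|---|
| Nimbus | Scale Out | Sforza | 4 | 48 | 1 | ✘ |
| Nimbus | Scale Out | Monza | 8 | 34 | 1 | 48 |
| Nimbus | Scale Out | LaGrange | 8 | 42 | 2 | 16 |
| Cumulus | Scale Up | ? | Centaur | ? | ? | ? |
| Axone | Advanced I/O | ? | OMI | 48 | 3 | 48 |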

Process Technology

POWER9-based microprocessors are fabricated on GlobalFoundries's High-Performance 14 nm (14HP) FinFET Silicon-On-Insulator (SOI) process. The process was designed by IBM at what used to be their East Fishkill, New York fab which has since been sold to GlobalFoundries.

Introduction

IBM introduced the POWER9 scale out variant of POWER in December 2017. Scale up POWER9 processors were introduced in August 2018. The third variant for high I/O will be introduced in 2019.

Compatibility

Initial support for POWER9 started with Linux Kernel 4.8.

| Vendor | OS | Version | Notes |
|---|---|---|---|
| IBM | AIX | 7.? | Support |
| IBM | IBM i | ? | Support |
| Linux | Linux Kernel | 4.8 | Initial Support |
| Wind River | VxWorks | VxWorks 7.? | Support |

Compiler support

Architecture

Key changes from POWER8

- 14 nm process (from 22 nm)
  - 17-layer metal stack
  - 8,000,000,000 transistors
- Support for Power ISA v3.0
- Higher single-thread performance
- New highly modular architecture
- Pipeline
  - Shorter pipeline
    - 5 stages eliminated from fetch to compute vs POWER8
    - Roughly 5 stages were also eliminated for fixed-point operations
    - Up to 8 cycles were eliminated for floating-point operations
  - Instruction grouping at dispatch has been removed
  - Improved hazard avoidance / reduced hazard disruption
- Improved branch prediction
- Cache
  - 120 MiB NUCA L3
    - eDRAM
    - 7 TB/s on-chip bandwidth
- Hardware Acceleration
  - PowerAXON
  - Enhanced on-chip acceleration
  - Nvidia NVLink 2.0
  - CAPI 2.0
- I/O Subsystem
  - PCIe Gen4
  - Local SMP - 16 GT/s per lane interface
  - Remote SMP - 25 GT/s per lane interface
  - 48 PCIe lanes
  - IBM's SMP connect for their scale-up systems
  - Also available for the accelerators
- Virtualization
  - QoS assistance
  - New Interrupt architecture
  - Workload-optimized frequency
  - Hardware enforced trusted execution




Memory Hierarchy

- Cache
  - L1I Cache
    - 32 KiB, 8-way set associative
    - 128-byte lines (broken into four 32-byte sectors)
    - Per SMT4 Core
    - Critical-sector-first reload policy
  - L1D Cache
    - 32 KiB, 8-way set associative
    - 128-byte cache line with support for 64-byte sectors
    - Per SMT4 Core
    - Pseudo-LRU replacement policy
  - L2 Cache
    - 512 KiB, 8-way set associative
    - 128-byte line
    - Per core pair
    - Inclusive of L1I/L1D
  - L3 Cache
    - 120 MiB eDRAM
    - 10 MiB/core pair
    - 12 chunks (regions) of 10 MiB
    - 20-way set associative
    - 7 TB/s on-chip bandwidth
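The cache geometries listed above imply a set count for each level, since sets = capacity / (ways × line size). The following is a minimal Python sketch of that arithmetic, not IBM's documented indexing scheme. It assumes a 128-byte line for the L3 (the text gives a line size only for L1 and L2) and treats each 10 MiB L3 region as an independent cache; the helper name `num_sets` is purely illustrative.

```python
KIB = 1024
MIB = 1024 * KIB


def num_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Return capacity / (associativity * line size), requiring an exact fit."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


# (capacity, ways, line size) taken from the list above.
# The 128-byte L3 line size is an assumption, not stated in the text.
CACHES = {
    "L1I": (32 * KIB, 8, 128),
    "L1D": (32 * KIB, 8, 128),
    "L2": (512 * KIB, 8, 128),
    "L3 region": (10 * MIB, 20, 128),
}

SETS = {name: num_sets(*geometry) for name, geometry in CACHES.items()}

if __name__ == "__main__":
    for name, sets in SETS.items():
        print(f"{name}: {sets} sets")
```

Under these assumptions the L1 caches come out at 32 sets, the L2 at 512 sets, and each L3 region at 4096 sets.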



Overview

POWER9 succeeds POWER8, introducing many core enhancements as well as large architectural changes. POWER9 has taken a highly modular design approach, with the same design supporting up to 12 cores with 96 threads (SMT8) or up to 24 cores with 96 threads (SMT4). IBM offers POWER9 as both scale up and scale out solutions. In total, there are four targeted chip implementations (24C/SO, 24C/SU, 12C/SO, and 12C/SU).

POWER9 comes in two flavors: scale out (SO) and scale up (SU). The scale-out variations are designed for traditional datacenter clusters using single-socket and dual-socket setups. The scale-up variations are designed for NUMA servers with four or more sockets, supporting large amounts of memory capacity and throughput.
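The core and thread counts quoted in this overview are related by threads = cores × SMT ways, and the four chip implementations are the cross product of the two core configurations with the two targets. A small Python sketch of that bookkeeping follows; the names `SMT_WAYS`, `hardware_threads`, and `IMPLEMENTATIONS` are illustrative and not IBM terminology.

```python
from itertools import product

# Hardware threads per core for each SMT mode.
SMT_WAYS = {"SMT4": 4, "SMT8": 8}


def hardware_threads(cores: int, smt_mode: str) -> int:
    """Total hardware threads for a chip with the given core count and SMT mode."""
    return cores * SMT_WAYS[smt_mode]


# The two maximum core configurations described in the overview.
CORE_OPTIONS = [(24, "SMT4"), (12, "SMT8")]
TARGETS = ["SO", "SU"]  # scale out, scale up

# Every core configuration paired with every target.
IMPLEMENTATIONS = [
    f"{cores}C/{target}"
    for (cores, _mode), target in product(CORE_OPTIONS, TARGETS)
]

if __name__ == "__main__":
    for cores, mode in CORE_OPTIONS:
        print(f"{cores} cores x {mode}: {hardware_threads(cores, mode)} threads")
    print(IMPLEMENTATIONS)
```

Both maximum configurations land on the same 96 hardware threads per chip, which is why the two core types can share one modular design.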

Scale out

The scale-out design comes in two variations, a 12-core SMT8 model and a 24-core SMT4 model. The SMT4 model is optimized for the Linux ecosystem, whereas the SMT8 model is said to be optimized for the PowerVM ecosystem (AIX / IBM i customers). These models support up to 8 channels of DDR4 memory, for up to 4 TiB of DDR4-2667 memory per socket, and offer up to 120 GiB/s of sustained bandwidth.
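To put the sustained-bandwidth figure in context, it can be compared with the theoretical peak of eight DDR4-2667 channels. The sketch below assumes the standard 64-bit (8-byte) DDR4 data bus per channel, which the text does not state, and takes the 120 GiB/s figure at face value in binary units. The resulting ratio is only a rough indication, not a figure published by IBM.

```python
GIB = 2**30


def peak_bandwidth_bytes(channels: int, transfers_per_s: int, bus_bytes: int = 8) -> int:
    """Theoretical peak in bytes/s: channels * transfers per second * bytes per transfer."""
    return channels * transfers_per_s * bus_bytes


# 8 channels of DDR4-2667; the 8-byte bus width is an assumption (standard DDR4).
PEAK = peak_bandwidth_bytes(channels=8, transfers_per_s=2_667_000_000)

# Sustained figure quoted in the text.
SUSTAINED = 120 * GIB

# Fraction of the theoretical peak that the sustained figure represents.
EFFICIENCY = SUSTAINED / PEAK

if __name__ == "__main__":
    print(f"peak:       {PEAK / 1e9:.1f} GB/s")
    print(f"sustained:  {SUSTAINED / 1e9:.1f} GB/s")
    print(f"efficiency: {EFFICIENCY:.0%}")
```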

Scale-out processors have 48 PowerAXON lanes (x48) and come with two SMP links.

Scale up

The POWER9 scale-up design targets IBM's enterprise servers and comes in two variations, a 12-core SMT8 model and a 24-core SMT4 model. The SMT4 model is optimized for the Linux ecosystem, whereas the SMT8 model is said to be optimized for the PowerVM ecosystem (AIX / IBM i customers). POWER9 inherits the buffered memory architecture first introduced with POWER8. It has two memory controllers, each capable of driving four differential memory interface (DMI) channels, for eight channels in total. Each channel has a maximum signaling rate of 9.6 GT/s for a sustained bandwidth of up to 28.8 GB/s. Each DMI channel connects to one dedicated Centaur memory buffer chip which, in turn, provides four DDR4 memory channels running at up to 3200 MT/s as well as 16 MiB of L4 cache. All in all, POWER9 scale-up can use eight buffered memory channels to access up to 32 channels of DDR4 memory, with an additional 128 MiB of level 4 cache.
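The totals in the paragraph above follow from multiplying through the buffered-memory topology. A minimal Python sketch of that arithmetic is given below. The class and field names are illustrative, and the aggregate bandwidth is simply eight times the per-channel sustained figure, not a number taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUpMemory:
    """Buffered-memory topology as described in the text (names are illustrative)."""

    controllers: int = 2
    dmi_per_controller: int = 4
    ddr4_per_buffer: int = 4          # DDR4 channels behind each Centaur buffer
    l4_mib_per_buffer: int = 16       # L4 cache on each Centaur buffer
    dmi_sustained_gb_s: float = 28.8  # sustained bandwidth per DMI channel

    @property
    def dmi_channels(self) -> int:
        # One Centaur buffer chip per DMI channel.
        return self.controllers * self.dmi_per_controller

    @property
    def ddr4_channels(self) -> int:
        return self.dmi_channels * self.ddr4_per_buffer

    @property
    def l4_mib(self) -> int:
        return self.dmi_channels * self.l4_mib_per_buffer

    @property
    def aggregate_gb_s(self) -> float:
        # Derived by simple multiplication; an upper bound, not a quoted figure.
        return self.dmi_channels * self.dmi_sustained_gb_s


if __name__ == "__main__":
    mem = ScaleUpMemory()
    print(mem.dmi_channels, mem.ddr4_channels, mem.l4_mib, mem.aggregate_gb_s)
```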

Scale-up processors have a different set of I/O interfaces. The two memory controllers drive eight memory-agnostic interfaces, and the processors come with twice as many PowerAXON lanes (x96) as the scale-out parts, along with three SMP links.

Slice Design

Execution Slice Microarchitecture is POWER9's entirely new, refactored modular core design. The same modules are used to build both the SMT4 and SMT8 cores (and could, in theory, scale to higher thread counts, although that is not offered in this iteration). These modules let IBM address the various processor models and their different configurations, such as bandwidth and line size (from 128-byte to 64-byte sectors).

A slice is the basic 64-bit computing block, incorporating a single Vector and Scalar Unit (VSU) coupled with a Load/Store Unit (LSU). The VSU has a heterogeneous mix of computing capabilities, including integer and floating-point units supporting scalar and vector operations. IBM claims this setup allows for higher utilization of resources while providing efficient exchange of data between the individual slices. Two slices coupled together make up a super-slice, a 128-bit POWER9 physical design building block. Two super-slices, together with an Instruction Fetch Unit (IFU) and an Instruction Sequencing Unit (ISU), form a single POWER9 SMT4 core. The SMT8 variant is effectively two SMT4 units.
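The composition described above (slice, super-slice, SMT4 core, SMT8 core) can be written out as a short Python sketch. The constant and function names are illustrative only. The model counts building blocks and deliberately says nothing about how the IFU and ISU are shared in an SMT8 core, since the text does not specify that.

```python
SLICE_BITS = 64             # one slice is a 64-bit computing block
SLICES_PER_SUPER_SLICE = 2  # two slices form a 128-bit super-slice
SUPER_SLICES_PER_SMT4 = 2   # two super-slices form an SMT4 core
THREADS_PER_SMT4 = 4


def core_layout(smt4_units: int) -> dict:
    """Count the building blocks in a core made of `smt4_units` SMT4 units."""
    super_slices = smt4_units * SUPER_SLICES_PER_SMT4
    slices = super_slices * SLICES_PER_SUPER_SLICE
    return {
        "threads": smt4_units * THREADS_PER_SMT4,
        "slices": slices,
        "super_slices": super_slices,
        "super_slice_bits": SLICE_BITS * SLICES_PER_SUPER_SLICE,
    }


SMT4 = core_layout(smt4_units=1)
SMT8 = core_layout(smt4_units=2)  # "effectively two SMT4 units"

if __name__ == "__main__":
    print("SMT4:", SMT4)
    print("SMT8:", SMT8)
```

An SMT4 core therefore contains four 64-bit slices arranged as two 128-bit super-slices, and an SMT8 core contains twice that.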

Acceleration Platform (POWERAccel)

POWERAccel is the collective name for the interfaces and acceleration protocols provided by the POWER microarchitecture. POWER9 offers two sets of acceleration attachments: PCIe Gen4, which offers 48 lanes at 192 GiB/s of duplex bandwidth, and a new 25G link, which offers an additional 48 lanes delivering up to 300 GiB/s of duplex bandwidth. On top of the two physical interfaces is a set of open standard protocols that are integrated onto those signaling interfaces. The four prominent standards are:

CAPI 2.0 - POWER9 introduces CAPI 2.0 over PCIe, which quadruples the bandwidth of the original CAPI protocol offered in POWER8.

New CAPI - A new interface that runs on top of the POWER9 25G link (300 GiB/s), designed for CPU-accelerator applications.

NVLink 2.0 - High bandwidth and integration between the GPU and CPU.

On-Chip Acceleration - An array of accelerators offered by the POWER9 architecture itself:
- 1x GZip
- 2x 842 Compression
- 2x AES/SHA
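The duplex bandwidth figures quoted for the two physical interfaces can be reproduced from lane count and signaling rate. The sketch below assumes one bit per transfer per lane, counts both directions, and ignores line-encoding and protocol overhead, so it yields raw rates. It uses 16 GT/s for PCIe Gen4 and 25 GT/s for the 25G link. Note that the results match the quoted 192 and 300 numerically when read in decimal gigabytes per second.

```python
def duplex_gb_per_s(lanes: int, gigatransfers_per_s: float) -> float:
    """Raw duplex bandwidth in GB/s, assuming 1 bit per transfer per lane."""
    per_direction = lanes * gigatransfers_per_s / 8  # bits -> bytes
    return 2 * per_direction                         # both directions


# 48 lanes each; overheads such as 128b/130b encoding are deliberately ignored.
PCIE_GEN4 = duplex_gb_per_s(lanes=48, gigatransfers_per_s=16)
LINK_25G = duplex_gb_per_s(lanes=48, gigatransfers_per_s=25)

if __name__ == "__main__":
    print(f"PCIe Gen4 x48: {PCIE_GEN4} GB/s duplex")
    print(f"25G link x48:  {LINK_25G} GB/s duplex")
```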



Pipeline

POWER9's modular design allowed IBM to reduce fetch-to-compute latency by 5 cycles. A similar number of cycles was also cut from fetch to retire for fixed-point operations, and an additional 8 cycles were cut from fetch to retire for floating-point instructions. POWER9 further increased fusion and reduced the number of instructions cracked (POWER handles complex instructions by 'cracking' them into two or three simple µOPs). The instruction grouping at dispatch that POWER8 performed has also been entirely removed from POWER9.

Pipeline diagram stage labels: B0, B1, RES, IF, IC, D1, D2, Crack/Fuse, PD0, PD1, XFER, MAP, VS0, VS1, F2, F3, F4, F5, LS0, LS1, AGEN, BRD, CA, FMT, CA

SMT4 core

- Fetch/Branch
  - 32 KiB L1I$
  - 8 fetch, 6 decode
  - 1x branch execution
- Slices issue VSU & AGEN
  - 4x scalar-64b / 2x vector-128b
  - 4x load/store AGEN
- VSU Pipe
  - 4x ALU
  - 4x FP + FX-MUL + Complex (64b)
  - 2x Permute (128b)
  - 2x Quad Fixed (128b)
  - 2x Fixed Divide (64b)
  - 1x Quad FP & Decimal FP
  - 1x Cryptography
- LSU Slices
  - 32 KiB L1D$
  - Up to 4 DW Load or Store

Performance Claims

IBM claims a range of performance improvements for a wide array of workloads. The graph below (provided by IBM) compares POWER9 performance using POWER8 as a baseline. The graph represents a scale-out model of similar specs at a constant frequency.

Die

Scale out

GlobalFoundries 14 nm FinFET on SOI Process

17-layer metal stack

8,000,000,000 transistors

15 miles of wire

693.37 mm² die size

25.228 mm x 27.48416 mm
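As a quick consistency check, the stated die area follows from the stated dimensions, and the transistor count gives an average density. The sketch below is plain arithmetic on the figures above. The density is a derived average over the whole die, not a figure quoted by IBM or GlobalFoundries.

```python
WIDTH_MM = 25.228
HEIGHT_MM = 27.48416
TRANSISTORS = 8_000_000_000

# Area from the quoted dimensions; should agree with the quoted 693.37 mm².
AREA_MM2 = WIDTH_MM * HEIGHT_MM

# Average density in millions of transistors per mm² (derived, not quoted).
DENSITY_MTR_PER_MM2 = TRANSISTORS / AREA_MM2 / 1e6

if __name__ == "__main__":
    print(f"area:    {AREA_MM2:.2f} mm^2")
    print(f"density: {DENSITY_MTR_PER_MM2:.2f} MTr/mm^2")
```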





Scale up

GlobalFoundries 14 nm FinFET on SOI Process

17-layer metal stack

8,000,000,000 transistors

15 miles of wire

693.37 mm² die size

25.228 mm x 27.48416 mm





All POWER9 Processors

List of POWER9-based Processors

| Model | Launched | Codename | Cores | Threads | L2$ | L3$ | TDP | Frequency | Turbo |
|---|---|---|---|---|---|---|---|---|---|
| 02CY296 | November 2017 | Sforza | 22 | 88 | 5.5 MiB | 110 MiB | 190 W | 2.75 GHz | 3.8 GHz |
| 02CY227 | November 2017 | Sforza | 22 | 88 | 5.5 MiB | 110 MiB | 190 W | 2.6 GHz | 3.8 GHz |
| 02CY414 | November 2017 | Sforza | 22 | 88 | 5.5 MiB | 110 MiB | 160 W | 2.25 GHz | 3.8 GHz |
| 02CY228 | November 2017 | Sforza | 20 | 80 | 5 MiB | 100 MiB | 190 W | 2.7 GHz | 3.8 GHz |
| 02CY415 | November 2017 | Sforza | 20 | 80 | 5 MiB | 100 MiB | 160 W | 2.4 GHz | 3.8 GHz |
| 02CY416 | November 2017 | Sforza | 18 | 72 | 4.5 MiB | 90 MiB | 130 W | 2.25 GHz | 3.8 GHz |
| 02CY489 | November 2017 | Sforza | 18 | 72 | 4.5 MiB | 90 MiB | 190 W | 2.8 GHz | 3.8 GHz |
| 02AA986 | November 2017 | Sforza | 16 | 64 | 4 MiB | 80 MiB | 190 W | 2.9 GHz | 3.8 GHz |
| 02CY417 | November 2017 | Sforza | 16 | 64 | 4 MiB | 80 MiB | 130 W | 2.3 GHz | 3.8 GHz |
| 02CY230 | November 2017 | Sforza | 16 | 64 | 4 MiB | 80 MiB | 190 W | 2.9 GHz | 3.8 GHz |
| 02CY771 | November 2017 | Sforza | 12 | 48 | 3 MiB | 60 MiB | 105 W | 2.2 GHz | 3.8 GHz |
| 02CY089 | November 2017 | Sforza | 8 | 32 | 4 MiB | 80 MiB | 160 W | 3.5 GHz | 3.8 GHz |
| 02CY297 | November 2017 | Sforza | 4 | 16 | 2 MiB | 40 MiB | 90 W | 3.2 GHz | 3.8 GHz |

Count: 13
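The per-part cache totals in the list above can be cross-checked against the per-core-pair figures given earlier (512 KiB of L2 and 10 MiB of L3 per core pair). The sketch below divides each total by the core count. The tuple layout and the name `per_core` are illustrative. One possible reading of the output, not confirmed by the source, is that the 4-core and 8-core parts expose a full L2/L3 slice to each active core.

```python
# (model, cores, threads, L2 MiB, L3 MiB) copied from the processor list.
PARTS = [
    ("02CY296", 22, 88, 5.5, 110),
    ("02CY227", 22, 88, 5.5, 110),
    ("02CY414", 22, 88, 5.5, 110),
    ("02CY228", 20, 80, 5.0, 100),
    ("02CY415", 20, 80, 5.0, 100),
    ("02CY416", 18, 72, 4.5, 90),
    ("02CY489", 18, 72, 4.5, 90),
    ("02AA986", 16, 64, 4.0, 80),
    ("02CY417", 16, 64, 4.0, 80),
    ("02CY230", 16, 64, 4.0, 80),
    ("02CY771", 12, 48, 3.0, 60),
    ("02CY089", 8, 32, 4.0, 80),
    ("02CY297", 4, 16, 2.0, 40),
]


def per_core(parts):
    """Map model -> (threads per core, L2 KiB per core, L3 MiB per core)."""
    return {
        model: (threads // cores, l2_mib * 1024 / cores, l3_mib / cores)
        for model, cores, threads, l2_mib, l3_mib in parts
    }


PER_CORE = per_core(PARTS)

if __name__ == "__main__":
    for model, (smt, l2_kib, l3_mib) in PER_CORE.items():
        print(f"{model}: SMT{smt}, {l2_kib:.0f} KiB L2/core, {l3_mib:.0f} MiB L3/core")
```

Every listed part works out to four threads per core. Parts with 12 or more cores show 256 KiB of L2 and 5 MiB of L3 per core, which is half of a core-pair slice, while the 8-core and 4-core parts show 512 KiB and 10 MiB per core.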

Bibliography

IBM, IEEE Hot Chips 28 Symposium (HCS) 2016.

IBM, IEEE Hot Chips 30 Symposium (HCS) 2018.