Over a dozen special-purpose accelerators compatible with next-generation OpenPOWER servers that feature the Coherent Accelerator Processor Interface (CAPI) were revealed at the OpenPOWER Summit last week. These accelerators are intended to encourage the use of OpenPOWER-based machines for technical and high-performance computing. Most of the accelerators are based on Xilinx high-performance FPGAs, but some feature custom silicon.

IBM’s CAPI port is a PCIe 3.0-based interconnect designed specifically for programmable processors (ASICs, GPUs, FPGAs and the like) that lets them address the same memory space as the CPU. CAPI requires custom hardware in IBM’s POWER8 processors, called the coherent accelerator processor proxy (CAPP), as well as a POWER service layer (PSL) integrated into CAPI-supporting accelerators. The CAPP maintains a directory of cache lines held by the accelerator and snoops the processor bus on the accelerator’s behalf. The PSL performs address translations and holds the coherent data for quick access by the accelerating hardware. To work, CAPI has to be supported by the hardware, the operating system and the application in use. At present, IBM’s POWER8 CPUs, a number of accelerators, Red Hat Enterprise Linux 7.2 LE (and higher), and Ubuntu LE, as well as select programs, support CAPI.

IBM and the OpenPOWER Foundation need CAPI in order to enable a relatively simple and inexpensive way to build special-purpose accelerators for various workloads. The aim is to make POWER8-based machines viable for a variety of market segments as well as to create platforms that can process modern workloads faster.

While it is possible to enable unified memory for CPUs and co-processors using custom hardware and multiple device-driver tweaks, this requires huge investments in silicon development, complex drivers and more. By contrast, programming an FPGA (field-programmable gate array) is considerably cheaper, and CAPI gives FPGAs key heterogeneous processing capabilities. CAPI does not necessarily enable higher bandwidth between the CPU and the accelerator (after all, it is layered on top of PCI Express 3.0 and is bound by that link’s specified peak bandwidth), but according to IBM it removes overheads, improves performance and can potentially simplify the workflow for programmers. In short, CAPI is an important part of IBM’s POWER strategy in general as well as of the OpenPOWER initiative.
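To put the PCI Express 3.0 ceiling in numbers, here is a rough back-of-the-envelope sketch for x8 and x16 links, the two widths the cards in this article use. The 8 GT/s per-lane rate and 128b/130b encoding are taken from the PCIe 3.0 specification, not from IBM or the OpenPOWER announcement, and the function name is purely illustrative. Real-world throughput will be lower once packet and protocol overheads are counted.

```python
# Assumed PCIe 3.0 link parameters (from the PCIe spec, not from the article):
GT_PER_S_PER_LANE = 8e9            # 8 GT/s raw signalling rate per lane
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding


def pcie3_peak_gbytes_per_s(lanes: int) -> float:
    """Theoretical one-direction ceiling in GB/s, before protocol overhead."""
    bits_per_s = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes
    return bits_per_s / 8 / 1e9    # bits -> bytes -> gigabytes


if __name__ == "__main__":
    for lanes in (8, 16):
        print(f"x{lanes}: {pcie3_peak_gbytes_per_s(lanes):.2f} GB/s per direction")
```

Under these assumptions an x8 card tops out a little below 8 GB/s per direction and an x16 card a little below 16 GB/s, with or without CAPI.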

At this year’s OpenPOWER Summit, IBM and its partners revealed over a dozen special-purpose CAPI-enabled FPGA-based accelerators. This shows that the OpenPOWER platform is attracting interest and investment from a range of sources. The list of developers includes BittWare, DRC, IBM, Mellanox, Xilinx and others, although judging by OpenPOWER’s press release, some chose not to publish details about their accelerators. The accelerators revealed at the conference are either available now or set to become available in the coming quarters. The devices come in the form of PCIe 3.0 x8 or x16 cards and are compatible with IBM POWER8-based servers. Some are also compatible with machines running other processors, in which case CAPI is not supported.

IBM CAPI-Compatible Accelerators

Developer: Alpha Data
Model: ADM-PCIE-8K5
Hardware: Xilinx UltraScale KU115-2 FPGA; 2×8 GB of DDR4-2400 with ECC (32 GB version can be built); dual Firefly connectors for up to 4×16 Gbps per connector
Application: Reconfigurable accelerator for custom video processing, machine learning, HPC and network acceleration applications.
Availability: Available as add-in PCIe 3.0 x8 cards.

Developer: BittWare
Model: XUSP3S
Hardware: Xilinx Virtex UltraScale 80/95/125/160/190 or Kintex UltraScale 115; 2×16 GB DDR4 ECC (64 GB version can be built), QDR memory; four QSFP28 cages for 1×400GbE, 4×100GbE, 4×40GbE, 16×25GbE, or 16×10GbE
Application: Massive data flow and packet processing.
Availability: Available as add-in PCIe 3.0 x16 cards.

Developer: DRC
Model: GraphFind
Hardware: Xilinx Kintex UltraScale KU115 FPGA
Application: Can rapidly discover relationships between people, places, events, and objects. Simultaneously identifying focal points with weighted strengths of connections.
Availability: Available as a PCIe card, or as a pre-configured appliance consisting of multiple cards.

Developer: DRC
Model: Novara
Hardware: Xilinx FPGA
Application: A search engine and an accelerator, which identifies key imprecise phrases and bit patterns using a fuzzy logic analyzer that can instantly analyze millions of messages and data streams without the need to index first. Can process up to 2.5 GB of data per second.
Availability: Available as 1U server, which contains up to four Novara cards. Servers can be clustered.

Developer: DRC
Model: Ferrara2
Hardware: Xilinx FPGA, four QSFP28 cages
Application: Encrypts and/or authenticates data using AES-256 algorithm with bit-splitting capability from Security First Corporation (SFC) at line rates up to 40 Gb/s.
Availability: Available as PCIe 3.0 x16 add-in boards for servers, communication or storage systems. Multiple Ferrara2 boards can be placed in one system.

Developer: Edico Genome
Model: DRAGEN Genomics Platform
Hardware: Xilinx Virtex-7 980T FPGA; 4×4 GB DDR3L-1866 memory
Application: Analyzes an entire human genome in 26 minutes (vs. 30 hours on general-purpose hardware). Enables healthcare providers to identify patients at higher risk for cancer before the conditions worsen.
Availability: Compatible with the IBM S822LC server. Available in a pre-configured POWER8 server.

Developer: IBM
Model: Prototype
Hardware: Xilinx Virtex UltraScale 190; 16 GB of Micron HMC memory
Application: Acceleration of in-memory computing applications.
Availability: Available as add-in PCIe 3.0 x16 cards.

Developer: IBM, Nallatech, RedisLabs, Altera
Model: IBM Data Engine for NoSQL
Hardware: IBM Power S822L server(s); IBM FlashSystem 840 or 900 all-Flash storage system(s); Altera Stratix V FPGA-based interconnection card with 10 GbE SFP+ ports by Nallatech
Application: IBM FlashSystems are attached to the POWER8 processor through the CAPI coherent attach card. Thanks to the new interconnection method, the Redis Enterprise Cluster application can issue read/write commands that eliminate 97% of the code path length. According to IBM, this enables IBM Data Engine for NoSQL to access Flash within latency levels comparable to traditional RAM-based x86 implementations.
Availability: Various configurations available.

Developer: IBM, Nallatech, Samsung, Xilinx
Model: Prototype
Hardware: Xilinx FPGA; 2×1 TB Samsung M.2 NVMe SSDs
Application: IBM Data Engine for NoSQL, which allows fast application exploitation in a smaller, in-server form-factor.
Availability: Available as add-in PCIe 3.0 x8 cards.

Developer: Mellanox
Model: ConnectX-4 VPI
Hardware: ConnectX-4 VPI
Application: ConnectX-4 adapter cards with virtual protocol interconnect (VPI) support EDR 100 Gb/s InfiniBand and 100 Gb/s Ethernet connectivity.
Availability: Available as add-in PCIe 3.0 x16 cards.

Developer: Semptian
Model: NSA-120, NSA-120B
Hardware: Xilinx Kintex UltraScale XCKU060/XVKU115; 2×4 GB or 2×8 GB DDR3-1600 memory with ECC; two SATA interfaces
Application: Network and service accelerator. Can be used in big data analysis, image recognition/processing, video encoding/decoding, data compression/decompression, data encryption/decryption, voice recognition, neural network, machine learning, network security, etc.
Availability: Available as add-in PCIe 3.0 x8 cards.

One of the important announcements at the summit was Edico Genome’s DRAGEN genomics platform, which uses an accelerator powered by the Xilinx Virtex-7 980T FPGA and equipped with 16 GB of quad-channel DDR3L-1866 memory. The platform, which is based on a 2-way IBM S822LC server, can analyze an entire genome in 26 minutes, down from approximately 30 hours on general-purpose processors. An earlier prototype was shown at SuperComputing 2015, but this appears to be the announcement of the full product.
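For a sense of scale, the quoted DRAGEN figures can be turned into a speedup factor with a few lines of arithmetic. This is only a sketch based on the two numbers above. The function names are illustrative, and the genomes-per-day estimate assumes back-to-back runs with no setup or I/O time between them, which neither Edico Genome nor IBM has claimed.

```python
BASELINE_HOURS = 30.0        # "approximately 30 hours" on general-purpose processors
ACCELERATED_MINUTES = 26.0   # quoted DRAGEN analysis time


def speedup(baseline_hours: float, accelerated_minutes: float) -> float:
    """Ratio of baseline time to accelerated time, both converted to minutes."""
    return baseline_hours * 60 / accelerated_minutes


def genomes_per_day(minutes_per_genome: float) -> float:
    """Hypothetical throughput assuming back-to-back runs with no idle time."""
    return 24 * 60 / minutes_per_genome


if __name__ == "__main__":
    print(f"Speedup: about {speedup(BASELINE_HOURS, ACCELERATED_MINUTES):.0f}x")
    print(f"Accelerated: about {genomes_per_day(ACCELERATED_MINUTES):.0f} genomes per day")
    print(f"Baseline: about {genomes_per_day(BASELINE_HOURS * 60):.1f} genomes per day")
```

Taken at face value, the quoted times work out to a speedup of roughly 69x.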

Other interesting solutions discussed at the summit include an FPGA-based accelerator for discovering relationships hidden in big data; an FPGA-powered fuzzy search engine for imprecise string searching and matching, which can analyze millions of messages and data streams without indexing; as well as various reconfigurable accelerators for HPC, Big Data, and so on. IBM also mentions that there are companies offering CAPI-enabled building blocks for FPGAs for computer vision, machine learning, and other applications. Some of those companies are startups or working in stealth mode (we do not know whether they developed their building blocks thanks to the SuperVessel program, though this is a possibility), and they may announce their products over time.

While the number of CAPI-enabled accelerators available today is not high, it is growing, which is good news for the OpenPOWER ecosystem. Another positive sign, according to IBM, is the number of China-based companies developing accelerators featuring CAPI, which shows that local companies in growing server markets are expressing interest in such solutions.