Researchers with the Moscow Center for SPARC Technologies (MCST) have announced that a new, homegrown CPU is ready for the mass market. Dubbed the Elbrus-4C, the new chip is based on a VLIW (Very Long Instruction Word) architecture and includes the ability to emulate x86 instructions. Most of our discussions at ExtremeTech involve architectures from AMD, Intel, or ARM, but this is an interesting opportunity to look at something altogether different.

The Elbrus-4C is similar to Intel’s Itanium in several ways. Itanium had its own architecture, dubbed EPIC (Explicitly Parallel Instruction Computing), but like Elbrus, its design was rooted in VLIW principles and approaches.

VLIW: The road not taken

Very Long Instruction Word processors are designed to execute many instructions simultaneously. A modern out-of-order processor, like Intel’s Haswell, can execute up to eight micro-operations (uops) per clock cycle under optimum conditions. The Elbrus-4C, in contrast, can execute up to 23 instructions per cycle (again, under optimum conditions). The enormous difference in peak IPC is meant to offset the frequency gap: the Elbrus-4C is clocked at 800MHz, compared with 3GHz-4.4GHz for a modern Haswell.
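To make the trade-off concrete, here is a rough back-of-the-envelope sketch that multiplies issue width by clock speed. The function name is ours, the 3GHz Haswell clock is simply the low end of the range quoted above, and uops and VLIW instructions are not strictly the same unit of work, so treat the result as an illustration rather than a benchmark.

```python
def peak_ops_per_second(ops_per_cycle: int, clock_hz: float) -> float:
    """Upper bound on operations issued per second: issue width times clock."""
    return ops_per_cycle * clock_hz


# Figures quoted in the article. Assumption: Haswell at 3 GHz, the low end
# of the quoted 3GHz-4.4GHz range.
elbrus_peak = peak_ops_per_second(23, 800e6)
haswell_peak = peak_ops_per_second(8, 3.0e9)

print(f"Elbrus-4C peak: {elbrus_peak / 1e9:.1f} G ops/s")
print(f"Haswell peak:   {haswell_peak / 1e9:.1f} G ops/s")
print(f"Elbrus/Haswell: {elbrus_peak / haswell_peak:.2f}")
```

Even with nearly three times the issue width, the Elbrus-4C's theoretical peak lands below Haswell's at these clocks, and both figures assume every issue slot is filled on every cycle.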

The principles of VLIW processors are very different from those that drive modern CPUs. Back when the VLIW approach debuted, in the early 1980s, many of the approaches now used by even low-power chips like ARM’s Cortex family were prohibitively expensive in terms of power, die size, or memory costs. Techniques like out-of-order execution, which debuted in consumer products with Intel’s Pentium Pro, were initially very expensive and required large transistor budgets.

The goal of a VLIW design is to simplify the so-called front end of the processor, where instructions are decoded and scheduled for execution. Even today, there’s a limit to how wide a modern out-of-order chip like Haswell can realistically be. If you want to decode and dispatch more instructions per cycle, you need to track more data related to which instructions are executing in which pipelines. The memory and power requirements necessary to keep more instructions in flight scale far faster than the amount of work you can theoretically perform per cycle.

VLIW sidesteps this problem by shifting the heavy lifting of parallelization to the compiler and simplifying the on-chip resources. It’s an attractive idea: use the CPU for what it’s really, really good at (executing code), and shift the tricky task of extracting parallelism to a flexible software front end that can theoretically be upgraded and improved much more easily than the CPU architecture. Need better performance from your VLIW CPU? Build a better compiler that can extract more parallelism, and you’ve got it, with no need to swap out your enterprise-class CPUs for newer variants.
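As a toy illustration of what "shifting the work to the compiler" means, the sketch below packs operations into fixed-width bundles ahead of time using simple greedy list scheduling. It is not MCST's compiler or any real VLIW toolchain: the function name, the six-operation program, and the assumption that every operation takes one cycle are all hypothetical.

```python
from typing import Dict, List, Set


def pack_bundles(deps: Dict[str, Set[str]], width: int) -> List[List[str]]:
    """Greedy list scheduling: each bundle holds up to `width` operations
    whose dependencies were all completed by earlier bundles.
    Assumes every operation has a latency of one cycle."""
    done: Set[str] = set()
    remaining = list(deps)  # dict order doubles as program order
    bundles: List[List[str]] = []
    while remaining:
        # Pick ready operations in program order, up to the issue width.
        bundle = [op for op in remaining if deps[op] <= done][:width]
        if not bundle:
            raise ValueError("dependency cycle or unknown dependency")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
    return bundles


# Hypothetical program: operation name -> operations it depends on.
program = {
    "load_a": set(),
    "load_b": set(),
    "load_c": set(),
    "add_ab": {"load_a", "load_b"},
    "mul_c2": {"load_c"},
    "store": {"add_ab", "mul_c2"},
}

for cycle, bundle in enumerate(pack_bundles(program, width=4)):
    print(cycle, bundle)
```

With a width of four, the six operations fit into three bundles, and no bundle is completely full. The dependencies, not the hardware width, set the limit, which is why a wide VLIW machine lives or dies by how much independent work its compiler can find.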

Unfortunately, the compilers required to make VLIW viable for general-purpose compute never materialized. Moving the burden of code optimization to the software stack might have been a great idea in theory 30 years ago, when out-of-order execution was extremely expensive, but the hardware improved much faster than the software on this front.

The last 30 years have seen multiple VLIW companies come and go. Early supercomputer startups Cydrome and Multiflow both failed after a few years, and Intel’s i860 (a CPU once floated as a possible x86 replacement) saw some success in the DSP market but never became popular. Intel’s Itanium architecture, dubbed EPIC, is related to VLIW but includes some features that VLIW CPUs typically lacked.

Performance and x86 emulation

In native VLIW code, the Elbrus-4C can perform a maximum of 50 GFLOPS in single precision (25 GFLOPS in double precision). A modern Core i7-4770K, in contrast, hits around 176 GFLOPS when running AVX2-optimized code across all four cores, and roughly 50 GFLOPS per core, according to optimized tests performed by Puget Systems. While the Elbrus-4C delivers only a fraction of a modern Intel core’s performance, the chip is built on 65nm technology and operates at a fraction of the Intel CPU’s clock, without any support for the SIMD instruction sets that give Intel a further advantage. Relative to its process node and clock frequency, the claimed 50 GFLOPS for the quad-core chip is reasonably good.
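One way to see why the numbers look respectable for the clock speed is to normalize them to floating-point operations per core per cycle. The sketch below does that arithmetic. The helper name is ours, and the 3.5GHz figure is an assumption (the Core i7-4770K's base clock, ignoring turbo); the article also does not say which precision the Intel measurement used, so the comparison is approximate.

```python
def flops_per_core_per_cycle(gflops: float, cores: int, clock_ghz: float) -> float:
    """Normalize a chip-wide GFLOPS figure to FLOPs per core per clock cycle."""
    return gflops / cores / clock_ghz


# Elbrus-4C figures from the article: four cores at 800 MHz.
elbrus_sp = flops_per_core_per_cycle(50.0, 4, 0.8)
elbrus_dp = flops_per_core_per_cycle(25.0, 4, 0.8)

# Assumption: 3.5 GHz base clock for the Core i7-4770K, no turbo.
intel = flops_per_core_per_cycle(176.0, 4, 3.5)

print(f"Elbrus-4C single precision: {elbrus_sp:.2f} FLOPs/core/cycle")
print(f"Elbrus-4C double precision: {elbrus_dp:.2f} FLOPs/core/cycle")
print(f"Core i7-4770K (measured):   {intel:.2f} FLOPs/core/cycle")
```

Per cycle, the two chips land in the same general range. Most of the gap in the headline GFLOPS figures comes from clock speed rather than from the amount of work done per tick.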

The diagram below shows how Elbrus can either convert x86 code to run on its own platform or execute native VLIW instructions (thanks to Anshel Sag of Moor Insights & Strategy for the initial Russian translation, and to eonvee375 and Aaronb for replacing the image text with English):

Like the old Transmeta Efficeon, the Elbrus can emulate x86 instructions, but data on how it performs against a modern core is limited. Old slides of the Elbrus-2C+ show that chip’s performance relative to the Atom D510 (a dual-core Pineview variant clocked at 1.66GHz) and what appears to be a single-core Pentium M clocked at 1GHz.

SPEC CPU2000 is utterly outdated, but what we see here still tells us something. At 500MHz, the old Elbrus-2C+ was significantly faster than either the Atom or Pentium M in x86 floating-point code. In integer code, the Elbrus-2C+ was all over the map, sometimes beating the dual-core Intel Atom, often losing out to the Pentium M. (It’s an open question just how well threaded workloads this old realistically were.)

The Elbrus-4C reportedly increases the L1 instruction cache to 128KB, up from the Elbrus-2C+’s 64KB. It also boosts the L2 cache size to 8MB, shared between all four cores. The triple-channel DDR3 controller offers up to 38.4GB/s of memory bandwidth.
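The bandwidth figure can be sanity-checked with simple arithmetic. The sketch below assumes DDR3-1600 (1,600 million transfers per second) and standard 64-bit channels. Neither detail is stated in the announcement, but it is one combination that yields exactly the quoted number.

```python
def peak_bandwidth_gb_s(channels: int, transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).
    bus_bytes=8 assumes a standard 64-bit DDR3 channel."""
    return channels * transfers_per_s * bus_bytes / 1e9


# Assumption: three channels of DDR3-1600; the source only gives the total.
bandwidth = peak_bandwidth_gb_s(channels=3, transfers_per_s=1600e6)
print(f"Peak bandwidth: {bandwidth:.1f} GB/s")
```

Each channel contributes 12.8GB/s under these assumptions, and three of them add up to the 38.4GB/s that MCST quotes.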

It’s not clear whether MCST intends to bring the Elbrus-4C to the mass market, or whether the chip was developed for homegrown supercomputer or military applications. Having its own chip design gives Russia a way to ensure its computers aren’t operating on “foreign” processors that might have built-in backdoors or be compromised in some other fashion. It’s worth noting that despite collectively spending billions on VLIW research and development over the past few decades, no major CPU designer managed to build a VLIW general-purpose chip for the mass market. (AMD adopted VLIW for several GPU generations before moving away from the architecture with the advent of GCN.)