Nigel Stephens, Lead ISA Architect and Fellow, Architecture and Technology Group, Arm

This month, Arm is making available early technical details of two significant new technologies for its A-Profile architecture, both of which are designed to enhance the performance and scalability of parallel software. These new technologies are the Scalable Vector Extension version two (SVE2) and the Transactional Memory Extension (TME).

The purpose of this early disclosure is to inform and enable the OS and tools developer ecosystems, so that support is widely available by the time CPUs that deploy these new technologies become available.

SVE2 allows a wider range of software to benefit from the advanced, scalable SIMD vector technology of the original SVE architecture, announced in 2017. TME allows certain classes of multi-threaded software to be scaled more easily, from running on a few CPU cores to running on many hundreds of cores.

Scalable Vector Extension 2 (SVE2)

The first version of the Scalable Vector Extension (SVE) was not a variant of the Arm Neon instruction set. It was targeted at the high-performance computing (HPC) space, bringing many advanced vectorization technologies to Arm-based processors, which could choose to implement vectors ranging from 128 up to 2048 bits in length. Its novel vector length-agnostic programming model allows vector code to be compiled or written once, and then scaled automatically to exploit the implemented vector length, reducing software development and deployment costs.
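To make the vector length-agnostic idea concrete, here is a minimal sketch in portable C. It is a scalar emulation, not real SVE code: the function name `vla_add`, the `lanes` parameter (standing in for the vector length that the hardware would report at run time) and the choice of 32-bit lanes are all assumptions made for illustration. The point is that the same loop handles any vector length, and that a per-lane predicate covers the final partial vector, so no separate scalar tail loop is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 32-bit lanes, so 128-bit to 2048-bit
 * vectors hold between 4 and 64 lanes. */
#define MIN_LANES 4
#define MAX_LANES 64

/* Scalar emulation of a vector length-agnostic a[i] += b[i] loop.
 * `lanes` is a hypothetical stand-in for the implemented vector length;
 * the loop body never hard-codes it. */
void vla_add(float *a, const float *b, size_t n, size_t lanes)
{
    assert(lanes >= MIN_LANES && lanes <= MAX_LANES);

    /* Each outer iteration models one vector-wide operation. */
    for (size_t i = 0; i < n; i += lanes) {
        for (size_t k = 0; k < lanes; ++k) {
            /* Emulated predicate: lane k is active only while i + k < n,
             * which masks off the unused lanes of the last vector. */
            if (i + k < n)
                a[i + k] += b[i + k];
        }
    }
}
```

Calling `vla_add` with `lanes` set to 4 or to 16 produces the same results for the same input, which is the property that lets one binary run unchanged on implementations with different vector lengths.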

SVE2 builds on the foundations of SVE to bring the benefits of scalable SIMD vector performance and advanced auto-vectorization capabilities to a wider range of software, including DSP and multimedia SIMD codes that currently use Neon. It also adds many new features to further expand the use of SIMD vector hardware and increase the amount of fine-grain, data-level parallelism in programs.

For backwards compatibility, the Neon instruction set remains fully supported. However, on future CPUs that implement SVE2, scalable SIMD code using SVE2 can perform as well as Neon code when running at the same 128-bit vector length.

Other benefits of SVE2 include:

Performance scales as the hardware vector length increases, without code having to be rewritten or recompiled. This can allow large-scale data processing workloads to run on a general-purpose CPU, with less need for specialized hardware accelerators, as shown in the image below.

The advanced auto-vectorization techniques enabled by SVE2 allow compilers to vectorize more loops, increasing the amount of fine-grain, data-level parallelism while reducing the need for hand coding by specialist SIMD programmers.

Parity and beyond with traditional Neon DSP/Media workloads

Transactional Memory Extension (TME)

The Transactional Memory Extension brings Hardware Transactional Memory (HTM) support to the Arm Architecture. Transactional Memory addresses the difficulty of writing highly concurrent, multi-threaded programs. By reducing serialization due to lock contention, it allows the amount of coarse-grain, thread-level parallelism to scale better with the number of CPUs.

Although high performance can be achieved using lock-free programming techniques, such code can take many years to develop because it is very hard to reason about, test and debug. Transactional Memory is a technology which reduces the difficulty of developing such software, while allowing the performance of concurrent accesses to large, shared data structures in memory to scale easily to the new breed of processors that contain many parallel CPU cores.

One of the most promising uses of Transactional Memory is known as Transactional Lock Elision (TLE), which allows existing lock-protected regions of code to be executed concurrently within a transaction. This requires no modification to the multi-threaded program, and execution only falls back to the less optimal lock-taking path if the hardware detects a conflict within the transaction.
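The control flow of lock elision can be sketched in portable C as follows. This is an emulation under stated assumptions, not TME code: the `tx_ops` hooks, the `elided_lock` type, `tle_run` and the stub implementations are all hypothetical names invented for this example, and the stubs merely pretend that a transaction either starts or fails. On real hardware, the lock word read inside the transaction would be tracked by the transaction, so a thread that later takes the lock would cause a conflict and force the fallback path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = free, 1 = taken. */
typedef struct { atomic_int held; } elided_lock;

/* Hypothetical hooks standing in for hardware transaction control. */
typedef struct {
    bool (*begin)(void);   /* true if a transaction was started */
    void (*commit)(void);  /* make the transaction's effects visible */
    void (*cancel)(void);  /* abandon the transaction */
} tx_ops;

typedef enum { PATH_ELIDED, PATH_LOCKED } tle_path;

/* Run `critical` under the lock, eliding the lock when possible.
 * Returns which path was taken so callers can observe the behaviour. */
tle_path tle_run(elided_lock *lock, const tx_ops *tx,
                 void (*critical)(void *), void *arg)
{
    if (tx->begin()) {
        /* Only check the lock, never write it: concurrent transactions
         * can then proceed without serializing on the lock word. */
        if (atomic_load(&lock->held) == 0) {
            critical(arg);
            tx->commit();
            return PATH_ELIDED;
        }
        /* Someone holds the lock for real, so give up the transaction. */
        tx->cancel();
    }

    /* Fallback: the ordinary lock-taking path (a simple spinlock here). */
    int expected = 0;
    while (!atomic_compare_exchange_weak(&lock->held, &expected, 1))
        expected = 0;
    critical(arg);
    atomic_store(&lock->held, 0);
    return PATH_LOCKED;
}

/* Emulation stubs: one set always starts a transaction, one never does. */
static bool stub_begin_ok(void)   { return true; }
static bool stub_begin_fail(void) { return false; }
static void stub_noop(void)       {}

const tx_ops tx_stub_start = { stub_begin_ok,   stub_noop, stub_noop };
const tx_ops tx_stub_fail  = { stub_begin_fail, stub_noop, stub_noop };

/* Example critical section: increment a shared counter. */
void add_one(void *arg) { *(long *)arg += 1; }
```

With `tx_stub_start` the critical section runs on the elided path, and with `tx_stub_fail` it runs after taking the lock. The critical section itself is identical in both cases, which mirrors the claim that the protected code does not have to change.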

Developing software for SVE2 and TME

Hand-in-hand with the development of these new architecture technologies, we have been preparing simulation models, software development tools, optimized libraries and programming guides to enable early software exploration and porting. An early access software development environment, including a compiler, debugger, and models for virtual prototyping, is available now for lead architecture partners.

Moreover, we will soon begin the process of contributing support for SVE2 and TME to key open source initiatives, such as the LLVM and GNU toolchains, to ensure that the software ecosystem can be ready when the first devices become available.

Additional resources are available

Arm is continually working on improvements to its architecture. These new architecture technologies, SVE2 and TME, have been in development for several years, along with the associated tools and models, and will provide improved, scalable performance across a range of future A-Profile Arm-based devices.

We gave a more detailed presentation of SVE2 and TME at Linaro Connect Bangkok in April 2019. A PDF is available to download here.

Download the XML