ARM is usually better known as the provider of CPU IP that powers almost all smartphones and tablets. A lesser-known fact is that ARM ships more GPUs than anyone else - 750m in 2015 alone. Known as Mali, ARM graphics can be spied in smartphones, tablets and desktop boxes.

Up until yesterday the premium graphics offering from ARM was the Mali-T880, found in cutting-edge smartphones such as the Huawei Mate 8 and Samsung Galaxy S7. Today, however, ARM is unveiling new graphics technology that promises 50 per cent more performance than ever seen before from the Mali stable. Making this performance leap happen is the Mali-G71 GPU that is built on a brand-new graphics architecture called Bifrost.

Yet before we get to how Bifrost differs from the incumbent Midgard architecture powering the Mali-T880 - the new GPU is manifestly more than a scaled-up version of what has gone before - it's worth putting the Mali-G71's performance into some context.

Mali-G71 will be made available in one- to 32-core configurations later this year, and ARM says the 16-core version has enough graphics horsepower to beat an entry-level discrete GPU from either Nvidia or AMD in pure performance stakes. Sure, ARM is presenting a best-case scenario using a preferred benchmark, but there's enough clout to run the latest games engines on a mobile device, be it a tablet or premium smartphone.

A new way - Bifrost

Architectures, be they CPU or GPU, can usually be refined and tweaked over successive generations to offer a handy boost in performance. Every so often, a clean break is needed to better take advantage of new workloads and applications that somewhat expose the limitations of older architectures. This is effectively what ARM is doing with Mali-G71, the first graphics chip based on the brand-new Bifrost architecture, supplanting Midgard on the present range.

The major redesign within Bifrost, therefore, looks at ways to do more with a given die area. The first important change is an enhanced instruction-set architecture that takes advantage of clause-based shaders.

Clause-based shaders

In the present graphics model, shown above left, there is inevitable overhead each time instructions are executed. This overhead takes the form of the necessary scheduling required to make sure any dependencies from one instruction to another are fully taken into account. Dependencies are written into the register file of the GPU and reading/writing them takes time and energy, which is particularly painful on a mobile device.

Clause execution enables the Mali-G71 to group a number of instructions together and have them complete back-to-back without interruption or extra overhead incurred by explicitly writing out their states to the normal register file on a single-instruction basis.

Clause execution can work on a batch basis because the result from one instruction is often simply used as the input for the next. Rather than writing each result out to the register file, which, as we have mentioned, is costly and inefficient, Mali-G71 instead creates a smaller temporary register file that guarantees a clause will complete successfully without error. The principle is largely the same; the execution is different and more efficient.
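To make the saving concrete, here is a toy model, not ARM's actual design: it counts how many results would have to be written to the main register file when a short chain of dependent instructions is grouped into clauses of different sizes. The instruction format, the register names and the `register_file_writes` helper are all assumptions made purely for illustration.

```python
from typing import List, Set, Tuple

# Each instruction is (destination, sources). All names are hypothetical.
Instr = Tuple[str, Tuple[str, ...]]

# A simple dependency chain: every result feeds the next instruction.
PROGRAM: List[Instr] = [
    ("t0", ("x", "y")),
    ("t1", ("t0", "z")),
    ("t2", ("t1", "w")),
    ("out", ("t2", "v")),
]


def register_file_writes(program: List[Instr], clause_size: int,
                         live_out: Set[str]) -> int:
    """Count results that must be written to the main register file.

    A result stays in a clause-local temporary unless it is read by an
    instruction outside its own clause, or is still needed at the end
    of the program (listed in live_out).
    """
    writes = 0
    for start in range(0, len(program), clause_size):
        end = min(start + clause_size, len(program))
        for i in range(start, end):
            dest = program[i][0]
            escapes = False
            for j in range(i + 1, len(program)):
                later_dest, later_srcs = program[j]
                if dest in later_srcs and j >= end:
                    escapes = True  # read by a later clause
                    break
                if later_dest == dest:
                    break  # overwritten before leaving the clause
            else:
                # Never overwritten and never read outside the clause:
                # only written out if it is a final result.
                escapes = dest in live_out
            if escapes:
                writes += 1
    return writes


for size in (1, 2, 4):
    print(size, register_file_writes(PROGRAM, size, {"out"}))
```

In this sketch, single-instruction "clauses" need four register-file writes, clauses of two need two, and one clause of four needs only the final write of `out`. Real shaders are not neat chains, so the actual saving will vary.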

Grouping instructions in this fashion can cause a delay if the next clause is not ready, but Mali-G71 has an intelligent scheduler that can fast track other clauses to fill in the intervening time gap.
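The gap-filling idea can be sketched with another toy model. It assumes the clauses come from independent threads, so they may run in any order, and that each clause has a known cycle at which its inputs become available; the `Clause` type and both functions are hypothetical and say nothing about how ARM's scheduler is actually built.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clause:
    name: str
    ready_at: int  # cycle at which the clause's inputs are available
    length: int    # cycles the clause runs for, back-to-back


def run_in_order(clauses: List[Clause]) -> int:
    """Strict ordering: stall whenever the next clause is not ready."""
    cycle = 0
    for clause in clauses:
        cycle = max(cycle, clause.ready_at) + clause.length
    return cycle


def run_with_fill(clauses: List[Clause]) -> int:
    """Run any clause that is ready; only idle if none is."""
    pending = list(clauses)
    cycle = 0
    while pending:
        ready = [c for c in pending if c.ready_at <= cycle]
        if ready:
            clause = ready[0]
            pending.remove(clause)
            cycle += clause.length
        else:
            # Nothing can run yet: skip ahead to the next ready time.
            cycle = min(c.ready_at for c in pending)
    return cycle


# Clause B waits on something slow (say, a memory access) until cycle 10.
work = [
    Clause("A", ready_at=0, length=4),
    Clause("B", ready_at=10, length=4),
    Clause("C", ready_at=0, length=4),
    Clause("D", ready_at=0, length=4),
]
print(run_in_order(work), run_with_fill(work))
```

With these made-up numbers, strict ordering takes 22 cycles because the pipeline idles waiting for B, whereas running C and D in the gap brings the total down to 16.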

Quad-thread vectorisation

Present ARM graphics architectures use SIMD vectorisation where one thread is processed at one time for each pipeline stage. Under perfect conditions high efficiency is achieved, but not all modern workloads are so perfectly aligned. In that case, the hardware needs to find other ways of filling the pipeline.

Mali-G71 opts for executing four scalar threads in a quad. The effect might seem subtle in the above pictures, yet the way in which threads are processed has been turned completely on its side. Think of quad vectorisation as running wavefronts on AMD hardware or warps in Nvidia-speak.

Running multiple scalar threads concurrently through the graphics hardware has the advantage of matching how developers now write their engine shader code, and it maintains higher throughput efficiency to boot.
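A rough way to see the efficiency argument is to count busy lanes. The sketch below is a deliberately simplified model, assuming a four-lane unit, four threads running the same shader, and operations that are one to four components wide; the function names and the example operation mix are invented for illustration and do not come from ARM.

```python
from typing import Sequence


def simd_cycles(widths: Sequence[int], threads: int = 4) -> int:
    """One thread per cycle: each operation occupies the whole unit,
    however few of its lanes the operation actually needs."""
    return len(widths) * threads


def quad_cycles(widths: Sequence[int], threads: int = 4,
                quad_size: int = 4) -> int:
    """Four threads per cycle: each component is a scalar issue that is
    executed for every thread in the quad at once."""
    quads = -(-threads // quad_size)  # ceiling division
    return sum(widths) * quads


def lane_utilisation(widths: Sequence[int], cycles: int,
                     threads: int = 4, lanes: int = 4) -> float:
    """Fraction of lane-cycles that did useful work."""
    useful = sum(widths) * threads
    return useful / (cycles * lanes)


# A hypothetical mix: vec3, scalar, vec2, vec4, scalar.
mixed = [3, 1, 2, 4, 1]
print(simd_cycles(mixed), lane_utilisation(mixed, simd_cycles(mixed)))
print(quad_cycles(mixed), lane_utilisation(mixed, quad_cycles(mixed)))
```

For this mix the one-thread-at-a-time model needs 20 cycles and keeps 55 per cent of lanes busy, while the quad model needs 11 cycles with every lane busy. Feed both models nothing but four-wide operations and they tie, which is the "perfect conditions" case.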

Going wide, in a graphics sense, accrues ancillary benefits such as power saving that arises from only having to fetch one scalar operation per pipe for every clock cycle. The end result is far less pressure on the instruction cache bandwidth.

The wrap

ARM has previously eked out significant performance from the Midgard graphics architecture powering the Mali-T-series GPUs. However, the changing face of mobile graphics and the way in which games engines require shader code to be processed means that the older approach left efficiency on the table.

The new Bifrost architecture, first productised in the Mali-G71 announced today, increases performance by being more efficient. Key to this efficiency is clause-based shading and a complete reworking of how threads are processed through the hardware.

Bifrost will be the cornerstone on which performance ARM Mali GPUs are built for the next few years, so we're eager to see how this first iteration performs in popular benchmarks.