Arm’s Cortex-A76 design offers speed/efficiency improvements including a 4x boost in AI performance, and is paired with a new Mali-G76 GPU that is also said to aid AI. Meanwhile, Arm revealed more details on its upcoming ML co-processors.



Very few non-server systems run software that could be called machine learning (ML) and artificial intelligence (AI). Yet, server-class “AI on the Edge” applications are coming to embedded devices, and Arm intends to fight with Intel and AMD over every last one of them.







Cortex-A76 performance comparisons


(Source: Arm)



Arm recently announced a new Cortex-A76 architecture that is claimed to boost the processing of AI and ML algorithms on edge computing devices by a factor of four. This does not include ML performance gains promised by the new Mali-G76 GPU. There’s also a Mali-V76 VPU designed for high-res video. The Cortex-A76 and the two Mali designs are intended to “complement” Arm’s Project Trillium Machine Learning processors (see below).



Cortex-A76 builds on new architecture

The Cortex-A76 differs from the Cortex-A73 and Cortex-A75 IP designs in that it’s designed as much for laptops as for smartphones and high-end embedded devices. Cortex-A76 provides “35 percent more performance year-over-year,” compared to Cortex-A75, claims Arm. The IP, which is expected to arrive in products a year from now, is also said to provide 40 percent improved efficiency.

Like Cortex-A75, which is equivalent to the latest Kryo cores available on Qualcomm’s Snapdragon 845, the Cortex-A76 supports DynamIQ, Arm’s more flexible version of its big.LITTLE multi-core scheme. Unlike Cortex-A75, which was announced with a Cortex-A55 companion core, Arm had no new DynamIQ companion for the Cortex-A76. However, the diagram below suggests that it, too, is designed to work with the Cortex-A55 in heterogeneous designs.







Cortex-A76 with Cortex-A55 in DynamIQ-configured heterogeneous SoC


(Source: Arm)



Cortex-A76 enhancements are said to include decoupled branch prediction and instruction fetch, as well as Arm’s first 4-wide decode core, which boosts the maximum instructions-per-cycle (IPC) capability. There’s also higher integer and vector execution throughput, including support for dual-issue native 16B (128-bit) vector and floating-point units. Finally, the new full-cache memory hierarchy is “co-optimized for latency and bandwidth,” says Arm.

Unlike the latest high-end Cortex-A releases, Cortex-A76 represents “a brand new microarchitecture,” says Arm. This is confirmed by AnandTech’s usual deep-dive analysis. Cortex-A73 and -A75 debuted elements of the new “Artemis” architecture, but the Cortex-A76 is built from scratch with Artemis.

The Cortex-A76 should arrive on 7nm-fabricated TSMC products running at 3GHz, says AnandTech. The 4x improvements in ML workloads are primarily due to new optimizations in the ASIMD pipelines “and how dot products are handled,” says the story.
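To give a rough sense of why dot-product handling matters for ML workloads, the sketch below emulates, in plain Python, a 4-way 8-bit dot product accumulated into each 32-bit lane of a 128-bit vector, which is the general pattern behind integer dot-product SIMD instructions. Whether the Cortex-A76 gains come from exactly this pattern is an assumption, and the function names are hypothetical; this is not Arm’s implementation.

```python
# Illustrative sketch only: emulates a 4-way int8 dot product per 32-bit lane.
# The names dot4_lane and dot_accumulate are hypothetical, not an Arm API.
# Overflow/wrap-around of the 32-bit accumulators is not modelled.

def dot4_lane(acc, a4, b4):
    """Add the dot product of four signed 8-bit values to an accumulator."""
    assert len(a4) == len(b4) == 4
    for x, y in zip(a4, b4):
        assert -128 <= x <= 127 and -128 <= y <= 127
        acc += x * y
    return acc


def dot_accumulate(acc_lanes, a_bytes, b_bytes):
    """Process one 128-bit vector: 16 int8 inputs feed four accumulators."""
    assert len(acc_lanes) == 4
    assert len(a_bytes) == len(b_bytes) == 16
    return [
        dot4_lane(acc_lanes[i],
                  a_bytes[4 * i:4 * i + 4],
                  b_bytes[4 * i:4 * i + 4])
        for i in range(4)
    ]


# One vector step of a larger dot product: 16 multiply-accumulates at once.
lanes = dot_accumulate([0, 0, 0, 0], list(range(16)), [1] * 16)
total = sum(lanes)  # reduce the four lanes to a single scalar
```

Neural network inference is dominated by multiply-accumulate operations of this kind, so doing 16 of them per vector operation, rather than widening and multiplying in separate steps, is one plausible route to a large speedup on quantized models.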

Meanwhile, The Register noted that Cortex-A76 is Arm’s first design that will exclusively run 64-bit kernel-level code. The cores will support 32-bit code, but only at non-privileged levels, says the story.



Mali-G76 GPU and Mali-V76 VPU




The new Mali-G76 GPU announced with Cortex-A76 targets gaming, VR, AR, and on-device ML. The Mali-G76 is said to provide 30 percent more efficiency and performance density and 1.5x improved performance for mobile gaming. The Bifrost architecture GPU also provides 2.7x ML performance improvements compared to the Mali-G72, which was announced last year with the Cortex-A75.

The Mali-V76 VPU supports UHD 8K viewing experiences. It’s aimed at 4×4 video walls, which are especially popular in China, and is designed to support the 8K video coverage that Japan is promising for the 2020 Olympics. [email protected] streams require four times the bandwidth of [email protected] streams. To achieve this, Arm added an extra AXI bus and doubled the line buffers throughout the video pipeline. The VPU also supports [email protected] decode.
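The fourfold bandwidth figure follows from pixel counts alone. The minimal sketch below assumes the common UHD frame sizes (3840×2160 for 4K, 7680×4320 for 8K) and the same frame rate, bit depth, and chroma format for both streams; none of these specifics come from Arm, and the function names are hypothetical.

```python
# Assumed frame sizes (not stated in the article): 4K UHD and 8K UHD.
UHD_4K = (3840, 2160)
UHD_8K = (7680, 4320)


def pixels_per_second(resolution, fps):
    """Raw pixel rate for a stream, ignoring bit depth and compression."""
    width, height = resolution
    return width * height * fps


def bandwidth_ratio(fps=60):
    """Ratio of 8K to 4K pixel rate; fps is arbitrary because it cancels."""
    return pixels_per_second(UHD_8K, fps) / pixels_per_second(UHD_4K, fps)


ratio = bandwidth_ratio()
print(ratio)  # 4.0
```

Doubling both width and height quadruples the pixels per frame, which is consistent with the extra AXI bus and doubled line buffers described above.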



Project Trillium’s ML chip detailed

Arm previously revealed other details about the Machine Learning (ML) processor, also referred to as MLP. The ML chip will accelerate AI applications including machine translation and face recognition.

The new processor architecture is part of the Project Trillium initiative for AI, and follows Arm’s second-gen Object Detection (OD) Processor for optimizing visual processing and people/object detection. The ML design will debut as a co-processor in mobile phones by late 2019.







ML architecture


(Source: Arm via AnandTech)



Numerous block diagrams for the MLP were published by AnandTech, which was briefed on the design. While stating that any judgment about the performance of the still unfinished ML IP will have to wait for next year’s silicon release, the publication says that the ML chip appears to check off all the requirements of a neural network accelerator, including providing efficient convolutional computations and data movement while also enabling sufficient programmability.

Arm claims the chips will provide more than 3 TOPS per watt in 7nm designs with absolute throughput of 4.6 TOPS, implying a target power of approximately 1.5W. For programmability, MLP will initially target Android’s Neural Networks API and Arm’s NN SDK.
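The implied power figure can be checked with simple division. This is a back-of-the-envelope sketch using only the two numbers quoted above; the function name is hypothetical, and because the efficiency claim is a lower bound (more than 3 TOPS per watt), the result is best read as an upper bound on power at full throughput.

```python
def implied_power_watts(throughput_tops, efficiency_tops_per_watt):
    """Power needed to sustain a throughput at a given efficiency."""
    return throughput_tops / efficiency_tops_per_watt


# Figures quoted in the article: 4.6 TOPS at (at least) 3 TOPS per watt.
power = implied_power_watts(4.6, 3.0)
print(f"{power:.2f} W")  # 1.53 W
```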

This article is copyright © 2018 Linux.com and was originally published here. It has been reproduced by this site with the permission of its owner. Please visit Linux.com for up-to-date news and articles about Linux and open source.

