Introduction

This paper describes how to take advantage of the Apple* machine learning frameworks Core ML* and Metal* Performance Shaders (MPS) on Intel® Processor Graphics. It presents a performance case study of key machine learning workloads running on Intel Processor Graphics, and the techniques used to achieve high hardware efficiency with MPS primitives that are highly optimized for Intel Processor Graphics on Apple macOS* platforms.

Target Audience

This paper is useful to software developers, platform architects, data scientists, and academics seeking to maximize deep learning performance on Intel Processor Graphics on macOS platforms.

Note: Artificial intelligence (AI), machine learning (ML), and deep learning (DL) are used interchangeably in this paper. The larger field is AI. This article focuses on the ML piece of AI or, more specifically, the multilayered neural network form of ML called deep learning.

Architecture

Core ML, available on Apple devices, is the main framework for accelerating domain-specific ML inference such as image analysis, object detection, and natural language processing. Core ML lets you take advantage of Intel® processors and Intel Processor Graphics technology to build and run ML workloads on the device, so your data doesn't need to leave the device. This removes the dependency on network connectivity and reduces security and privacy concerns. Core ML is built on top of low-level frameworks, MPS (for Intel Processor Graphics) and basic neural network subroutines (BNNS) (for Intel processors), that are highly tuned and optimized for Intel® hardware to maximize hardware capability.

MPS is the main building block for Core ML to run ML workloads on graphics processing units (GPUs). As an application developer, you can write your application to use the MPS API directly to target underlying GPU devices.

Figure 1. Apple Machine Learning Inference Stack.

The rest of the article focuses mainly on the MPS layer shown in Figure 1 and describes the optimized implementation of its primitives for Intel Processor Graphics Architecture.

Performance

This section shows the current performance of MPS running on Apple macOS Mojave* on Intel Processor Graphics for key use cases. Improving these results is a continuous process, but the following figures are intended to give the reader an idea of where performance stands (see disclaimer).

General matrix-multiply (GEMM): Current results on the Intel® Iris® Graphics 550 processor graphics family for 32-bit and 16-bit floating-point types (fp32, fp16).

In the following two figures, the Y-axis is gigaflops (GFLOPS) and the X-axis is matrix size:



Figure 2. GEMM with fp16 data type vs hardware theoretical.



Figure 3. GEMM with fp32 data type vs hardware theoretical.

Full network topologies on Intel® Processor Graphics



Figure 4. Performance for key machine learning network topologies.

Apple and Intel continuously improve the MPS layers to give users the best experience with their machine learning applications. Figure 5 below shows the improvement made to MPS running on Intel® Processor Graphics Architecture from macOS High Sierra (macOS 10.13) to macOS Mojave (macOS 10.14). This is why we recommend building your machine learning application on highly optimized frameworks such as Core ML and MPS: you automatically get a performance boost from one hardware and software generation to the next on Intel® hardware.



Figure 5. Performance improvement from macOS High Sierra to macOS Mojave

Web machine learning (WebML)

There is an ongoing industry effort to let web-based ML plug into native, platform-optimized ML frameworks to meet performance efficiency, security, and privacy needs. An initial proof of concept (POC) was done using a Chromium* browser. The project at Intel was done on macOS using the MPS API. The proposed architecture, which was implemented with MPS as the backend, is shown below:

Figure 6. WebML POC Architecture.

Performance results for the above POC, with MPS as the backend, were very promising:



Figure 7. WebML POC on macOS vs legacy.

Some Implementation Insight

As mentioned earlier, MPS primitives are highly optimized for underlying Intel Processor Graphics Architecture, and algorithms are specifically designed to fit best for the underlying architecture.

Below, we talk about a couple of primitives from MPS and the optimization techniques used to achieve high hardware efficiency.

General optimization techniques

Optimization requires a detailed understanding of the architecture. The Compute Architecture of Intel® Processor Graphics gives the total number of execution units (EUs) in Intel Processor Graphics Gen9 that can be used for computing. Together with other information such as the maximum clock frequency, this lets us compute the peak performance of the system. Additional parameters such as the register space per EU, cache size, and peak memory bandwidth are also available in the specification and help to optimize compute applications for performance.

An MPS application consists of CPU host operations and GPU device-specific operations. Host-side optimizations include proper data sharing, efficient memory allocation, and enqueueing multiple shaders to pipeline tasks; device-side optimizations involve tuning the GPU shaders. Both reduce the overall running time of the application, which must be as low as possible in order to approach the theoretical peak efficiency of the hardware.

The peak performance of a graphics device is expressed as the number of floating-point operations per second (FLOPS) it can execute. The peak FLOPS for Intel Processor Graphics is computed with the following formula:

Peak FLOPS = (MUL + ADD) * Physical SIMD * Num FPUs * Num EUs * Clock Speed
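As a rough illustration of the formula, the following sketch plugs in example values. The specific numbers (2 operations for MUL+ADD, SIMD4 FPUs, 2 FPUs per EU, 24 EUs, and a fixed 850 MHz clock) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def peak_gflops(mul_add_ops, physical_simd, fpus_per_eu, num_eus, clock_hz):
    """Peak throughput in GFLOPS, following the formula above."""
    flops_per_clock = mul_add_ops * physical_simd * fpus_per_eu * num_eus
    return flops_per_clock * clock_hz / 1e9


# Assumed example configuration (not taken from a specific product sheet):
# MUL+ADD = 2 ops, SIMD4 FPUs, 2 FPUs per EU, 24 EUs, 850 MHz fixed clock.
example_peak = peak_gflops(2, 4, 2, 24, 850e6)
print(f"peak = {example_peak:.1f} GFLOPS")
```

With these assumed values the result is about 326.4 GFLOPS, which is consistent with the peak implied by Table 1 later in this article (270.97 GFLOPS reported as 83.02 percent of peak).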

MPS primitives are tuned using several techniques, which are broadly grouped into the following three:

Increasing GPU occupancy

Increasing occupancy is the most common optimization technique for improving the overall efficiency of a compute shader. Low occupancy shows up as a large number of GPU cycles in which the EUs are idle or underutilized. Distributing the work effectively across the available EUs increases GPU occupancy, so the parallel task finishes sooner. However, if the threadgroups are very light, the overhead of launching the threads nullifies the benefit of dividing the work. This shows up as a large ramp-up and ramp-down cost compared to the actual steady-state execution.

Using the register space available to the threadgroups effectively, and dividing it according to the single instruction, multiple data (SIMD) width, is critical. For instance, in Intel Gen9 processor graphics, the 4 KB of register space can be divided into 512 bytes per thread in SIMD8 mode or 256 bytes per thread in SIMD16 mode. Balancing the number of threads in a threadgroup against the amount of data used by each thread is essential for good performance.
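The register budget arithmetic can be sketched as follows. This is a minimal illustration using the 4 KB figure quoted above; the function names are hypothetical, and in practice part of the register space is also consumed by addresses and temporaries, so the usable data budget is smaller.

```python
GRF_BYTES = 4 * 1024  # 4 KB of register space, as quoted above


def bytes_per_thread(simd_width):
    """Register bytes available to each thread at a given SIMD width."""
    return GRF_BYTES // simd_width


def floats_per_thread(simd_width, bytes_per_float=4):
    """Upper bound on fp32 values one thread can keep in registers."""
    return bytes_per_thread(simd_width) // bytes_per_float


for width in (8, 16):
    print(width, bytes_per_thread(width), floats_per_thread(width))
```

Doubling the SIMD width halves the register space per thread, which is the trade-off between the number of threads and the data each thread can hold.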

Optimizing the data reads and writes

Once the EUs are occupied effectively, performance can still be bottlenecked by memory, where data is not readily available to the EUs for computation. This shows up as a large number of GPU cycles in which the EUs are stalled. Many operations that have little computation per byte of data, such as GEMV (matrix-vector multiply), suffer from this problem. When the memory bandwidth is fully utilized, there isn’t much the EUs can do other than wait for more data. Reducing the number of threadgroups and making each thread heavier frees up some threads for other tasks that may not require a lot of data. This can be done simply by vectorizing the data accesses to increase throughput with fewer threadgroups:

// non-vectorized: four scalar loads
const device float* val = (const device float*) srcBuf + offset;
res = val[0] + val[1] + val[2] + val[3];

// vectorized: one float4 load (offset is assumed to be a multiple of 4)
const device float4* val4 = (const device float4*) srcBuf + offset / 4;
float4 v = *val4;
res = v.x + v.y + v.z + v.w;

Further, limiting the number of memory access instructions issued in the Intel Processor Graphics instruction set architecture (ISA) improves the performance of the application. A particularly useful technique used throughout the MPS primitives is block read/write operations, which perform input/output on regions of a buffer or texture that are conveniently located in GPU memory.

Avoiding expensive ALU operations

If the performance of the application is still not satisfactory after applying the approaches above, using fast instructions suited to Intel Processor Graphics Architecture will help. These measures apply when the application is arithmetic-logic unit (ALU) bound, that is, when the EUs are busy for most of the GPU cycles. Since the architecture uses floating-point unit (FPU) pairs, both FPU0 and FPU1 must be equally utilized to obtain peak efficiency. Avoiding expensive ALU operations can improve performance; these include math operations (divide, cosine, sine), non-math operations (shl, cmp, mov), pointer math, and 64-bit operations.

Beyond the above approaches, there are a few things that can improve the performance of applications running on Intel Processor Graphics. The following are two of the most commonly used approaches.

Understanding memory hierarchy

A good understanding of the memory hierarchy is very helpful in writing optimized shaders. The most fundamental technique for optimizing shaders on Intel Processor Graphics is to quickly swizzle, broadcast, and reduce data residing in the registers.

Zero copy textures from buffers

Since Intel Processor Graphics Architecture supports shared physical memory for CPU and GPU resources, it is possible to create formatted GPU surfaces such as a 2D texture by pointing to the backing unformatted CPU buffer in the texture descriptor, as shown here:

id<MTLBuffer> b = [device newBufferWithLength: bufLen options: MTLResourceStorageModeManaged];
MTLTextureDescriptor *desc = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat: pixFmt width: w height: h mipmapped: NO];
id<MTLTexture> t = [b newTextureWithDescriptor: desc offset: 0 bytesPerRow: w * pixSize];

Using the zero-copy approach to create a texture essentially creates a linear image (data in row-major order), as opposed to a tiled image. It is therefore less efficient for operating on data across multiple rows, as it requires additional address computation. Further, a linear image can only be used with dimensions where bytesPerRow is aligned to the image row pitch.
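The row-pitch constraint and the extra address computation for a linear image can be illustrated with a small sketch. The 256-byte alignment used here is a hypothetical value chosen for illustration; on a real system the required alignment would be queried from the Metal device (for example with minimumLinearTextureAlignmentForPixelFormat:). The function names are also hypothetical.

```python
def aligned_bytes_per_row(width, bytes_per_pixel, alignment):
    """Round the natural row size up to the next multiple of alignment."""
    row_bytes = width * bytes_per_pixel
    return (row_bytes + alignment - 1) // alignment * alignment


def linear_offset(x, y, bytes_per_pixel, bytes_per_row):
    """Byte offset of texel (x, y) in a linear (row-major) image."""
    return y * bytes_per_row + x * bytes_per_pixel


# Hypothetical example: 100 texels wide, 16 bytes per texel (RGBA fp32),
# and an assumed 256-byte row alignment.
pitch = aligned_bytes_per_row(100, 16, 256)
print(pitch, linear_offset(3, 2, 16, pitch))
```

When the natural row size (w * pixSize in the snippet above) is not already a multiple of the required alignment, the buffer has to be allocated with the padded pitch instead.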

Convolutional neural network (CNN)

MPS provides various primitives such as convolution, depthwise convolution, dilated convolution, pooling, fully connected (for full list check Convolutional Neural Network Kernels), and so on to build up your neural network from highly optimized low-level primitives. These primitives are highly optimized to take advantage of Intel Processor Graphics Architecture.

This part covers the optimizations applied to the convolution algorithm to make it reach peak performance on Intel Processor Graphics Gen9 hardware.

Convolution description

The job of a convolutional layer is to traverse an input image containing N input channels in steps of the stride size, applying a convolutional window of the filter size at each step. To produce M different outputs per convolution window, M different output-channel filters are used. The amount of filter data needed for a convolutional layer is N * M * K, where K is the number of values in a convolutional window.

The figure below shows a single application of a convolutional window for a 3 x 3 filter.

Figure 8. Example of applying a single convolutional window for a 3 x 3 filter.
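To make the description concrete, here is a minimal reference sketch of a convolutional layer and of the N * M * K filter-size calculation. It is a plain, unoptimized host-side illustration, not the MPS implementation; the function names and the data layout (image[c][y][x], weights[o][c][fy][fx], no padding) are assumptions made for this example.

```python
def filter_value_count(n_in, m_out, filter_w, filter_h):
    """N * M * K, where K = filter_w * filter_h values per window."""
    return n_in * m_out * filter_w * filter_h


def conv2d(image, weights, stride):
    """image[c][y][x], weights[o][c][fy][fx] -> out[o][y][x], no padding."""
    n_in, height, width = len(image), len(image[0]), len(image[0][0])
    fh, fw = len(weights[0][0]), len(weights[0][0][0])
    out_h = (height - fh) // stride + 1
    out_w = (width - fw) // stride + 1
    out = []
    for filt in weights:                      # one filter per output channel
        plane = []
        for oy in range(out_h):
            row = []
            for ox in range(out_w):           # one convolutional window
                acc = 0.0
                for c in range(n_in):
                    for fy in range(fh):
                        for fx in range(fw):
                            pixel = image[c][oy * stride + fy][ox * stride + fx]
                            acc += pixel * filt[c][fy][fx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


print(filter_value_count(16, 32, 3, 3))
```

For example, 16 input channels, 32 output channels, and a 3 x 3 filter need 16 * 32 * 9 filter values.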

Input data storage optimization

Input/output data is stored in RGBA format using a Metal texture resource type, where each texel component contains a different input channel. For example, 16 input channels are stored in four textures in an array of textures. To make the most of each sampled input, four input channels are loaded and stored at a time, so every component of a sampled texel is used (100 percent utilization). The iteration pattern is also important here. Reading input data repeatedly from the same texture in the array across the filter window, and only then moving to the next texture in the texture2d_array, makes cache reuse more efficient.
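The channel-to-texel mapping described above can be sketched as follows. This assumes channels are packed in order, four per texture slice, with channel c landing in component c mod 4; the function names are hypothetical.

```python
def slices_needed(num_channels):
    """Number of RGBA textures in the array for a given channel count."""
    return (num_channels + 3) // 4


def texel_location(channel):
    """(slice index, component) holding a given feature channel."""
    return channel // 4, "rgba"[channel % 4]


print(slices_needed(16), texel_location(5))
```

With 16 input channels this gives the four textures mentioned above, and a single texel read returns four consecutive channels.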

Efficient weights storage

For weights storage, an RGBA texture2D is used. Each texel component contains a different input feature channel. Different output feature channels are stored along the X-coordinate; along the Y-coordinate we store filter_X, filter_Y, and input channel slices, in that order. If the number of input features is not a multiple of 4, the remaining components are filled with zeros. This layout lets the algorithm iterate over the Y texture coordinate while the X-coordinate stays the same for the whole execution of the thread. It also has very good cache locality, since the 16 output features needed by the SIMD-group are very close together in memory.



Figure 9. Example of the layout of weights in a texture2D. “I” stands for input channels and “O” stands for output channels. The weights contain 3 x’s, 3 y’s, 32 output channels, and N input channels.



Figure 10. Example of weights storage containing 3 x 3 filters for two output feature channels
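One possible reading of this weights layout is sketched below. It assumes that the X-coordinate is the output channel, that filter_X varies fastest along Y, followed by filter_Y and then the input channel slice, and that the texel component is the input channel modulo 4. The exact ordering used by MPS is not spelled out here, so treat the index formula and the function names as assumptions.

```python
def weight_texel(o, i, fx, fy, filter_w, filter_h):
    """(x, y, component) of weight for output o, input i, tap (fx, fy)."""
    slice_index, component = divmod(i, 4)
    y = (slice_index * filter_h + fy) * filter_w + fx
    return o, y, component


def pack_weights(weights):
    """weights[o][i][fy][fx] -> texture[y][x] = [r, g, b, a], zero padded."""
    m_out, n_in = len(weights), len(weights[0])
    fh, fw = len(weights[0][0]), len(weights[0][0][0])
    rows = ((n_in + 3) // 4) * fh * fw
    texture = [[[0.0] * 4 for _ in range(m_out)] for _ in range(rows)]
    for o in range(m_out):
        for i in range(n_in):
            for fy in range(fh):
                for fx in range(fw):
                    x, y, comp = weight_texel(o, i, fx, fy, fw, fh)
                    texture[y][x][comp] = weights[o][i][fy][fx]
    return texture
```

Under this assumed layout a thread that computes one output channel keeps x fixed and simply walks down y, and components left over when the input channel count is not a multiple of 4 stay zero.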

Efficient read of input

The algorithm calculates multiple spatial outputs per thread, because adjacent convolutional windows share input data, as in the picture below:



Figure 11. Example of data sharing between different applications of the convolutional window.

In this way, computing multiple convolution windows is much cheaper in terms of the memory bandwidth needed. Notice that the smaller the stride and the bigger the filter size, the more fields are shared between convolution windows, and the greater the bandwidth savings.
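The saving can be quantified with a small sketch that counts how many input columns one thread must read for a run of adjacent outputs along a row, with and without sharing. The function names are hypothetical, and the count considers a single row and a single channel only.

```python
def input_columns_separate(outputs, filter_w):
    """Columns read if every window loads its own inputs."""
    return outputs * filter_w


def input_columns_shared(outputs, filter_w, stride):
    """Distinct columns covered by `outputs` adjacent windows."""
    if stride >= filter_w:
        return outputs * filter_w          # windows do not overlap
    return (outputs - 1) * stride + filter_w


for stride in (1, 2, 3):
    shared = input_columns_shared(4, 3, stride)
    print(stride, shared, input_columns_separate(4, 3))
```

For four outputs with a 3-wide filter, stride 1 needs 6 distinct columns instead of 12, while stride 3 shares nothing.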

Use of shuffle through GRF and SIMD group

Spatial input data is preloaded into the general purpose register file (GRF), the fastest memory available to an EU, before the calculations begin. This makes later reads of the data much faster. Every thread preloads a different X-coordinate, looping through the Y-dimension. The simd_shuffle instruction is used to share data across the threads of a SIMD-group.

Writing of Output Efficiently

For writing output, we use SIMD-group shuffle as well to convert 16 single output features into four groups of four output feature channels each. Then we output them in that form, to have only one write request per SIMD-group.

Matrix multiplication

In this section, we discuss the optimization techniques used to improve the performance of GEMM operations in MPS. Several aspects of Intel Processor Graphics optimization addressed in the previous section apply here as well. In addition, this section introduces higher-level techniques that were not covered before, in particular the use of SIMD-groups for data sharing among threads and block input/output operations for improving data read bandwidth. First, let us understand the specification and the limitations of the matrix multiplication operation in MPS.

MPS Matrix multiplication specification

MPS contains the highly tuned Matrix API for Intel® Processor Graphics Architecture.

Apple’s MPS provides an interface for matrix multiplication operations for fast pre- and post-processing of matrix data types. The matrix multiplication primitive is called MPSMatrixMultiplication; it takes MPSMatrix objects as inputs and output and performs the following computation:

C = α * A * B + β * C

where A and B are the input matrices, C is the output matrix, and α and β are scalar coefficients of the same data type as A and B.
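As a point of reference, the standard GEMM form with scalar coefficients, C = α * A * B + β * C, can be written as a plain host-side sketch. This is an unoptimized illustration of the arithmetic only, not the MPS API; the function name and the list-of-rows matrix representation are assumptions.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as row lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            dot = sum(a[i][k] * b[k][j] for k in range(inner))
            out_row.append(alpha * dot + beta * c[i][j])
        result.append(out_row)
    return result


print(gemm(2, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 3, [[1, 1], [1, 1]]))
```

Each output element costs one dot product of length K, which is why a multiply of an M x K matrix by a K x N matrix is usually counted as 2 * M * N * K floating-point operations.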

MPS matrix multiplication is accelerated on the execution units of Intel Processor Graphics and achieves roughly 85 percent of peak theoretical performance. The following table shows the time and efficiency of the GEMM shader for a sample input dimension with three different approaches.

Algorithm                      Time (ms)   Efficiency (GFLOPS)   % of Peak GFLOPS
GEMM Naïve                     2.2335      15.02                 4.60%
GEMM SIMD-groups               0.1413      237.42                72.74%
GEMM SIMD-groups + Block IO    0.1238      270.97                83.02%

Table 1. Different GEMM implementation efficiency
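The relationship between the columns of Table 1 can be checked with a short sketch. Two values here are inferences rather than figures stated in the article: the peak of 326.4 GFLOPS is derived from 270.97 GFLOPS being listed as 83.02 percent of peak, and the 256 x 256 x 256 problem size is simply the size whose operation count (2 * M * N * K) is consistent with the reported times. The function names are hypothetical.

```python
PEAK_GFLOPS = 326.4  # assumption: implied by the last row of Table 1


def gemm_gflops(m, n, k, time_ms):
    """Throughput of an M x K by K x N multiply, counting 2*M*N*K flops."""
    return 2.0 * m * n * k / (time_ms * 1e-3) / 1e9


def percent_of_peak(gflops, peak=PEAK_GFLOPS):
    return 100.0 * gflops / peak


# Times from Table 1; the 256^3 problem size is an inferred assumption.
for name, time_ms in (("naive", 2.2335),
                      ("simd-groups", 0.1413),
                      ("simd-groups + block io", 0.1238)):
    gflops = gemm_gflops(256, 256, 256, time_ms)
    print(f"{name}: {gflops:.2f} GFLOPS, {percent_of_peak(gflops):.2f}% of peak")
```

The computed values agree with the table to within the rounding of the reported times.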

SIMD-Group based implementation

In order to achieve at least 50 percent of theoretical efficiency, we adopt an optimization technique that makes use of the SIMD nature of the threads and the high read-write bandwidth of the general purpose register file (GRF) to quickly share data across threads. The approach makes use of the concept of SIMD-groups.

A SIMD-group is a subset of threads within a threadgroup, with a size that is a power of 2, up to a maximum of 32. All SIMD-groups have the same length, except possibly the last one when the threadgroup size is not evenly divisible by the SIMD-group size. Usually the threadgroup size is made evenly divisible by the SIMD-group size in order to use SIMD operations efficiently. The following shows how a threadgroup is organized in SIMD8 mode with a partial SIMD-group:



Figure 12. Thread group with multiple SIMD-groups

simd_broadcast makes a single scalar value present in a particular thread available to all the threads in the SIMD-group, whereas simd_shuffle makes a vector value (up to 16 wide) present in a particular thread available to the requesting thread. In either case, the copy is done with very low overhead, as the data is in the register space.

Each thread obtains the value at thread 0 with one call:

float val0 = simd_broadcast(val, 0);
…

Figure 13a. SIMD broadcast on each thread.

Thread 0 obtains the value at each thread with seven calls:

float4 val1 = simd_shuffle(val, 1);
…

Figure 13b. SIMD shuffle on Thread 0.

The picture below illustrates a simple usage of simd_shuffle:

To get the green value, which is held in array[3] of the SIMD-group thread numbered 6, we just need to call:

simd_shuffle(array[3], 6);

Figure 14. Threadgroups and SIMD-groups spanning multiple threads.
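The semantics of the two operations can be emulated on the host with a short sketch, where a list holds one value per thread of a SIMD-group. The Python functions below only mimic the data movement; they are illustrative stand-ins, not the Metal functions themselves.

```python
def simd_broadcast(per_thread_values, source_thread):
    """Every thread receives the value held by source_thread."""
    return [per_thread_values[source_thread]] * len(per_thread_values)


def simd_shuffle(per_thread_values, requested_thread):
    """Thread t receives the value held by thread requested_thread[t]."""
    return [per_thread_values[src] for src in requested_thread]


values = [10, 11, 12, 13, 14, 15, 16, 17]   # one value per SIMD8 thread
print(simd_broadcast(values, 0))
print(simd_shuffle(values, [6] * 8))         # everyone asks for thread 6
print(simd_shuffle(values, [7, 6, 5, 4, 3, 2, 1, 0]))
```

A broadcast is the special case of a shuffle in which every thread requests the same source thread.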

Block-based optimization is one of the most common approaches used to parallelize the GEMM operation: the product of each pair of blocks contributes a partial sum to the corresponding block of the result matrix. By repeatedly performing this block-based operation in a loop, while accumulating the partial results as shown in the diagram, the full matrix product is computed.



Figure 15. Block-based matrix multiplication.
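The accumulation of partial block products can be sketched on the host as follows. This is a plain illustration of the blocking idea for square matrices, with a hypothetical function name; it says nothing about how MPS maps blocks to threads.

```python
def blocked_matmul(a, b, block):
    """Multiply square matrices by accumulating block-by-block products."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, block):              # block row of C
        for bj in range(0, n, block):          # block column of C
            for bk in range(0, n, block):      # loop over block pairs
                for i in range(bi, min(bi + block, n)):
                    for j in range(bj, min(bj + block, n)):
                        acc = c[i][j]
                        for k in range(bk, min(bk + block, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc          # partial result accumulates
    return c
```

Each pass over bk adds one block-pair product into the same block of C, so after the loop finishes every block of C holds its complete sum.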

The block-based matrix multiplication work makes use of SIMD operations like simd_broadcast or simd_shuffle.

Within each block matrix multiplication, each thread reads eight values from Matrix A and eight values from Matrix B, laid out in column-major order. Each row in Matrix A needs to be multiplied with each column in Matrix B. Therefore, threads 0 to 7 use simd_shuffle to gather values from each other and compute the dot product with each element in Matrix B, giving a partial result for one column of C. The column is complete when all rows of Matrix A have gone through this step.

The following illustration shows how thread0 gathers all the values of Matrix A from other threads for computing the partial result for column 0.



Figure 16. Thread0 requests data from other threads to compute one column of output.
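One way to read this scheme is emulated below for a single 8 x 8 block: thread j holds column j of A and column j of B in its registers, gathers the other columns of A through a shuffle, and accumulates column j of C. The Python simd_shuffle is a host-side stand-in for the Metal operation, and the assignment of columns to threads is an assumption made for this sketch.

```python
SIMD_WIDTH = 8


def simd_shuffle(per_thread_values, requested_thread):
    """Thread t receives the value held by thread requested_thread[t]."""
    return [per_thread_values[src] for src in requested_thread]


def block_multiply(a_cols, b_cols):
    """a_cols[j], b_cols[j]: column j of the A and B blocks, held by thread j.
    Returns c_cols, where c_cols[j] is column j of A * B."""
    c_cols = [[0.0] * SIMD_WIDTH for _ in range(SIMD_WIDTH)]
    for k in range(SIMD_WIDTH):
        # Every thread fetches column k of A from the thread that holds it.
        a_k = simd_shuffle(a_cols, [k] * SIMD_WIDTH)
        for thread in range(SIMD_WIDTH):
            b_kj = b_cols[thread][k]           # element B[k][thread]
            for row in range(SIMD_WIDTH):
                c_cols[thread][row] += a_k[thread][row] * b_kj
    return c_cols
```

No thread ever reloads A from memory: after the initial read, all sharing happens between registers, which is what makes the shuffle-based block multiply cheap.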

simd_shuffle and simd_broadcast operations for different block sizes are available, depending on the need. In this particular example, for a block size of 8 x 8, each SIMD8 thread reads 8 rows of scalars. For a block size of 16 x 16, each SIMD8 thread reads 16 rows of 2-element vectors. Alternatively, for the same 16 x 16 block, a SIMD16 thread would read 16 rows of scalars. Changing the block size to increase the work done per thread increases EU occupancy, especially for large matrix dimensions.

The SIMD-group based implementation shown above achieves greater than 70 percent of the theoretical performance on a relatively small matrix. However, the performance does not scale beyond a 1 K x 1 K matrix, because of the increased address calculation for individual rows of data.

Block Input/Output-Based implementation

To sustain performance scaling beyond certain matrix dimensions, we make use of a technique that is unique to Intel Processor Graphics. Block input/output lets us read a block of data of a shape suitable for matrix multiplication, taking advantage of Intel Processor Graphics cache locality. Combined with the SIMD-group based approach, this achieves higher efficiency.

Summary

This article is intended to show the benefit of using the Core ML and MPS APIs on macOS platforms to take full advantage of the underlying Intel® Processor Graphics architecture. The APIs are highly optimized for ML needs, and there should be no need to write your own ML primitives for the underlying device. The idea is to use them as an optimized library on which to build your ML application, and so take full advantage of the power of the Intel Processor Graphics that ships with macOS platforms. As mentioned earlier, Apple and Intel are continuously improving the ML layers from one hardware and software generation to the next, giving end users a seamless performance improvement.

References

For more information or to get started, download the tools or libraries from the links below:

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software, or configuration may affect your actual performance.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors and processor graphics.

Performance results are based on testing on macOS High Sierra and macOS Mojave releases at different times (see individual performance chart disclaimers) and may not reflect all publicly available security updates. No product can be absolutely secure. Configurations used for test and performance data: MacBook Pro 13” with Iris® Graphics 550 and 530, some with a fixed 850 MHz frequency and some with dynamic frequency. All testing was performed at Intel. Numbers may differ based on the actual hardware used and on how the benchmark is written. Intel makes no guarantee on the specific numbers; they are provided for reference only.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.html.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel, the Intel logo, and Iris are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.