Introduction

The Ultra96 provides both a processing system (PS) and programmable logic (PL). One of the key benefits of this heterogeneous approach is the ability to accelerate functions by moving them from software running on the processing system into the programmable logic.

Zynq MPSoC Architecture

Done correctly (we will explore this further down), moving functions into the programmable logic results in a significant increase in performance.

To get the best from this acceleration, we also want to accelerate the design cycle and remove the need to create separate PS / PL developments. These separate design processes have traditionally accounted for an increased development time when accelerating functions into the PL.

This increased timescale stems from the need to create a hardware description language (HDL) block, verify its performance and create SW to drive the new module.

Ideally what we want is a system optimizing compiler which allows us to move functions between the PS and PL seamlessly and with ease.

Introducing SDSoC

SDSoC is such a system optimizing compiler and rather helpfully with the Ultra96 we also get a license for SDSoC.

Using SDSoC we can move functions from the PS to the PL with only an increase in the compile time, although to get the best performance we need to understand a little about logic design and behavior on hardware.

SDSoC enables movement between the PS and the PL thanks to a combination of Vivado HLS and a connectivity framework.

Vivado HLS is called to convert the function from C or C++ into an HDL module for implementation. This module is then mapped into a new Vivado design using the connectivity framework, e.g. by inserting a DMA.

When a function is accelerated into the PL the accelerated SW function is regenerated to reflect the transfer of data and the control of the PL accelerator.

SDSoC Under the hood

To use SDSoC we must have a platform definition for the board we are working with. This platform has two elements:

Hardware definition - This defines the AXI interfaces available for the connectivity framework. With the MPSoC it is best to provide cache-coherent ports if possible, e.g. the Accelerator Coherency Port (ACP), AXI Coherency Extension (ACE) or the High Performance Coherent (HPC) AXI interfaces. This hardware definition also includes the available fabric clocks, interrupts and resets.

Software definition - This defines the SW architecture for the supported bare-metal, FreeRTOS and Linux operating systems. As such, it includes elements such as the FSBL, BIF files, Linux images, BSPs etc.

I should point out that the SW definition does not need to include all of the supported operating systems; bare metal is the minimum requirement.

Introduction to High Level Synthesis

High Level Synthesis (HLS) enables developers to work in higher level languages than VHDL or Verilog. Typically these languages are C, C++ and OpenCL. HLS converts the higher level language into an HDL description that can be synthesized by Vivado. To achieve this, Vivado HLS goes through the following stages when creating the HDL description.

HLS Synthesis Flow

To provide control over HLS optimizations, we can use pragmas within the source code. While there are many pragmas available, some of the most commonly used are:

Dataflow - Allows for optimizations across functions

Pipeline - Defines an initiation interval, which is the target number of clock cycles between the function or loop accepting new inputs

Array Partition - Breaks arrays down into several smaller memories to increase read / write bandwidth

We will explore these concepts in detail when we examine the example source code.

Library Support

Of course, it further helps reduce the development time if we have a range of libraries available which can be accelerated into the PL. Rather helpfully, SDSoC provides us with support for acceleration of the following libraries.

HLS Math Library

HLS IP Library

HLS Linear Algebra Library

HLS Arbitrary Precision Data Types

HLS Video Libraries

HLS DSP Library

HLS Stream Library

HLS SQL Library

What Should We Accelerate?

To get the best from SDSoC we need to transfer large quantities of data to and from the PL using DMA. If we are transferring small segments of data between the PS and PL the data transfer time will dominate and impact the results of the acceleration.

Amdahl's law can be used as a good indication of the acceleration achieved by moving a function from the PS to the PL.

Amdahl's Law

S = 1 / ((1 - alpha) + alpha / p)

Where

S: overall performance improvement

Alpha: fraction of the algorithm that can be sped up with hardware acceleration

1-alpha: fraction of the algorithm that cannot be improved

p: the speedup factor of the accelerated portion

Set Alpha to 0.1 and select a speedup: even with a very large acceleration p, the overall speedup stays close to 1 (it can never exceed about 1.11).

Set Alpha to 0.5 and select the same speedup: the overall improvement is now close to a factor of two.
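
To put numbers on these two cases, Amdahl's law can be sketched in a few lines of C++. The function names are purely illustrative, and the accelerator speedup of 100 used in the note below is an assumed value rather than a measurement from the Ultra96.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Overall speedup S when a fraction `alpha` of the run time is accelerated
// by a factor `p` (Amdahl's law): S = 1 / ((1 - alpha) + alpha / p).
double amdahl_speedup(double alpha, double p)
{
    assert(alpha >= 0.0 && alpha <= 1.0); // alpha is a fraction, not a percentage
    assert(p > 0.0);
    return 1.0 / ((1.0 - alpha) + alpha / p);
}

// Print the two cases discussed in the text for an assumed speedup `p`.
void print_amdahl_examples(double p)
{
    std::printf("alpha=0.1, p=%.0f -> S=%.3f\n", p, amdahl_speedup(0.1, p));
    std::printf("alpha=0.5, p=%.0f -> S=%.3f\n", p, amdahl_speedup(0.5, p));
}
```

Calling print_amdahl_examples(100.0) reports S=1.110 for alpha=0.1 and S=1.980 for alpha=0.5, which is why it pays to accelerate the functions that dominate the run time.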

Getting up and Running

For this example we are going to use the Ultra96 platform provided by Avnet, which you can download here

Obtaining the SDSoC Platform

This platform contains only support for the baremetal OS but it is sufficient to get us going with using SDSoC.

Once we have downloaded the SDSoC platform the next step is to open SDSoC for the first time. When you do this, you will be asked for the location of the SDSoC workspace. It is within the workspace that your projects will be stored.

Selecting the workspace

With the workspace initialized, you will see SDSoC's initial IDE page.

Initial SDSoC View

Before we can use SDSoC the next step is to unzip the SDSoC Platform we downloaded earlier.

Unzipping the platform

You will notice the unzipped SDSoC platform contains two folders, Platforms and Prebuilt.

It is within the platforms folder that we will find the Ultra96 SDSoC Platform.

The next step is to add in the custom platform. We can do this by selecting Xilinx->Add Custom Platform

Adding in the Custom Platform

This will open a dialog which allows us to manage the custom platforms

Custom Platform Management Dialog

Click on the Add Custom Platform button and navigate to the top level of the SDSoC platform for the Ultra96

Selecting the Ultra96 Platform

With the platform added the next step is to create an example project. Open the project creation wizard by clicking on File->New SDx Project

First page of the project creation wizard

On the second page we need to name the project. Remember not to use white space in the name; use underscores instead.

Naming the project

The third page of the creation wizard is where we select the platform. Select the Ultra96 platform, which will be shown in yellow as a custom platform.

Selecting the Ultra96 platform

The penultimate page of the project creation wizard is to define the processing system configuration.

Penultimate page of the project creation wizard

The final page of the project creation wizard is to select the Xilinx Matrix Multiply example project.

Selecting the example project.

At the completion of the wizard you will see the project created and the Application Project Settings tab.

It is this tab which controls the project and what functions if any are accelerated into the programmable logic.

The project is configured to accelerate the mmult_accel function into the PL; this can be seen under the hardware functions tab. Once this has been accelerated into the PL it will be clocked at 100 MHz.

Application Project Settings

With the project created we can now click on the build button (hammer icon) on the menu bar and start a build.

However, if you receive the following error a few minutes after starting the build, you need to make the edits below to your board definition file under Vivado. (You did install the board files, right? If not, find out how here.)

Board_part property not set error

To correct for this error you need to open the board definition file located at

C:\Xilinx\Vivado\2018.2\data\boards\board_files\ultra96\1.2

The error occurs because the board name in the board file is not what the SDSoC platform expects.

Open board.xml and change the name from Ultra96v1 to Ultra96

Incompatible board name in board.xml

Corrected board name

Correcting this file will then enable the project to build and the boot.bin file to be created for testing on the Ultra96.

Code Deep Dive

Before we run the example application on the hardware, I think we should examine the optimizations made in the accelerated code.

void mmult_accel(float A[N*N], float B[N*N], float C[N*N])
{
    float _A[N][N], _B[N][N];
#pragma HLS array_partition variable=_A block factor=8 dim=2
#pragma HLS array_partition variable=_B block factor=8 dim=1
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE
            _A[i][j] = A[i * N + j];
            _B[i][j] = B[i * N + j];
        }
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE
            float result = 0;
            for (int k = 0; k < N; k++) {
                float term = _A[i][k] * _B[k][j];
                result += term;
            }
            C[i * N + j] = result;
        }
    }
}

This example uses two of the three most commonly used HLS pragmas: ARRAY_PARTITION and PIPELINE.

In HLS, arrays are often implemented as block RAMs (BRAMs). These can be a bottleneck, as each has a maximum of two ports. ARRAY_PARTITION fractures the array so that it is contained in several BRAMs, allowing parallel accesses and so increasing performance.

ARRAY_PARTITION, different fracturing patterns (source Xilinx UG902)

In the code example

#pragma HLS array_partition variable=_A block factor=8 dim=2

This applies the block fracture to the variable _A (a 32 by 32 matrix), creating eight BRAMs in place of one, as defined by the factor parameter. The dim parameter is used with multi-dimensional arrays to define which dimension is fractured. In this case, as dim=2, the second dimension is fractured.
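
To make the fracturing patterns concrete, the sketch below models which bank a given index lands in for block and cyclic partitioning. It is only a host-side model of the index mapping described in UG902, not something the tools generate, and the struct and function names are hypothetical.

```cpp
#include <cassert>

// Where one element ends up after partitioning: which bank (BRAM) holds it
// and the offset inside that bank.
struct BankLocation {
    int bank;
    int offset;
};

// Block partitioning: consecutive indices stay together in the same bank.
BankLocation block_partition(int index, int size, int factor)
{
    assert(factor > 0 && size % factor == 0); // sketch assumes an even split
    assert(index >= 0 && index < size);
    const int bank_size = size / factor;
    return { index / bank_size, index % bank_size };
}

// Cyclic partitioning: consecutive indices are interleaved across the banks.
BankLocation cyclic_partition(int index, int size, int factor)
{
    assert(factor > 0 && size % factor == 0);
    assert(index >= 0 && index < size);
    return { index % factor, index / factor };
}
```

With factor=8 on a dimension of 32 elements, block partitioning places indices 0 to 3 in bank 0, 4 to 7 in bank 1 and so on, whereas cyclic partitioning would place index 0 in bank 0, index 1 in bank 1 and wrap around after index 7.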

To optimize the performance of the for loops we can use the PIPELINE pragma

#pragma HLS PIPELINE

The PIPELINE pragma enables operations to occur concurrently, without the need for the entire processing chain to complete before the next operation starts.

Pipelining (source Xilinx UG902)

Applying the PIPELINE pragma to a loop also unrolls any loops nested inside it, as happens to the inner k loop in the example. Unrolling loops trades area for performance, so care needs to be taken. One rule of thumb is to initially unroll the innermost loops.

While the latency of a single iteration is the same when the PIPELINE pragma is used, the throughput is significantly increased.
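
A rough way to see the throughput gain is to count clock cycles. The sketch below assumes a simple model in which every iteration takes a fixed latency and, once pipelined, a new iteration starts every II (initiation interval) cycles. The function names and the numbers in the note are illustrative only; real figures come from the Vivado HLS synthesis report.

```cpp
#include <cassert>

// Cycles needed when iterations run strictly one after another, each taking
// `latency` cycles from start to finish.
long sequential_cycles(long iterations, long latency)
{
    assert(iterations > 0 && latency > 0);
    return iterations * latency;
}

// Cycles needed when the loop is pipelined: the first result still takes
// `latency` cycles, then a new iteration starts every `ii` cycles.
long pipelined_cycles(long iterations, long latency, long ii)
{
    assert(iterations > 0 && latency > 0 && ii > 0);
    return latency + ii * (iterations - 1);
}
```

For example, the 1024 iterations needed to copy a 32 by 32 matrix, with an assumed latency of 3 cycles, take 3072 cycles sequentially but 1026 cycles when pipelined with II=1.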

For larger designs with multiple functions, we can use the DATAFLOW pragma to optimize across functions.

DataFlow (source Xilinx UG902)

The difference between PIPELINE and DATAFLOW is that DATAFLOW is a coarse grain approach which works on functions, while PIPELINE is a fine grain approach working on the operators within a function.
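
As a minimal sketch of how the two combine, the hypothetical top-level function below splits its work into three stages, uses PIPELINE inside each stage and DATAFLOW across them. The function names, the buffer length and the scaling operation are all made up for illustration. A normal host compiler ignores the HLS pragmas, so the same code can be tested in software first.

```cpp
#include <cassert>

const int LEN = 64; // hypothetical buffer length for this sketch

// Stage 1: read the input into a local buffer.
static void load(const int in[LEN], int buf[LEN])
{
    for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE
        buf[i] = in[i];
    }
}

// Stage 2: process each element (here a simple multiply).
static void scale(const int buf_in[LEN], int buf_out[LEN], int gain)
{
    for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE
        buf_out[i] = buf_in[i] * gain;
    }
}

// Stage 3: write the results back out.
static void store(const int buf[LEN], int out[LEN])
{
    for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE
        out[i] = buf[i];
    }
}

// Top level: DATAFLOW asks HLS to let the three stages overlap in time
// rather than run strictly one after another.
void scale_top(const int in[LEN], int out[LEN], int gain)
{
#pragma HLS DATAFLOW
    int stage1[LEN];
    int stage2[LEN];
    load(in, stage1);
    scale(stage1, stage2, gain);
    store(stage2, out);
}
```

Each intermediate buffer has exactly one producer and one consumer, which is the pattern DATAFLOW is designed around.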

Running the Example

Now that we understand what is occurring and the optimizations that have been made, we can run the application on the hardware. To verify this is working as required we need access to a UART port, hence the need for the JTAG UART USB converter.

Within SDSoC we need to select the debugger and configure a new debug environment.

Configuring the SDx Application Debugger

When we run this on the hardware we should see a significant increase in performance from the accelerated function. This will be reported over the UART; the baud rate is 115200 and we can connect using the SDx Terminal.

Results from the example
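
When we move on to our own functions, it is worth checking the accelerated output against a plain software version as well as comparing run times. The sketch below is one hedged way to do that on the host. The names mmult_golden, results_match and check_against_golden, the test pattern and the tolerance are assumptions for illustration, and this is not the test bench shipped with the Xilinx example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

const int N = 32; // matrix dimension used in the example (32 by 32)

// Plain software reference: C = A * B for row-major N x N matrices.
void mmult_golden(float *A, float *B, float *C)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float result = 0;
            for (int k = 0; k < N; k++) {
                result += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = result;
        }
    }
}

// True when every element of `actual` is within `tol` of `expected`.
bool results_match(const float *expected, const float *actual, int count, float tol)
{
    assert(count >= 0 && tol >= 0.0f);
    for (int i = 0; i < count; i++) {
        if (std::fabs(expected[i] - actual[i]) > tol) {
            return false;
        }
    }
    return true;
}

// Signature matching mmult_accel so it can be passed in directly.
typedef void (*mmult_fn)(float *, float *, float *);

// Fill A and B with a simple pattern, run `candidate` and compare its
// output with the software reference.
bool check_against_golden(mmult_fn candidate)
{
    std::vector<float> A(N * N), B(N * N), expected(N * N), actual(N * N);
    for (int i = 0; i < N * N; i++) {
        A[i] = float(i % 7);
        B[i] = float(i % 5);
    }
    mmult_golden(A.data(), B.data(), expected.data());
    candidate(A.data(), B.data(), actual.data());
    return results_match(expected.data(), actual.data(), N * N, 1e-3f);
}
```

Passing mmult_accel to check_against_golden then gives a simple pass or fail result that can be printed over the UART alongside the timing figures.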

Now that we understand the flow we will want to develop our own applications and identify which elements to accelerate and how to fine tune the performance of the accelerated function.

Doing this requires a little more understanding of SDSoC's capabilities, including:

Profiling - This identifies where time is spent in the execution of the program in the PS. The profiler shows both function inclusive and exclusive execution time. We can use the profiler to identify functions which should be considered for acceleration.

Tracing - This inserts dedicated hardware into the PL to monitor behavior, enabling us to examine how much time is spent in each aspect of the execution. This provides a detailed understanding of the system during execution.

Profiling

To open the TCF profiler click on Window > Show View > Other > Debug > TCF Profiler.

Opening the TCF Profiler

Once we have opened the profiler we need to run the debug session on the hardware. Before we run the debug application, start the TCF profiler by clicking on the run button (circled in red below). Ensure you check the Enable Stack Tracing option.

Starting the Profiler

Once your application completes you will see the TCF profiler populates with information on the execution time of each function.

TCF Profiler Results

At this point I should explain what the terms inclusive and exclusive mean:

Exclusive: The amount of execution time spent in the function alone.

Inclusive: The amount of execution time spent in the function and all of its sub-function calls.
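
The relationship between the two can be sketched with a small call tree: a function's inclusive time is its own exclusive time plus the inclusive time of everything it calls. The struct and function names below are hypothetical, and this is not how the TCF profiler itself stores its data.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One function in a call tree, with the time spent in its own body.
struct ProfiledFunction {
    std::string name;
    double exclusive_ms;                   // time in this function alone
    std::vector<ProfiledFunction> callees; // functions it calls
};

// Inclusive time: the function's own time plus that of all its callees,
// gathered recursively down the call tree.
double inclusive_ms(const ProfiledFunction &fn)
{
    assert(fn.exclusive_ms >= 0.0);
    double total = fn.exclusive_ms;
    for (const ProfiledFunction &callee : fn.callees) {
        total += inclusive_ms(callee);
    }
    return total;
}
```

A function with a small exclusive time but a large inclusive time is mostly waiting on its callees, so it is the callees that are the candidates for acceleration.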

Tracing

Once we have accelerated a function into hardware we want to be able to ensure the system is functioning optimally; this is where tracing comes in. Tracing inserts monitors in the PL and enables us to see the PS, PL and data transfers.

We enable tracing by checking the enable event tracing option when we generate the SDSoC project.

Enabling Event Tracing

Once the build completes and we debug on the hardware, we also need to set the option in the debug configuration to say we wish to trace the design.

Enabling tracing in the debug window

Once the tracing has run to completion you will see another project under the project explorer with the project name and traces appended.

Double click on the AXI trace set up and you will see the graph below, which shows the tracing results.

Tracing results

In the above example Orange is SW, Green is Accelerator in the PL and Blue is Data Transfer.

Conclusion

Now that we understand how to install, configure, test, profile, and trace applications in SDSoC, you can get started on your own project with your Ultra96.

My one day SDSoC Course from FPGA Forum in June 2018

You can find the files associated with this project here:

https://github.com/ATaylorCEngFIET/Hackster

See previous projects here.

More on Xilinx FPGA development weekly at MicroZed Chronicles.