A short time ago we looked at the Ultra96 board. One of the great use cases for the Ultra96 is implementing machine learning at the edge.

This is mainly due to the use of the Zynq MPSoC which provides both Arm processor cores and programmable logic. The heterogeneous device enables us to correctly architect a solution which achieves the balance between power and performance, crucial parameters for edge-based applications.

The Ultra96 is a good platform for edge-based machine learning.

We can achieve this balance between power and performance as we use the programmable logic to accelerate the machine learning function. This results in increased performance over a purely PS-based solution, and if implemented correctly, this will offer significant power advantages too.

To demonstrate the acceleration capabilities, Xilinx provided a reference design based on a Binary Neural Network (BNN) for the Ultra96 and the ZCU102.

As this is a good place to start if we wish to develop our own machine learning application, I thought it would be a good idea to look at how we get this demo up and running on the Ultra96.

However, before I do that, I should explain what a Binary Neural Network is. When we implement neural networks in our designs, especially convolutional neural networks, as is the case when we work with 2D inputs such as images, we have several interconnected layers. These layers consist of arrangements of the following:

Convolutional layers: A convolutional layer applies a sliding filter kernel to the input, creating an activation map. One of the reasons it is called an activation map is that the output of the filter kernel will be numerically high when the input contains a specific feature of interest.

Max Pooling: This layer subsamples its input, reducing the size of the 2D array being worked on.

Fully Connected Layers: This is a matrix multiplication layer which applies both weights and a bias to the input. It is called a fully connected layer because every input is connected to every output.
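To make the three layer types above a little more concrete, here is a minimal sketch in plain Python. It is purely illustrative and is not taken from the Xilinx design: the function names, the tiny 5x5 image, the edge kernel, and the weights are all assumptions chosen so the numbers are easy to follow.

```python
# Minimal pure-Python sketches of the three layer types (illustrative only,
# not the Xilinx implementation).

def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1); return the activation map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(fmap, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


def fully_connected(inputs, weights, biases):
    """Each output is a weighted sum of every input plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to a dark-to-bright vertical edge.
edge_kernel = [[-1, 1],
               [-1, 1]]

activation = conv2d(image, edge_kernel)      # 4x4 activation map
pooled = max_pool2d(activation)              # 2x2 after subsampling
flat = [v for row in pooled for v in row]    # flatten for the dense layer
scores = fully_connected(flat, [[1, 0, 0, 0], [0, 1, 0, 0]], [0, 0])
```

The activation map is only non-zero in the column where the kernel sits over the edge, which is the "numerically high when the feature is present" behaviour described above. Pooling then halves each dimension while keeping that peak.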

The Xilinx example contains six convolutional layers, two max pooling layers, and three fully connected layers.

Of course, to get our neural network to correctly perform its task, it first requires training. Training is how we calculate the weights and biases needed to configure each of the layers.

Training a network to a high level of accuracy requires both time and large data sets of images. For this example, 50,000 images and 8 hours of Amazon Web Services time (p2.xlarge instance) were required.

Typically, when we develop and train these neural networks, we are working with floating point numbers on a GPU, high performance PC, or cloud-based acceleration.

When we want to accelerate neural networks in programmable logic, fixed-point implementations, e.g. 8-bit fixed point, are easier to implement and can be made to offer similar performance for lower power. As many machine learning applications are edge-based, power efficiency is critical.
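As a rough illustration of what moving from floating point to 8-bit fixed point involves, the sketch below quantizes a handful of made-up weights. The helper names, the choice of 6 fractional bits, and the weight values are assumptions for the example, not details of the Xilinx design.

```python
def to_fixed8(value, frac_bits=6):
    """Quantize a float to a signed 8-bit fixed-point integer (assumed Q1.6 format)."""
    scaled = round(value * (1 << frac_bits))
    return max(-128, min(127, scaled))   # saturate to the signed 8-bit range


def from_fixed8(q, frac_bits=6):
    """Convert the 8-bit fixed-point integer back to a float."""
    return q / (1 << frac_bits)


# Hypothetical floating-point weights from a trained network.
weights = [0.75, -0.3, 1.2, -1.9]
q_weights = [to_fixed8(w) for w in weights]
recovered = [from_fixed8(q) for q in q_weights]

# Worst-case rounding error is half of one step, i.e. 1 / 2**(frac_bits + 1).
max_error = max(abs(r - w) for r, w in zip(recovered, weights))
```

Each weight now fits in a single byte and the multiply-accumulate becomes integer arithmetic, which maps far more cheaply onto programmable logic than a floating-point unit. The cost is the small rounding error captured in `max_error`.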

Binary Neural Networks are networks that take this a step further and set the weights and biases to either -1 or 1.
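One reason this suits programmable logic so well is that a dot product of -1/+1 values can be computed with an XNOR followed by a bit count, with no multipliers at all. The sketch below shows that general technique. I have not confirmed here that the Xilinx design computes it in exactly this way, and the function names and sample values are assumptions for illustration.

```python
def binarize(values):
    """Map each real value to +1 or -1 by its sign (zero maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Encode +1 as bit 1 and -1 as bit 0, packed into one integer."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors stored as packed bits."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then popcount
    return 2 * matches - n   # each match adds +1, each mismatch adds -1


# Hypothetical real-valued weights and activations.
weights = [0.4, -1.3, 0.05, -0.2]
activations = [1.0, -0.7, -0.6, -0.9]

w_bin = binarize(weights)          # [1, -1, 1, -1]
a_bin = binarize(activations)      # [1, -1, -1, -1]

reference = sum(w * a for w, a in zip(w_bin, a_bin))
fast = xnor_popcount_dot(pack_bits(w_bin), pack_bits(a_bin), len(w_bin))
```

Both `reference` and `fast` give the same result, but the second form needs only bitwise logic and a counter, which is why a binarized layer is so cheap in logic resources.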

Naturally, as the precision of the weights and biases is reduced, so too is the accuracy of the result. However, a binary network can be retrained and the network size can be increased to recover the lost accuracy.

One example which we can get up and running quickly and easily on the Ultra96 board is accelerated image classification (AIC) using a BNN. Getting the example up and running is very simple, as we can download the SD card image for the Ultra96 and ZCU102 boards here.

To get it up and running, all we need to do is write the image to an SD card and then select the appropriate bin and image files on the SD card for the target board. We also need a monitor that supports DisplayPort.

Once the example is up and running on our Ultra96, we can select from the desktop which of the four examples we wish to run. The choices are hardware (HW) or software (SW) implementations, each at either 4K or HD resolution.

In each of these examples, we get the choice of running from a web camera or from an internal database. Once you have made this selection, you can run the example and see it classify images just like in the video below.

As you can see from the video, the performance difference between running the BNN on the programmable logic compared to just running on the Arm cores is considerable. The accelerated version is able to process 69 images per second compared to two tiles per second in software, quite a significant acceleration.

Having watched the example run, should you want to look more in-depth at how this performance is achieved, you can download all of the source files necessary to rebuild the example application here, along with full instructions.

Regenerating the design in Vivado, you will notice that the BNN acceleration is implemented using HLS and connected rather simply to the Zynq MPSoC processing system (PS), which contains the Arm cores, using only two AXI interfaces: one PS master and one PS slave.

As would be expected, the BNN itself has been implemented using High Level Synthesis. HLS is a perfect choice for this, as it allowed the developers to achieve the desired acceleration while working in a higher level language, one more suited to describing neural networks than RTL.

Ultra96 block design when the code is rebuilt.

Of course, if we can recreate the project from source, the logical next question is: can we update or modify the weights and biases? The answer is yes. It is possible to update the network weights and biases, and therefore the design, by changing the files located under:

/aic/src_hw/bnn-pynq-mpsoc/params/

However, that does mean we need to be able to train the network first to generate the new weight and bias values. This will be the subject of a Hackster project soon.

See My FPGA / SoC Projects: Adam Taylor on Hackster.io

Get the Code: ATaylorCEngFIET (Adam Taylor)

Additional Information on Xilinx FPGA / SoC Development can be found weekly on MicroZed Chronicles.