By Adam Taylor

We last left this example with the AES encryption algorithm running on the PL (programmable logic) side of the device and taking 36010 processor clock cycles. That was the initial result after using SDSoC in a sort of “blind” way with no optimization, which gave a 2% performance improvement over the 36662 cycles needed to execute the AES algorithm in software running on the Zynq SoC’s ARM Cortex-A9 MPCore processor. In this blog, we will use SDSoC’s optimization commands and a few other tricks to significantly cut the number of clock cycles required to perform the encryption.

Accelerating the AES algorithm is slightly more complicated than accelerating the matrix multiplication algorithm we looked at previously. This is because the stages in the main loop of the AES algorithm are interdependent. That is, the result from one function must be computed before the next function can run.

The strategy I undertook for accelerating the AES algorithm is as follows:

Examine the loops to see where I could unroll them

Optimize the memory bandwidth

Select the correct frequency for the data motion clock

Select the correct frequency for the hardware functions

I should mention that I am using the latest version of SDSoC, 2015.2. It differs slightly from the version we were using previously: the new version introduces a configurable project tab through which we can easily select the function we wish to accelerate and the clocks used for moving the data and for the hardware functions.

As discussed in the previous blog post, the main loop of the AES encryption function calls a separate function for each AES step. Unlike the loops in the previous matrix multiplication example, these stages are interdependent: each function must complete and produce its result before the next function can run. This interdependency requires a different approach to acceleration than the one we used in the previous pipelining example. To get the best performance for the AES algorithm, we must focus our efforts on the individual AES step functions, where there is plenty of potential for optimization. There is also some data-flow pipelining available, which we will look at in another blog.
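To make the dependency concrete, here is a minimal sketch of two AES steps chained together. It is not the code from this example project: the function names and the `state[row][column]` layout are assumptions made for illustration. The point is only that `mix_columns` reads exactly the bytes that `shift_rows` has just written, so the second step cannot start on a block until the first has finished with it.

```c
#include <stdint.h>

/* ShiftRows: rotate row r of the 4x4 state left by r byte positions. */
void shift_rows(uint8_t s[4][4])
{
    for (int r = 1; r < 4; r++) {
        uint8_t tmp[4];
        for (int c = 0; c < 4; c++)
            tmp[c] = s[r][(c + r) % 4];
        for (int c = 0; c < 4; c++)
            s[r][c] = tmp[c];
    }
}

/* Multiply by 2 in GF(2^8) using the AES reduction polynomial 0x11B. */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}

/* MixColumns: multiply each column by the fixed AES matrix (2 3 1 1 ...). */
void mix_columns(uint8_t s[4][4])
{
    for (int c = 0; c < 4; c++) {
        uint8_t a0 = s[0][c], a1 = s[1][c], a2 = s[2][c], a3 = s[3][c];
        s[0][c] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
        s[1][c] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
        s[2][c] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
        s[3][c] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
    }
}

/* Two dependent steps: mix_columns consumes the output of shift_rows. */
void shift_then_mix(uint8_t s[4][4])
{
    shift_rows(s);
    mix_columns(s);
}
```

Because the steps form a chain like this, the gains have to come from making each step faster internally rather than from running the steps of one block side by side.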

Several AES functions (add round key, substitute bytes, and mix columns) can be pipelined for increased performance. Within these functions, we apply the HLS PIPELINE pragma inside the first (outer) loop and unroll the inner loop. Several of these functions read from lookup tables that are normally built from BRAM (Block RAM), and the memory bandwidth to these tables needs to be increased. For this example, I therefore specified the pragma parameter “complete,” which implements the memory contents as discrete registers rather than as a BRAM.
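The pragma placement described above might look roughly like the following sketch. It is an illustration, not the code from the GitHub project: the function names, the 4x4 state layout, and passing the lookup table in as a parameter are all assumptions. I have also assumed that the “complete” parameter is applied through the Vivado HLS ARRAY_PARTITION pragma. A normal C compiler ignores the HLS pragmas, so the functions still run unchanged in software.

```c
#include <stdint.h>

#define N 4  /* the AES state is a 4x4 byte matrix */

/* AddRoundKey: XOR every state byte with the matching round-key byte. */
void add_round_key(uint8_t state[N][N], const uint8_t key[N][N])
{
    for (int r = 0; r < N; r++) {
#pragma HLS PIPELINE II=1
        /* Pipeline the outer loop; the inner loop is fully unrolled. */
        for (int c = 0; c < N; c++) {
#pragma HLS UNROLL
            state[r][c] ^= key[r][c];
        }
    }
}

/* SubBytes: replace every state byte through a 256-entry lookup table. */
void sub_bytes(uint8_t state[N][N], const uint8_t table[256])
{
    /* Assumed form of the "complete" setting: split the table into
       individual registers so several reads can happen per cycle. */
#pragma HLS ARRAY_PARTITION variable=table complete dim=1
    for (int r = 0; r < N; r++) {
#pragma HLS PIPELINE II=1
        for (int c = 0; c < N; c++) {
#pragma HLS UNROLL
            state[r][c] = table[state[r][c]];
        }
    }
}
```

The trade-off is area: a completely partitioned 256-byte table consumes registers and multiplexers instead of a BRAM, which is acceptable here because the table is small and the extra read ports are what allow the unrolled loop to be fed.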

The ability to transfer data quickly between the PS (processor system) and the PL is also of key importance in boosting performance. My first step was to set the data motion clock network to its highest possible clock frequency: 200MHz. My second step was to ensure that DMA was used for data transfer between the PS and PL. To do this, I had to rewrite the interface slightly and use the sds_alloc function to ensure that the data was contiguous in memory, as required for a DMA transfer.
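As a rough illustration of the sds_alloc change, the sketch below stages one 16-byte AES block in a buffer obtained from sds_alloc before it would be handed to the hardware function. The helper names (`stage_block`, `release_block`) are hypothetical and are not taken from the project. The `#else` branch is only a host-side stand-in, assumed here so the sketch also builds without the SDSoC tools; plain malloc does not guarantee physically contiguous memory.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#ifdef __SDSCC__
#include "sds_lib.h"              /* sds_alloc / sds_free from SDSoC */
#else
/* Host-only fallback (assumption): NOT physically contiguous. */
#define sds_alloc(size) malloc(size)
#define sds_free(ptr)   free(ptr)
#endif

#define AES_BLOCK_BYTES 16

/* Copy one AES block into a buffer allocated with sds_alloc so that a
   DMA engine can move it between the PS and the PL. Returns NULL if
   the allocation fails. */
uint8_t *stage_block(const uint8_t *src)
{
    uint8_t *buf = (uint8_t *)sds_alloc(AES_BLOCK_BYTES);
    if (buf == NULL)
        return NULL;
    memcpy(buf, src, AES_BLOCK_BYTES);
    return buf;
}

/* Release a buffer previously returned by stage_block. */
void release_block(uint8_t *buf)
{
    sds_free(buf);
}
```

In a real design, the input and output buffers passed to the accelerated function would both be allocated this way and freed with sds_free once the results have been read back.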

My third and final optimization step was to set the hardware functions' clock rate to the highest frequency supported for this application, which was 166.67MHz.

When I finally put all of these together and built the example, the code ran in 16544 processor clock cycles, which is 16544 / 36662 = 45% of the cycles needed when running the AES code in software alone. That’s a massive 55% reduction in execution time for a fairly complex and interdependent algorithm.
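For anyone who wants to check the arithmetic, the percentages follow directly from the cycle counts quoted in this post. The two helper functions below are hypothetical names, written only to show the calculation.

```c
/* Fraction of the baseline cycle count that the accelerated run needs. */
double cycle_ratio(unsigned long accel_cycles, unsigned long base_cycles)
{
    return (double)accel_cycles / (double)base_cycles;
}

/* Reduction in cycles relative to the baseline, as a percentage. */
double reduction_percent(unsigned long accel_cycles, unsigned long base_cycles)
{
    return 100.0 * (1.0 - cycle_ratio(accel_cycles, base_cycles));
}
```

With the 36662-cycle software baseline, the unoptimized 36010-cycle build works out to a reduction of about 1.8%, while the optimized 16544-cycle build gives a ratio of about 0.451 and a reduction of about 54.9%, which is a speed-up of roughly 2.2x.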

The code is available on GitHub, as always.


Please see the previous entries in this MicroZed Chronicles series by Adam Taylor:

Adam Taylor’s MicroZed Chronicles Part 97: SDSoC In depth Example Part 4

Adam Taylor’s MicroZed Chronicles Part 96: SDSoC In-Depth Example Part 3

Adam Taylor’s MicroZed Chronicles Part 95: SDSoC In-Depth Example Part 2

Adam Taylor’s MicroZed Chronicles Part 94: SDSoC In depth Example Part 1

Adam Taylor’s MicroZed Chronicles Part 93: SDSoC Debugging with Linux Part 9

Adam Taylor’s MicroZed Chronicles Part 92: SDSoC Verification & Build Issues Part 8

Adam Taylor’s MicroZed Chronicles Part 91: More on High-Level Synthesis and SDSoC, Part 7

Adam Taylor’s MicroZed Chronicles Part 90: Introduction to High-Level Synthesis and SDSoC, Part 6

Adam Taylor’s MicroZed Chronicles Part 89: SDSoC Optimization, Part 5

Adam Taylor’s MicroZed Chronicles Part 88: SDSoC Part 4—a look under the hood

Adam Taylor’s MicroZed Chronicles Part 87: Getting SDSoC up and running Part 3

Adam Taylor’s MicroZed Chronicles Part 86: Getting SDSoC up and running

Adam Taylor’s MicroZed Chronicles Part 85: SDSoC—the first instalment

Adam Taylor’s MicroZed(ish) Chronicles Part 84: Simple Communication Interfaces Part 4

Adam Taylor’s MicroZed(ish) Chronicles Part 83: Simple Communication Interfaces Part 3

Adam Taylor’s MicroZed(ish) Chronicles Part 82: Simple Communication Interfaces Part 2

Adam Taylor’s MicroZed(ish) Chronicles Part 81: Simple Communication Interfaces

Adam Taylor’s MicroZed Chronicles Part 80: LWIP Stack Configuration

Adam Taylor’s MicroZed Chronicles Part 79: Zynq SoC Ethernet Part III

Adam Taylor’s MicroZed Chronicles Part 78: Zynq SoC Ethernet Part II

Adam Taylor’s MicroZed Chronicles Part 77 – Introducing the Zynq SoC’s Ethernet

Adam Taylor’s MicroZed Chronicles Part 76: Constraints for Relatively Placed Macros

Adam Taylor’s MicroZed Chronicles, Part 75: Placement Constraints – Pblocks

Adam Taylor’s MicroZed Chronicles, Part 74: Physical Constraints

Adam Taylor’s MicroZed Chronicles, Part 73: Working with other Zynq-Based Boards

Adam Taylor’s MicroZed Chronicles, Part 72: Multi-cycle Constraints

Adam Taylor’s MicroZed Chronicles, Part 71: Constraints—Clock Relationships and Avoiding Metastability

Adam Taylor’s MicroZed Chronicles, Part 70: Constraints—Introduction to timing and defining a clock

Adam Taylor’s MicroZed Chronicles Part 69: Zynq SoC Constraints Overview

Adam Taylor’s MicroZed Chronicles Part 68: AXI DMA Part 3, the Software

Adam Taylor’s MicroZed Chronicles Part 67: AXI DMA II

Adam Taylor’s MicroZed Chronicles Part 66: AXI DMA

Adam Taylor’s MicroZed Chronicles Part 65: Profiling Zynq Applications II

Adam Taylor’s MicroZed Chronicles Part 64: Profiling Zynq Applications

Adam Taylor’s MicroZed Chronicles Part 63: Debugging Zynq Applications

Adam Taylor’s MicroZed Chronicles Part 62: Answers to a question on the Zynq XADC

Adam Taylor’s MicroZed Chronicles Part 61: PicoBlaze Part Six

Adam Taylor’s MicroZed Chronicles Part 60: The Zynq and the PicoBlaze Part 5—controlling a CCD

Adam Taylor’s MicroZed Chronicles Part 59: The Zynq and the PicoBlaze Part 4

Adam Taylor’s MicroZed Chronicles Part 58: The Zynq and the PicoBlaze Part 3

Adam Taylor’s MicroZed Chronicles Part 57: The Zynq and the PicoBlaze Part Two

Adam Taylor’s MicroZed Chronicles Part 56: The Zynq and the PicoBlaze

Adam Taylor’s MicroZed Chronicles Part 55: Linux on the Zynq SoC

Adam Taylor’s MicroZed Chronicles Part 54: Peta Linux SDK for the Zynq SoC

Adam Taylor’s MicroZed Chronicles Part 53: Linux and SMP

Adam Taylor’s MicroZed Chronicles Part 52: One year and 151,000 views later. Big, Big Bonus PDF!

Adam Taylor’s MicroZed Chronicles Part 51: Interrupts and AMP

Adam Taylor’s MicroZed Chronicles Part 50: AMP and the Zynq SoC’s OCM (On-Chip Memory)

Adam Taylor’s MicroZed Chronicles Part 49: Using the Zynq SoC’s On-Chip Memory for AMP Communications

Adam Taylor’s MicroZed Chronicles Part 48: Bare-Metal AMP (Asymmetric Multiprocessing)

Adam Taylor’s MicroZed Chronicles Part 47: AMP—Asymmetric Multiprocessing on the Zynq SoC

Adam Taylor’s MicroZed Chronicles Part 46: Using both of the Zynq SoC’s ARM Cortex-A9 Cores

Adam Taylor’s MicroZed Chronicles Part 44: MicroZed Operating Systems—FreeRTOS

Adam Taylor’s MicroZed Chronicles Part 43: XADC Alarms and Interrupts

Adam Taylor’s MicroZed Chronicles Part 42: MicroZed Operating Systems Part 4

Adam Taylor’s MicroZed Chronicles Part 41: MicroZed Operating Systems Part 3

Adam Taylor’s MicroZed Chronicles Part 40: MicroZed Operating Systems Part Two

Adam Taylor’s MicroZed Chronicles Part 39: MicroZed Operating Systems Part One

Adam Taylor’s MicroZed Chronicles Part 38 – Answering a question on Interrupts

Adam Taylor’s MicroZed Chronicles Part 37: Driving Adafruit RGB NeoPixel LED arrays with MicroZed Part 8

Adam Taylor’s MicroZed Chronicles Part 36: Driving Adafruit RGB NeoPixel LED arrays with MicroZed Part 7

Adam Taylor’s MicroZed Chronicles Part 35: Driving Adafruit RGB NeoPixel LED arrays with MicroZed Part 6

Adam Taylor’s MicroZed Chronicles Part 34: Driving Adafruit RGB NeoPixel LED arrays with MicroZed Part 5

Adam Taylor’s MicroZed Chronicles Part 33: Driving Adafruit RGB NeoPixel LED arrays with the Zynq SoC

Adam Taylor’s MicroZed Chronicles Part 32: Driving Adafruit RGB NeoPixel LED arrays

Adam Taylor’s MicroZed Chronicles Part 31: Systems of Modules, Driving RGB NeoPixel LED arrays

Adam Taylor’s MicroZed Chronicles Part 30: The MicroZed I/O Carrier Card

Zynq DMA Part Two – Adam Taylor’s MicroZed Chronicles Part 29

The Zynq PS/PL, Part Eight: Zynq DMA – Adam Taylor’s MicroZed Chronicles Part 28

The Zynq PS/PL, Part Seven: Adam Taylor’s MicroZed Chronicles Part 27

The Zynq PS/PL, Part Six: Adam Taylor’s MicroZed Chronicles Part 26

The Zynq PS/PL, Part Five: Adam Taylor’s MicroZed Chronicles Part 25

The Zynq PS/PL, Part Four: Adam Taylor’s MicroZed Chronicles Part 24

The Zynq PS/PL, Part Three: Adam Taylor’s MicroZed Chronicles Part 23

The Zynq PS/PL, Part Two: Adam Taylor’s MicroZed Chronicles Part 22

The Zynq PS/PL, Part One: Adam Taylor’s MicroZed Chronicles Part 21

Introduction to the Zynq Triple Timer Counter Part Four: Adam Taylor’s MicroZed Chronicles Part 20

Introduction to the Zynq Triple Timer Counter Part Three: Adam Taylor’s MicroZed Chronicles Part 19

Introduction to the Zynq Triple Timer Counter Part Two: Adam Taylor’s MicroZed Chronicles Part 18

Introduction to the Zynq Triple Timer Counter Part One: Adam Taylor’s MicroZed Chronicles Part 17

The Zynq SoC’s Private Watchdog: Adam Taylor’s MicroZed Chronicles Part 16

Implementing the Zynq SoC’s Private Timer: Adam Taylor’s MicroZed Chronicles Part 15

MicroZed Timers, Clocks and Watchdogs: Adam Taylor’s MicroZed Chronicles Part 14

More About MicroZed Interrupts: Adam Taylor’s MicroZed Chronicles Part 13

MicroZed Interrupts: Adam Taylor’s MicroZed Chronicles Part 12

Using the MicroZed Button for Input: Adam Taylor’s MicroZed Chronicles Part 11

Driving the Zynq SoC's GPIO: Adam Taylor’s MicroZed Chronicles Part 10

Meet the Zynq MIO: Adam Taylor’s MicroZed Chronicles Part 9

MicroZed XADC Software: Adam Taylor’s MicroZed Chronicles Part 8

Getting the XADC Running on the MicroZed: Adam Taylor’s MicroZed Chronicles Part 7

A Boot Loader for MicroZed. Adam Taylor’s MicroZed Chronicles, Part 6

Figuring out the MicroZed Boot Loader – Adam Taylor’s MicroZed Chronicles, Part 5

Running your programs on the MicroZed – Adam Taylor’s MicroZed Chronicles, Part 4

Zynq and MicroZed say “Hello World”-- Adam Taylor’s MicroZed Chronicles, Part 3

Adam Taylor’s MicroZed Chronicles: Setting the SW Scene

Bringing up the Avnet MicroZed with Vivado