By Adam Taylor

We can create very responsive design solutions using Xilinx Zynq SoC or Zynq UltraScale+ MPSoC devices, which enble us to architect systems that exploit the advantages provided by both the PS (processor system) and the PL (programmable logic) in these devices. When we work with logic designs in the PL, we can optimize the performance of design techniques like pipelining and other UltraFast design methods. We can see the results of our optimization techniques using simulation and Vivado implementation results.

When it comes to optimizing the software, which runs on acceleration cores instantiated in the PS, things may appear a little more opaque. However, things are not what they might appear. We can gather statistics on our accelerated code with ease using the performance analysis capabilities built into XSDK. Using performance analysis, we can examine the performance of the software we have running on the acceleration cores and we can monitor AXI performance within the PL to ensure that the software design is optimized for the application at hand.

Using performance analysis, we can examine several aspects of our running code:

CPU Utilization – Percentage of non-idling CPU clock cycles

CPU Instructions Per Cycle – Estimated number of executed instructions per cycle

L1 Cache Data Miss Rate % – L1 data-cache miss rate

L1 Cache Access Per msec – Number of L1 data-cache accesses

CPU Write Instructions Stall per cycle – Estimated number of stall cycles per instruction

CPU Read Instructions Stall per cycle – Estimated number of stall cycles per instruction

For those who may not be familiar with the concept, a stall occurs when the cache does not contain the requested data, which must then be fetched from main memory. While the data is fetched, the core can continue to process different instructions using out-of-order (OOO) execution, however the processor will eventually run out of independent instructions. It will have to wait for the information it needs. This is called a stall.

We can gather these stall statistics thanks to the Performance Monitor Unit (PMU) contained within each of the Zynq UltraScale+ MPSoC’s CPUs. The PMU provides six profile counters, which are configured by and post processed by XSDK to generate the statistics above.

If we want to use the performance monitor within SDK, we need to work with a debug build and then open the Performance Monitor Perspective within XSDK. If we have not done so before, we can open the perspective as shown below:

Opening the Performance Analysis Perspective

With the performance analysis perspective open, we can debug the application as normal. However, before we click on the run icon (the debugger should be set to stop at main, as default), we need to start the performance monitor. To do that, right click on the “System Debugger on Local” symbol within the performance monitor window and click start.

Starting the Performance Analysis

Then, once we execute the program, the statistics will be gathered and we can analyse them within XDSK to determine the best optimizations for our code.

To demonstrate how we can use this technique to deliver a more optimized system, I have created a design that runs on the ZedBoard and performs AES256 Encryption on 1024 packets of information. When this code was run the ZedBoard the following execution statistics were collected:

Performance Graphs

Performance Counters

So far, these performance statistics only look at code executing on the PS itself. Next time, we will look at how we can use the AXI Performance Monitor with XSDK. If we wish to do this, we need to first instrument the design in Vivado.

Code is available on Github as always.

If you want E book or hardback versions of previous MicroZed chronicle blogs, you can get them below.

First Year E Book here

First Year Hardback here