This is part of a series of posts detailing the steps and learning undertaken to design and implement a CPU in VHDL. Previous parts are available here, and I’d recommend they are read before continuing.

In the last part we got to the point where RISC-V code, built with GCC, could run and display text over HDMI and blink LEDs. However, this could only run from the 192KB of Block RAM we initialized within the Spartan7 FPGA on our Digilent Arty S7-50 board. Whilst 192KB is a nice amount of on-FPGA fast storage, we have a 256Mbyte DDR3 chip sitting next to the FPGA which is crying out for use. This post follows on from part 16 and integrates a DDR3 memory controller provided by Xilinx, and using an SD card Pmod adapter, load code from the SD card into that memory for use.

The DDR3 memory chip on the ArtyS7 seems to vary from board to board, but the timing specifications should be compatible for all. For clarity, the chip on my development board is a PMF511816EBR-KADN.

It connects with address and data signals direct to the FPGA. There are many other control signals, but we are not going to get into how DDR3 memory works in this post, as Xilinx Vivado comes with a wizard for generating memory interfaces. The Memory Interface Generator (MIG) seems to be an older utility and I had a few problems with it. However, in the end, I did get a working interface generated.

Memory Interface Generator

Digilents documentation for the ArtyS7 board includes various files for importing into MIG to assist with generation of the memory interface. In MIG itself, you will not see an option for importing settings explicitly – you need to select that you are verifying an existing design. Before the following section walking through the wizard, ensure your Vivado project is located in a local drive – I generally worked on a mapped network drive, and the build will fail to implement the controller at synthesis time if the project is on a network drive. Another item to note – you must go through the IP Catalog. When I tried to I initiate the MIG through the Block Design view that I used for generating BRAM in the previous article, VHDL solutions could not be generated.

We go through the wizard, using the Digilent files to configure the memory controller. Generally it is a case of hitting next though each screen, except from a few pages at the start where you need to input the file paths, and validate the pins – but the validation should succeed with a few warnings.

The next screen was quite something, with the wizard saying the browse buttons for file paths were not supported on Windows 10.

Validating the pins above will only generate a few warnings. Click next through the rest of the wizard and generate the design.

When the implementation of the memory controller has completed, which took minutes on a Ryzen 2700X+SSD, we can see it in the IP pane. At this point, you can open an example project by right clicking the IP in the hierarchy.

The example project can run in the simulator very slowly (initialization took many seconds in the waveform before real simulation begins). You can see the VHDL component definition to use for the controller, and I copied that into the RPU project.

The generated design uses the User Interface(UI) implementation to the memory controller, which preserves things like request ordering. There is a Xilinx document UG586 which details the signals used in this interface, including example waveforms for various read and write request states.

The UI used to utilize the memory controller is good for the use cases that I want: request ordering is preserved, burst reads/writes, writes with byte enable. I originally wanted to use the Native interface – which you can read about in document UG586 if you want to know more – but what I wanted was very basic. The native interface is significantly more involved to use as things like out of order request completion can occur.

In addition to the main data and control busses you’d expect, there are a few other signals which need connected in VHDL for the memory controller to operate. On ArtyS7, we need to feed it both 100 and 200MHz clocks. Additionally, we need to ensure the controller is not used after a reset until the initialization complete signal is asserted from the memory controller internals. One other major item of importance – the controller must be driven off of a ¼ memory interface clock, which for us is 81.25MHz. This is a good bit down on our current 200MHz CPU clock, but for now I just ran everything at this clock frequency, which is provided by the controller itself as an output.

The documentation from Xilinx shows how read and write transactions operate. The interface to the controller supports multiple back to back reads and writes for highest performance, but we only want basic single requests.

Writes follow this pattern:

Reads follow this:

We can implement these easily in our VHDL using the already existing MEM_proc memory request process. We will just add some new states to implement the various stages of a new state machine, and we should be good. The writes can be implemented in one of three ways, as explained in the above waveform diagrams. We are using method 1, asserting the *wdf signals for the write command data at the same time as the initiation of the commands using app_cmd.

As briefly mentioned earlier, we need to handle resetting of the memory controller. We activate the reset signals for the memory controller and CPU core for the same length of time. This is implemented with a decreasing 100MHz counter, initialized to some value in the thousands, with the resets being deactivated at 0. The DDR3 physical interface always initiates a calibration sequence after reset. When this operation is complete, a controller output signal is asserted which we should check for before sending memory requests.

Whilst making the requests and dealing with those states was straightforward, addressing and the resultant data output ordering was far from easy. I knew from MIG documentation that this controller would address 16 bits, so we just need to lop off the least significant bit from our memory request before passing it to the memory controller UI. However, after writing a small test program – which just wrote a known, incrementing pattern to contiguous memory addresses – it was obvious that my understanding of something else was awry. The data seemed to write 32 bits okay, but reading at different address offsets was quite the challenge. The Xilinx documentation does not explain the data organisation for the 128-bit data signals which come out of the controller. I assumed a simple burst read of [addr, addr+1, addr+2] data would present itself, but this didn’t seem to be the case. I memory mapped the internal signals in and out of the Memory Controller, and wrote another test which read and wrote to DDR3 memory, and dumped the 128-bit read buffer as well. As you can see from below, there were some really odd things going on:

I managed to find patterns for all the different read modes we needed, and the naturally aligned addresses for 4, 2 and 1 byte reads and writes. For writes, I used some lookup tables to assist with generating the byte enable signal – the 128bit write data input was byte masked, so that writing values which are not 128 buts in length do not appear to require a read-modify-write operation.

As it stands, I have still not found out why the read operations result in odd looking data from the controller. All things considered, it’s likely I have made an error early on in the implementation of the data swizzling for writes using the controller, which then dominos into the read operations. It’s something I’ll be returning to. This is fine for us just now, but I was intending on using the burst data to fill caches in later work on the CPU, and for that to work efficiently we will need to fully understand how this burst data organization works.

So. 256MB of ram unlocked, and passing fairly basic read/write tests. But the latency is very slow! Much slower than expected. I added some counters to the read state machine, and it seems to take around 22 cycles for a read. At 82MHz this is a long time. However, when searching online, it seems others have also seen this kind of latency, so I did not look into it – and instead tried to decouple the DDR memory clocking from that of the CPU and the rest of the SoC, like block rams.

Clock Domains

A clock domain can be imagined by separating out all aspects of your design into blocks. For each block which runs off a certain clock, it is on that clocks domain. Generally, blocks which synchronize off the same clock, and therefore are in the same domain, can pass signals between them relatively easily.

Currently, we have many clock domains in the SoC, but only 1 in the CPU side of things. The other clock domains are for HDMI output and due to the Block Rams being dual ported with dual clock inputs, we do not need to pass any raw signals across clock domains. For our DDR3 and CPU clocks to be different however, multiple signals would need to cross clock domains.

There are numerous issues that can arise when trying to pass synchronized signals across different clock domains. The simplest issue is when trying to read a fast clock signal from a slow clock – and the slow clock completely missing a strobe of the faster signal.

There are various methods of crossing signals from one clock to another, with some basic techniques starting with using multiple flip flops to latch signals into other domains. In VHDL, this just looks like a process which reads and latches the value from one signal into another on the destination clock.

As we are dealing with single read and write transactions, I decided to cross the CPU to Memory Controller clock domain using state handshaking. We will use two integer signals, which have multiple versions for stability in each clock domain, to implement a state machine process which crosses the clock safely. This can introduce several cycles of additional latency, however at the moment we are trying for a working solution – not a very efficient one. We can give up a few cycles of latency in the CPU clock domain if this means we can run the CPU much faster than the memory controller. The dataflow for a read looks like the following.

And writes:

The way this is implemented is with two processes, one clocked off the Memory Controllers 81.25MHz, the other off the higher CPU clock. The CPU process is actually just the MEM_proc one from before, to make interfacing with the CPUs own memory request logic easier. Each process reads the state of the other using stable signals which are latched into their own clock domain in an attempt to avoid metastability.

When both differently clocked processes are in a known state, I read the memory data required across the clock domain. I am not sure this is strictly safe – some say things such as data buses should be passed through a FIFO.

It took a few attempts to get this right, mainly due to realising that now I could not run the CPU at previously attainable speeds. It seems that just having the DDR3 controller synthesized into the SoC meant that the CPU core could not get over 166MHz. I have settled on 142.8MHz for now (1GHz/7 – easily attainable from my existing clocking system) as the CPU clock – fast enough for most needs but low enough that any timing issues do not arise. Many additional timing warnings appear on synthesis when the DDR3 controller is included, which was unexpected since this controller was generated by the internal tools. Like I mentioned last part, I do intend to look into these warnings and understand them with an aim of resolving them to unlock higher clocks. For now, 142MHz will do fine – and the additional memory space of DDR3 is very welcome!

With the DDR3 memory integrated, I also made quick microSD PMOD adaptor out of a level-shifting microSD adaptor intended for 5v Arduino projects using SPI. The level shifter is not needed for our purposes, as the FPGA already runs on 3.3v logic. No modification of the part was required – just soldering the pins to the 0.1” PMOD headers was enough to work.

The SPI port is exposed externally on Pmod JD. The way the SPI port is accessed is the same as discussed in the previous posts – just different memory mapped addresses. I created a new bootloader for the internal BRAM to perform a small memory test (which validates the endian byte swizzling), then initialize SPI and mount an SD card if present. It will examine a BOOT elf file and copy it into the required physical memory location, which can be in DDR3 or BRAM. It will then jump to that elf entry point to continue execution.

FPGA Utilization

With this all coming together to form a SoC design, I looked into the utilization report, which tells you how many resources your design is using. The report looks as follows:

The DDR3 memory controller takes up the largest amount of resources – not unexpected, they are incredibly complex! I’m quite happy with the current utilization of the RPU core itself. I’ve not been looking into optimization for resources, so having the CPU take up less than 6% of the Spartan S7-50 seems good. Looking at the breakdown of where the utilization of resources comes from, it seems the internal management of memory requests is taking a larger amount of resourced than I’d expect. It will be interesting to see how this changes when the endian swizzle logic is brought into the CPU core, and eventual cache logic added.

That bring this part to a close. I am now looking into implementing the required RISC-V Control and Status Registers (CSRs) into RPU. Currently I have memory-mapped data that should be available though CSR instructions. Adding them will be interesting as it will require changes to the CPU interface. Read about that next time!

As mentioned last time, the code for the DDR3 controller is already up on github. I still need to check that I’ve implemented the read/write data swizzling to the DDR3 correctly – I’ve a feeling there is a mistake in the writes somewhere, which then impacts the reads.

Thanks for reading, and as always, any questions can be directed at myself on twitter @domipheus.