This is the story of finding a bug in the ZipCPU.

It starts much like any other FPGA story: somewhere else.

Fig 1. Digilent's Arty Board

Some time ago, Digilent replaced the flash chip within their Arty FPGA board. They also created a line of new FPGA boards, so my Arty board has now been rebranded as the Arty A7. I never realized there was a difference until someone wrote to tell me the design didn’t work anymore. With a little bit of digging, he and I discovered that the flash chip had changed. The new flash chip wasn’t just another chip from the same vendor, it was now from a different vendor entirely: from Micron to Spansion.

This broke my old flash controller.

Not a problem, I thought to myself: I had been hoping to write a blog article on how to build a Universal Quad SPI flash controller anyway. This new controller needed to be tested and proven in real hardware. Let me just place this universal controller into my OpenArty design and then all shall be well. Indeed, all shall be better: my universal Quad SPI flash controller runs at twice the speed. This will be a nice upgrade!

But what about the differences between the two flash chips? I can use AutoFPGA to help me select between configurations of this universal flash controller.

Fig 2. Original OpenArty Design

The only problem is that the OpenArty design wasn’t an AutoFPGA design to begin with. Ok, that’s not really a problem, we can update the OpenArty design so that it uses AutoFPGA. That will make it even easier to test my HyperRAM controller as part of my OpenArty project as well.

The next problem is that the original OpenArty design requires an external serial port in addition to the one on the board. This may have kept folks from using the design, and so my newer designs have been multiplexing a console port onto the serial debugging port. This is a fairly simple upgrade, so we can do that too.

Further, when I was struggling to get the flash controller working, I had just had a bad experience with the ZipCPU on the iCE40s that had forced me to change my default linker script(s). After a quick upgrade, AutoFPGA now supports multiple custom linker scripts, but that meant the bootloader needed to change as well.

Other things had changed and needed to be updated as well. For example, the ZipCPU had now been formally verified. I found a lot of bugs in that process some time ago, and so I was excited to be updating the design with the new formally verified CPU.

In the middle of this, I chose to switch to Vivado 2018.3 from Vivado 2016.3.

Is this starting to sound all too familiar?

Years ago, when I studied Software Engineering, we would call each of these a “Small Matter of Programming,” or SMOP. This is said tongue-in-cheek, however, because nothing is ever truly a “small matter of programming.” As an engineer, I also tend to (inappropriately) judge each of these changes to be a small and simple matter.

The result was that, with so many changes, I was threatening to violate the fundamental assumption of all debugging: “There’s only ever one bug in the design.” Sure, I know, that gets violated all the time, but think about it: most debugging methodologies can only handle one bug at a time.

So, today’s blog post is about the “final” bug in this process, and one I certainly wasn’t expecting.

Background

We’ll pick up the story from the point where I was working on my new flash controller. In my mind, this was the only big thing that had changed in this design:

I knew the CPU was ready. That was the easy part.

The AutoFPGA updates, those were a bit harder. I had now switched conventions regarding how I referenced peripherals. All peripherals at fixed addresses now had names starting with an underscore like _flash , _sdram , _clrleds , and so forth. These all had to be changed, and so the OpenArty ZipCPU software all needed to be adjusted.

I also wanted to bring up the Arty with fewer capabilities–just to work through them one by one. This meant a limited AutoFPGA build, but it also broke many of the programs that now depended upon non-existent hardware registers–since they weren’t part of the limited build. This was easily fixed through the per-component #define macros provided by AutoFPGA, but it still required software to be updated to use those macros.

That brought me to the new flash controller. I knew this flash controller was ready, but not yet hardware proven. That was okay, since I was going to use this opportunity to verify my new universal QSPI flash controller worked on real hardware, to finish testing the driver for it and to blog about it next. I just needed to get it working–at 2x the clock and data speeds. I expected this to take a bit of work.

Indeed, the flash controller took a lot of work. I struggled to read and understand the timing of both Xilinx’s IDDR and ODDR modules. I was struggling with things like the flash writing properly, but not reading properly. Or, perhaps worse, it would read mostly properly. This killed my loader, since I would write the ZipCPU program to the flash, and then attempt to read my data back in order to verify it was correctly written. If the reads were only somewhat reliable, then my flash driver would always try to reprogram the flash. This reprogramming happened nearly all the time, even when reloading the same program with the same instructions, indicating that something was broken.

This also left me wondering if the CPU was reading the right program from the flash.

For a while, I went forward with this “broken” flash controller anyway.

I got as far as the demonstration program that is designed to demonstrate the Arty’s 100MHz MII ethernet port. This software handles ARP transactions, while sending pings to a nearby host. It’s also supposed to be able to respond to ping requests. The program was failing: for some reason it was locking up partway through its run.

I figured it must be a problem with my new flash controller. Perhaps the CPU was misreading its instructions from the flash.

While this post isn’t about the bug in the flash controller–that can wait for a later post–it is important that you understand what was going on, because it sets the background for what follows. Basically, I was able to write anything I wanted to the flash without any problems, but reads were unreliable. It might be that reads were 99.99% reliable, and then there would be a bit flipped. I figured this could only be a timing problem, and that I was going to need to write and blog about how to build a synchronization circuit.

In other words, when the network program failed, I had no idea where within the design to look for the problem.

Max Baxter, from Twitter, suggested I change the drive strength on the flash. The Micron flash chip has several control registers; among them are the Extended Volatile configuration register and the Extended Non-Volatile configuration register. Both of these registers can be used to adjust the drive strength in Ohms. (Drive strength? Shouldn’t the units of drive strength be Amps and not Ohms?) The first adjustment I tried worked, and so my flash controller was finally working!

So the network ping program should too, right?

Wrong.

Okay, that was my last working hypothesis. Now what’s wrong?

Symptoms

So let’s start by taking stock of the situation. My goal at this point was to verify that the MII Ethernet controller was fully functional, even after redesigning the top level of the project to use AutoFPGA.

Flash controller: My new flash controller now worked. It could read and write the flash at an SPI clock rate twice what it was before, now equal to the system clock rate. If I wrote a design to the flash at this point, I could read it back at any time to verify that the correct design had been written. There were no longer any differences between the data I wrote to the flash and the data I read back out, so I figured there were no more bugs in the flash controller.

Fig 3. Packet transactions required to get a ping response

% wbregs -m simple_ping.map user_heartbeats

I knew my program had stopped once this number had stopped incrementing. This is also how I can (somewhat) tell what’s going on within my design even while a program is running, without needing to stop it.

Ever been in this situation?

Now let me ask, how would you approach debugging this issue?

Let’s take a quick look at the tools we have available to work with to find this bug.

Often the first method I turn to when I have a bug like this is “Voodoo computing.”

Defn: Voodoo Computing: To change what isn’t broken, in an effort to fix what is.

In good voodoo programming fashion, I disabled the data cache. It didn’t help. I tried playing with the network control port. It didn’t help either. In many ways, most of the on-line forum requests I find, on either Digilent’s or Xilinx’s site, are Voodoo computing requests: “I don’t know what’s going on, so I changed something I didn’t understand, and now my design still doesn’t work.” Sure, “Voodoo computing” feels good, but it rarely leads you any closer to understanding what is going on within the design. On the other hand, your likelihood of success with Voodoo computing is still better than your likelihood of winning the lottery.

The other end of the spectrum of bug-fixing approaches is to guarantee that you never have the bug in the first place. That’s the purpose of both formal verification and simulation. My CPU passed both formal verification and simulation, but was still failing in this instance. Now what? I went back and double checked the proof of the CPU and the network control interface. They still passed. While you might be able to simulate a commercial CPU, it is unlikely that you’ll ever be able to get access to the code or formal properties necessary to formally verify one.

As I mentioned above, the ZipCPU debugger fails in this case. Normally, I’d try to see what’s going on within the debugger. However, a failing debugger doesn’t help much. Most commercial CPUs have a debugger you can use. Sometimes it helps. Often, as in this case, it doesn’t, since few debuggers can examine a program struggling with interrupt and non-interrupt contexts.

Global Variables: My simple_ping program contains a variety of global variables.

unsigned	pkts_received = 0, replies_received = 0,
	arp_requests_received = 0, arp_pkt_count = 0,
	arp_pkt_invalid = 0, arp_missed_ip = 0,
	arp_non_broadcast = 0, ip_pkts_received = 0,
	ip_pkts_invalid = 0, icmp_echo_requests = 0,
	icmp_invalid = 0, ping_reply_address_not_found = 0,
	ping_replies_sent = 0, ping_reply_err = 0,
	user_tx_packets = 0, user_heartbeats = 0;

I do this on purpose. Each of these particular variables is a counter of some type. I can then use them to see if or when things are happening with the CPU or not.

If I compile my simple_ping program with the linker flag, -Wl,-Map=simple_ping.map , the linker will then create a file for me containing the final addresses of all these global variables. Indeed, this is the very reason why they are global and not local variables: so that they will have a fixed address in memory.

If you scroll through this map file, you’ll find lines looking like,

.bss	0x000000000700c490	0x68	obj-zip/simple_ping.o
	0x000000000700c490	gbl_picv
	0x000000000700c494	subbeats
	0x000000000700c498	heartbeats
	0x000000000700c49c	pkt_id
	0x000000000700c4a0	router_mac_addr
	0x000000000700c4a8	ping_rx_count
	0x000000000700c4ac	ping_tx_count
	0x000000000700c4b0	ping_mac_addr
	0x000000000700c4b8	user_heartbeats
	0x000000000700c4bc	user_tx_packets
	0x000000000700c4c0	ping_reply_err
	0x000000000700c4c4	ping_replies_sent

These lines are created as the linker places the simple_ping object file into my executable. At that time, these variables can be found within the BSS segment. Once these values are placed within the final executable, their addresses are then given on the left-hand side, with the variable names on the right.

Originally, I would read values from this list by looking up their addresses and then reading them with my wbregs command. This command implements a basic peek/poke capability within the design. The result might look something like:

% ./wbregs 0x0700c498
0700c498 ( (null)) : [ ....] 00000000

This would tell me that the CPU had never gotten to the heartbeats++ statement, and so I could track down what was taking place, even across interrupts–without needing to disturb the running CPU.

In this case, I knew the CPU was failing. These global variables gave me a good idea of where it was failing. I just had no idea why it was failing.

This approach would work well with any CPU, commercial or otherwise, as long as your design has some form of debugging bus within it.

Fig. 5: Looking back into the past

In the end, it was the wishbone scope, together with formal methods, that finally helped me find the bug.

To do this, I had to modify the trigger for the scope. That’s the next part of this story.

Normally, I have the CPU’s scope trigger off of any internal exception that would cause the CPU to halt unnaturally.

initial	debug_trigger = 1'b0;
always @(posedge i_clk)
	debug_trigger <= (!i_halt)&&(o_break);

This wouldn’t work in this case, because the whole problem was that the CPU hung without properly halting.

So I adjusted my trigger logic. I chose to count the number of clocks since the last valid instruction retired.

Within the ZipCPU, any time a non-memory instruction retires, it sets the alu_pc_valid flag. This tells the CPU that the program counter needs to be updated. If alu_pc_valid isn’t true for many cycles, then we can conclude that the CPU must be hanging.

// Look for a hang of 2048 cycles
reg	[10:0]	dbg_trigger_counter;

always @(posedge i_clk)
if (alu_pc_valid)
	dbg_trigger_counter <= 0;
else
	dbg_trigger_counter <= dbg_trigger_counter + 1'b1;

initial	debug_trigger = 1'b0;
always @(posedge i_clk)
	debug_trigger <= ((!i_halt)&&(o_break))
		||(&dbg_trigger_counter);

That’s the idea.

The problem is that I only want to know when the CPU hangs. What if it hasn’t been started in the first place? What if I stop it in the debugger? What if the CPU is in sleep mode waiting for an interrupt?

I needed to adjust the counter just slightly to avoid these cases.

always @(posedge i_clk)
if (alu_pc_valid)
	dbg_trigger_counter <= 0;
else if ((!i_halt)&&(!sleep))
	dbg_trigger_counter <= dbg_trigger_counter + 1'b1;

Now the scope will trigger after 2048 clocks of the CPU not updating the program counter.

But would this trigger on the bug?

The Trace Result

Yes, the scope triggered! The next time I ran the simple_ping program on the CPU it hung as I hoped and expected. As a result, I could check the trace capability and see that, yes, the scope had triggered and there was a trace waiting for me. (Yes, I’m skipping over writing the software to read from the scope and to decode its results–we’ve already discussed that before.)

Since we are working with text, and not VCD files, every line shows one more clock step of what’s going on. Each line starts with the location within the scope’s buffer, followed by the 32-bit word located there. After that, the line is decoded by a custom, ad-hoc script.

6059 04012512: R1 <- 0x.0012512
6060 0c7b3cd9: R3 <- 0x.07b3cd9
6061 0c7b3cda: R3 <- 0x.07b3cda

We’ll pick it up at line 6059 out of 8191. While it can be annoying to page through thousands of lines of output just looking for the bug, in this case, the bug was at the bottom of the trace. Further, once the CPU halted, all the trace words were identical–making it easy to quickly scan about two pages of information from the bottom of the trace to find the bug.

At line 6059, the trace looks fairly normal. The CPU simply writes to registers R1 , then R3 , and then R3 again.

Then there is a cycle where nothing is written to the registers. This is likely the result of a CMP (compare) instruction. Such an instruction consumes one clock, but doesn’t write to any register values.

6062 5885b440: FLAGS 5885b440 CE[DOA ] V[PDO ] MCE

Fig. 9: Pipeline stages of the ZipCPU

This FLAGS response contains a large bit-field. To help understand it, I’ve shown a model of the ZipCPU pipeline in Fig. 9 on the right.

I’ve tried to break out some of the trace fields in the line above. For example, the V[*] section tells me which pipeline stages have valid entries within them: the prefetch, decode, and read-operands stage. It also tells me that the master CE ( MCE ) signal, the signal that controls whether any instruction transitions from the read operands stage to the execution units, is high. Given that we have valid instructions, and MCE is high, the CPU is running at this trace point. Finally, the trace line shows me which pipeline stages are advancing. These are the ones with their respective *_ce flags high.

As we discussed before, the basic pipeline operation proceeds as follows:

always @(posedge i_clk)
if (clear_pipeline)
	stage_valid <= 1'b0;
else if (stage_ce)
	stage_valid <= prior_stage_valid;
else if (next_stage_ce)
	stage_valid <= 1'b0;

always @(posedge i_clk)
if (stage_ce)
begin
	// Copy data from the previous pipeline
	// stage to this stage, operating on it as
	// necessary
end

If the stage’s CE line is high, the instruction moves from the previous stage into the current stage.

In this case, instructions will be advancing from the Prefetch to the Decode stage, from the Decode stage to the read-Operands stage, and from the read-Operands stage to the ALU stage. That’s the meaning of the cryptic CE[DOA ] field. This is all good: the pipeline is full, and the CPU is operating normally–it just so happens that it is not writing to any register values on this cycle. Again, this is normal.

Following the compare ( CMP ) instruction, there’s a branch instruction. This shows up as taking two clocks. The first one is where the program counter is given a new value. This is the same register write format we discussed above in Fig. 6. The second line shows where the CPU actually sets the program counter and clears the pipeline. This is the same format we showed in Fig. 7 above.

6063 3e801814: PC <- 0x.2801814
6064 41a00605: JMP 0x06801814

At this point, we can finally identify where in the program code we are. By examining the disassembly of our simple_ping program, we find that we are in a tight loop waiting for a network packet to be received, or for the seconds-hand on the real-time clock to advance.

while(((_netp->n_rxcmd & ENET_RXAVAIL)==0)
		&&(_rtc->r_clock == rtc))
	user_heartbeats++;

Here you can see where I set one of those global variables, user_heartbeats . This allowed me to tell, when using the global variable debugging approach, when the program was hung, since this value would stop changing.

Following the conditional branch, the CPU needed to clear its pipeline. This required a couple of clock cycles, which are shown in the trace lines below. As before, the CE[] block shows which pipeline stages (Decode, Operands, or ALU) are accepting data, whereas the Valid block shows which stages actually have valid instructions within them. Hence, we are now watching a new instruction stream move into the pipeline. The first of these new instructions will eventually write to the R2 register, as shown below.

6065 58810000: FLAGS 58810000 CE[D   ] V[    ] MCE
6066 58850000: FLAGS 58850000 CE[D   ] V[P   ] MCE
6067 5885a000: FLAGS 5885a000 CE[DO  ] V[PD  ] MCE
6068 5885b400: FLAGS 5885b400 CE[DOA ] V[PDO ] MCE
6069 087b3cda: R2 <- 0x.07b3cda

The next instruction is a load word instruction. This instruction issues a bus transaction to read from location 0x03800000 in memory.

6070 70e00000: MEM-R[0x.3800000] -> (Not Givn) (GBL)

While the scope hasn’t captured all 32 bits of that address, I can infer the rest from the fact that my design doesn’t have more than 27 bits of address space, and so the number of captured bits is sufficient.

This read polls the network interface, to query if any packets have been read.

The (GBL) flag indicates that this read is from the global address space, rather than the processor’s local peripheral address space.

As you might imagine, reading a value from the bus stalls the processor. While there are waiting instructions in the prefetch, decode, and read operands stages, nothing moves forward until this result comes back–lest the CPU need to undo and re-perform any instructions. (The ZipCPU is not an “out-of-order” machine.)

6071 5884d00c: FLAGS 5884d00c CE[ ] V[PDO ] MCE DCD-STALL MEM-BUSY

While it is waiting, we get another flags trace line. In this case, we can see that the memory unit is busy, MEM-BUSY . DCD-STALL is an indication that there is an instruction in the decode stage of our pipeline that is stalled. We’ll come back to this later.

The memory instruction takes about five clock cycles. During this time, nothing changes within the CPU, and so the wishbone scope control program outputs the line **** **** so we can tell that nothing is changing.

Once the memory read operation is complete, the result is written to R1 .

**** ****
6076 04020000: R1 <- 0x.0020000

So far this all looks good. We haven’t hit the bug yet.

The CPU then issues an AND instruction, to check whether or not the network interface has received any packets. The result of this instruction is zero: there are no pending packets waiting.

6077 04000000: R1 <- 0x.0000000

Based upon the zero flag, we then hit a conditional branch, and so we need to clear and load the pipeline again.

6078 3e80182c: PC <- 0x.280182c
6079 41a0060b: JMP 0x0680182c
6080 58810000: FLAGS 58810000 CE[D   ] V[    ] MCE
6081 58850000: FLAGS 58850000 CE[D   ] V[P   ] MCE
6082 5885a000: FLAGS 5885a000 CE[DO  ] V[PD  ] MCE
6083 5885b020: FLAGS 5885b020 CE[DO M] V[PDO ] MCE

This time, we’re going to read from address 0x05800060 : the real-time clock register.

6084 71600018: MEM-R[0x.5800060] -> (Not Givn) (GBL)
6085 5884d00c: FLAGS 5884d00c CE[    ] V[PDO ] MCE DCD-STALL MEM-BUSY
**** ****
6090 04012512: R1 <- 0x.0012512

The result from the clock, 0x012512 , is in Binary Coded Decimal, or BCD. You might read it as 1:25:12 . It indicates that it has now been one hour, twenty-five minutes, and twelve seconds since this design was initially loaded. (No, I never set the clock to real time–I was more focused on debugging all this time.)

Again, so far this looks normal.

Then something different happens. The trace ends.

6091 5884d000: FLAGS 5884d000 CE[    ] V[PDO ] MCE DCD-STALL
**** ****
8127 d884d000: TRIG FLAGS d884d000 CE[    ] V[PDO ] MCE DCD-STALL <--- TRIGGER
8128 5884d000: FLAGS 5884d000 CE[    ] V[PDO ] MCE DCD-STALL
**** ****
8191 5884d000: FLAGS 5884d000 CE[    ] V[PDO ] MCE DCD-STALL

Notice the trace location values: 6091, 8127, 8128, and then 8191. These are key. 6091 is expected–that’s the next clock cycle. 8127 is unexpected. That means that we just spent 2036 cycles doing nothing.

Wait, I thought we set the CPU to count 2048 cycles before setting the trigger on our scope? How did the count end up at 2036?

It ended up at 2036 because we only counted ALU results, not results from the memory unit. Our last valid ALU result was on cycle 6078, so we were still good here.

Or rather, we’re not good here: the CPU just stopped.

The good news is that we know what instruction we were executing. If you run objdump on our executable, you can see an assembly listing showing the instruction we just completed, and the context around it. To generate a listing, just type,

% zip-objdump -S -D simple_ping > simple_ping.txt

Indeed, I find listings of this type so valuable when debugging, that I have a special make target just to create listings. The target looks like:

%.txt: %
	$(OBJDUMP) -S -D $^ > $@

This will get turned into the command above any time I type

% make simple_ping.txt

Make is smart enough to recognize, when I type this, that if the source code for simple_ping has changed then it needs to be rebuilt before dumping the assembly to simple_ping.txt .

Sadly, I’ve recompiled this routine since capturing my trace. As a result, the assembly below that matches the loop in question doesn’t quite match the trace. Still, this was roughly the instruction we halted on.

7001808: 84 a8 97 88 LW (R5),R0 | MOV R1,R2

The critical piece of this disassembly is that we were trying to execute a two-instruction word, containing both a load word instruction ( LW ) followed by a register copy ( MOV ) instruction. (ZipCPU assembly reads left to right, hence the MOV instruction copies R1 into R2 .)

Fig. 10: ZipCPU ISA Quick Reference Sheet

When I introduced the ZipCPU ISA, I didn’t say much about compressed instructions. Now, however, you need to know a little bit about them to know what’s going on. Fig. 10 at the right shows my quick reference sheet of ZipCPU instructions. The first line shows the basic format of most instructions. The second line shows the format of a compressed instruction. It starts with a 1 in the MSB, followed by fifteen bits containing the first instruction, an unused bit, then fifteen bits containing the second instruction. The opcodes for these compressed instructions are shown in the top half of the third column: SUB , AND , ADD , CMP , LW , SW , LDI , and MOV .

ZipCPU compressed instructions are a little different from other architectures, because you cannot jump halfway into a word. All instruction words, whether compressed or not, are all word aligned. The assembler handles the compression, making certain for example that branch targets never get compressed into the middle of a compressed instruction word.

Further, the ZipCPU disassembler displays both instructions within a compressed instruction word on the same line, using a | to separate them–as shown above.

This will all come into play in a moment.

From the trace we know that we executed the LW instruction, but that the MOV instruction never made it into the execution unit(s).

Why not? What happened?

Moving from trace to bug

When I finally got to this point, I had no idea why the CPU might stall like that.

Yes, it is my CPU. Yes, I did write the code for it. Yes, the stall logic is all my own handiwork. Yes, I did just create this trace. However, when a design gets over 2500 lines, when you wrote it years ago, when it has multiple options it can be built with … it can become difficult to remember all the details. Worse, even if you do remember them, do you know why this happened?

So I turned to formal methods.

Instead of proving that the CPU “worked”, I wanted to prove that the CPU would never stall like this. Hence, I wanted to say that: if there’s an instruction in the read-operands stage, the CPU should always proceed to the execution stage.

So how might we write such an assertion?

Well, if there’s a valid instruction in the read-operands stage, then it should proceed to the execution stage, right?

always @(*)
if (op_valid)
	assert(adf_ce_unconditional | mem_ce);

Here adf_ce_unconditional says the instruction will move forward into the ALU , DIV ide or (not-yet-written) FPU (floating point unit), whereas mem_ce says the instruction will move into the memory unit.

Only, that’s not quite right. We don’t want to move forward if the debugging unit has requested that we halt. In that case, master_ce will be low. We also don’t want to move forward on a reset, or anytime the pipeline is cleared. Further, we don’t want to move to the execution stage if any of the execution units is busy and its result might affect one of our operands.

So, let’s adjust our assertion.

always @(*)
if ((op_valid)&&(master_ce)&&(!clear_pipeline)&&(!i_reset)
		&&(!div_busy)&&(!div_valid)
		&&(!mem_busy)&&(!mem_valid)&&(!bus_err)
		&&(!alu_busy)&&(!alu_pc_valid)
		&&(!fpu_busy)&&(!fpu_valid)&&(!fpu_error))
	assert(adf_ce_unconditional | mem_ce);

This assertion fails.

It fails? Aha! Have I found my bug?

I pulled up the trace to look. This time, it’s a VCD file created by SymbiYosys. No, this assertion hasn’t found the bug. Instead, all I found were examples of other reasons why the instruction shouldn’t move forward.

The execution units shouldn’t accept any more instructions if …

If this is a BREAK instruction, op_break , the pipeline needs to stop just before the instruction gets to the execution unit.

If the CPU has halted on an error, o_break will be true and we’ll need to make sure we are halted.

If we just retired an illegal instruction, we need to halt. Similarly, if the last memory instruction returned a bus error then we need to halt as well.

If there’s an interrupt pending or a trap was just executed, w_switch_to_interrupt will be true and we’ll need to switch to supervisor mode. In this case, the instruction in the read operands stage won’t move forward into any of the execution units either.

The list goes on. In the end, I was surprised at how long it was.

always @(*)
if ((op_valid)&&(master_ce)&&(!clear_pipeline)&&(!i_reset)
		&&(!div_busy)&&(!div_valid)
		&&(!mem_busy)&&(!mem_valid)&&(!bus_err)
		&&(!alu_busy)&&(!alu_pc_valid)
		&&(!fpu_busy)&&(!fpu_valid)&&(!fpu_error)
		&&(!op_break)&&(!o_break)
		&&(!w_switch_to_interrupt)
		&&(!alu_illegal)&&(!ibus_err_flag)
		&&(!ill_err_i)&&(!idiv_err_flag))
	assert(adf_ce_unconditional | mem_ce);

At this point, however, SymbiYosys had found my bug for me.

The problem was specific to the compressed instruction word. In the ZipCPU ISA, two instruction words may optionally be packed into one, and the ZipCPU has no ability to restart any instructions mid-word (currently). I may need to come back and “fix” this later, but the approach was chosen to simplify the instruction decoder: all instruction words start on word boundaries, some instruction words contain two instructions.

This was why the debugger failed: it had to wait for the second part of the compressed instruction word to complete before the CPU would halt, since there is (currently) no way to restart the CPU mid-instruction word.

This also means that I cannot jump into an interrupt mid-word. The internal flag, w_switch_to_interrupt properly captures this logic. The stall signal, however, did not. That meant that any time the interrupt signal went high in the middle of a two-instruction word, the interrupt logic would wait for the second instruction to complete while the stall logic would prevent the second instruction from starting.

The conflicting logic led to a deadlock, causing the CPU to hang.

The fix was simple: add a check to the stall logic so that we only stall if an interrupt is pending and we are on a single-instruction word or the second instruction of a double-instruction word.

Once fixed, my network ping program worked as intended again.

Wasn’t the CPU Formally Verified?

Some time ago, I wrote about all of the bugs I found when formally verifying this CPU. I even went to ORCONF 2018 to discuss all of the lessons I learned while going through this process.

Shouldn’t formal verification have caught this bug?

Yes, I suppose it should have.

Why didn’t it?

Because I never created a property asserting that the CPU would never stall indefinitely.

I’ve now gone back to the proof and added a similar property to ensure that the CPU will never stall in the decode stage either.

This wasn’t the first time I found a bug in my now formally verified CPU. The first time was when I tried setting the program counter from the debugging port. You may recall, I mentioned fixing this bug above. This oversight happened for the same reason: My initial set of formal properties was incomplete.

The CPU now has a set of debugging interface properties, “proving” that the debugger can set and adjust the CPU’s state.

What about simulation?

What about simulation? Could this bug have been caught during simulation? Why wasn’t it?

I suppose this bug could have been caught in simulation.

However, my current network packet simulator implements a loop-back mode only. This isn’t sufficient for testing the complex interaction of ARP requests, ARP replies, and ping requests and replies.

An online friend has recommended I try building a network simulator that integrates into the Linux Kernel, so that other computers can actually interact with my design while it is running in the simulator. This would be a truly awesome idea, and might well have caught this bug. I might just need to update my network simulation for this purpose.

Conclusions

So what are some of the lessons we can learn from this whole exercise?

First, your current bug may not be in the last place you worked. I had just worked on the flash controller. Mentally, my mind was in bug-hunting mode within the flash controller. Instead, the bug was somewhere else within my design–somewhere I least expected it.

Second, the more things you change within a design, the more painful it is to find and fix the bugs within it. In many ways, I set myself up for problems with the sheer number of things I had changed to get here. Understand, just because I presented only one or two of these problems above doesn’t mean that these were the only problems I had.

Third, did you notice how many tools I was using to debug the design? Besides the tools for debugging the CPU listed above, I was also using Wireshark for debugging the network as well.

Finally, debugging a CPU really crosses several disciplines. While this is a “hardware” task, look at what we needed to do in terms of support software along the way: we needed a debugging port to the CPU (S/W+H/W), a scope to gather a trace from within the CPU (H/W), software to interpret that trace (S/W), a disassembly of the failing program (compiler S/W), an understanding of the ZipCPU instruction set (assembly S/W), and finally formal methods to catch the bug in the end.

That’s quite the cross-disciplinary task!

This also means that, if you are interested in learning hardware design, there are several prerequisites before you dive into CPU design. The other way of looking at it is that CPU design is a wonderful field of study where you will be forced to learn all kinds of details of how CPUs work that you would never learn otherwise.