Safe VSP

I contributed this one-filer to the C64 Demo compo at Datastorm 2013. It ended up in 7th place, which I consider quite good for a technical proof of concept.

One of the tricks you can do on the C64 involves manipulating the video chip into reading the graphics data at an offset from where it's usually located. This allows you to scroll the display horizontally, and the trick is called VSP for Variable Screen Position. However, some machines crash when you attempt this, and the reason for that has always been a mystery. Not anymore.

Some say this forum thread reads like a thriller. Zer0-X managed to capture a VSP crash using a logic analyser and posted 15 MB of data. A year later I started looking into it and discovered the root cause. The proposed workaround is not very practical, but it supports my hypothesis, because people have tried it on their crash-prone machines and so far it hasn't crashed.

A technical explanation appears in the demo as a 10-minute scroller, but the same text is provided below, for your convenience.

I also found this an excellent opportunity to compose a 10-minute SID epic heavily inspired by Martin Galway's Parallax. In particular, I borrowed its musical structure, which I affectionately think of as starter, main course, dessert: something a bit weird, followed by something substantial and straightforward, followed by a sweet melodic part. The three courses are quite distinct, but complement each other. My tune is called Sideways in reference to both Parallax and the VSP trick.

Safe VSP has a csdb page and a pouët page, and was featured on Hacker News.

Technical lowdown: The dreaded VSP crash is caused by a metastability condition in the DRAM. Some have speculated that it has to do with refresh cycles, but hopefully the detailed explanation in this scroller will crush that myth once and for all.

But first, this is what the machine behaves like from a programmer's point of view. Let us call memory locations ending in 7 or f fragile. Sometimes when VSP is performed, several fragile memory cells are randomly corrupted according to the following rule: Each bit in a fragile memory cell might be changed into the corresponding bit of another fragile cell within the same page.

This specific behaviour can be exploited in several ways: One approach is to ensure that every fragile byte in a page is identical. If the page contains code, for instance, corruption is avoided if all the fragile bytes are $ea (nop). Similarly, in font definitions, the bottom line of each character could be blank.

Another technique is to simply avoid all fragile memory locations. The undocumented opcode $80 (nop immediate) can be used to skip them. Data structures can be designed to have gaps in the critical places. This latter technique is used in this demo, including the music player of course. Data that cannot have gaps, i.e. graphics, is continuously restored from safe copies elsewhere in memory. You can use shift lock to disable this repair, and eventually you should see garbage accumulating on the screen. And yet the code will keep running. Thus, for the first time, the VSP crash has been tamed.

Now for the explanation. The C64 accesses memory twice in every clock cycle. Each memory access begins with the LSB of the address (also known as the row address) being placed on an internal bus connected to the DRAM chips. As soon as the row address is stable, the row address strobe (RAS) signal is given. Each DRAM chip now latches the row address into a register, and this register controls a multiplexer which connects the selected memory row to a set of wires called sense lines. Each sense line connects to a single bit of memory. The sense lines have been precharged to a voltage in between logical zero and logical one. The charge stored in the memory cell affects the sense line towards a slightly lower or higher voltage depending on the bit value. A feedback amplifier senses the voltage difference and exaggerates it, so that the sense line reaches the proper voltage representing either zero or one. Because the memory cell is connected (through the multiplexer) to the sense line, the amplified charge will also flow back and refresh the memory cell. Hence, a memory row is refreshed whenever it is opened.

VSP is achieved by triggering a badline condition during idle mode in the visible part of a rasterline. When this happens, the VIC chip gets confused about what memory address to access during the half-cycle following the write to $d011. It sets the internal bus lines to 11111111 in preparation for an idle fetch, but suddenly changes its mind and tries to read from an address with an LSB of 00000111. Now, since electrical lines can't change voltage instantaneously, there is a brief moment of time when each of the changing bits (bit 3 through 7) is neither a valid one nor a valid zero.

But because the VIC chip changes the address at an abnormal time, there is now a risk that the RAS signal, which is generated independently by another part of the VIC chip, is sent while one or more bus lines is within the undefined voltage range. When an undefined voltage is latched into a register, the register enters a metastable state, which means that its output will flicker rapidly between zero and one several times before settling. This has catastrophic consequences for a DRAM: The row multiplexer will connect several different memory rows, one at a time, to the same sense lines. But as soon as some charge has moved from a memory cell to the sense line, the amplifier will pull it all the way to a one or a zero. If, at this point, another memory row is connected, then the charge will travel from the sense line into this other memory cell. In short, one memory cell gets refreshed with the bit value of a different memory cell.

Note that because the bus lines change from $ff to $07, only memory rows with an address ending in three ones are at risk of being opened simultaneously. This explains why corruption can only occur in memory locations ending in 7 or f.

Finally, this phenomenon hinges on the exact timing of the RAS signal at the nanosecond level, and on many machines the critical situation simply doesn't occur. The timing (and thus the probability of a crash) depends on factors such as temperature, VIC revision, parasitic capacitance and resistance of the traces on the motherboard, power supply ripple and interference with other parts of the machine such as the phase of the colour carrier with respect to the dotclock. The latter is assigned randomly at power-on, by the way, which could be the reason why a power-cycle sometimes helps.

This is lft signing off.
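The rule from the scroller text can be played with in a small Python model. This is only a sketch and not code from the demo: the function names, the 50% per-bit corruption probability and the use of a random donor cell are assumptions chosen for illustration, since the real behaviour depends on analogue timing. It shows which row-address bits are in transit when the bus goes from $ff to $07, which rows can therefore be opened together, and why a page whose fragile bytes are all identical survives corruption.

```python
import random

IDLE_ROW = 0xFF  # bus value prepared for the idle fetch
VSP_ROW = 0x07   # row address the VIC suddenly switches to


def is_fragile(addr: int) -> bool:
    """True if the hex address ends in 7 or f (low three bits all ones)."""
    return addr & 0x07 == 0x07


def unstable_bits(before: int, after: int) -> list[int]:
    """Bit numbers of the row address that change during the transition."""
    return [bit for bit in range(8) if ((before ^ after) >> bit) & 1]


def rows_at_risk(before: int, after: int) -> list[int]:
    """Rows that agree with the bus on every bit that stays stable."""
    stable_mask = ~(before ^ after) & 0xFF
    return [row for row in range(0x100)
            if (row & stable_mask) == (before & stable_mask)]


def corrupt_page(memory: bytearray, page: int, rng: random.Random,
                 probability: float = 0.5) -> None:
    """Apply the corruption rule once to one 256-byte page, in place.

    Each bit of each fragile cell may be replaced by the same bit of
    another fragile cell in the page. The probability is hypothetical.
    """
    cells = [(page << 8) | low for low in range(0x100) if is_fragile(low)]
    snapshot = {addr: memory[addr] for addr in cells}
    for victim in cells:
        for bit in range(8):
            if rng.random() < probability:
                donor = rng.choice(cells)
                mask = 1 << bit
                memory[victim] = (memory[victim] & ~mask) | (snapshot[donor] & mask)


if __name__ == "__main__":
    print("bits in transit:", unstable_bits(IDLE_ROW, VSP_ROW))
    print("rows at risk:", [f"${row:02x}" for row in rows_at_risk(IDLE_ROW, VSP_ROW)])

    rng = random.Random(2013)
    memory = bytearray(rng.randrange(256) for _ in range(0x10000))
    # Make every fragile byte in page $10 a nop ($ea), as suggested in the text.
    for low in range(0x100):
        if is_fragile(low):
            memory[0x1000 | low] = 0xEA
    before = bytes(memory)
    corrupt_page(memory, 0x10, rng)
    print("uniform page unchanged:", bytes(memory) == before)
```

In this model, copying a bit between two cells that hold the same value changes nothing, which is the whole point of the "identical fragile bytes" approach. Leaving the fragile cells unused, as the demo does, sidesteps the question of what gets copied where.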

Posted Wednesday 20-Mar-2013 23:23