Verilog 6502

Here, with full source code, is a cycle-accurate 6502 microprocessor core in Verilog HDL, which was automatically generated from a transistor-level netlist published by the Visual 6502 project. The 6502 has a two-phase clock, and this core is accurate not only to the cycle but to the phase (half-cycle). The state of internal nodes is faithfully reproduced. All external address, data and control signals including RDY, SO, RES, IRQ, NMI, RW and SYNC are supported. The core runs 10 times faster than a real 6502 and occupies only 8% of the flops and 7% of the LUTs in the Xilinx xc3s500e FPGA on a Spartan 3E Starter Kit.

This project began in 1982 when my friend Alan Fothergill and I learnt the maths of colliding spheres at school. Alan wrote a Sinclair Spectrum pool game for Bug Byte Software Ltd. I tried to write pool for the Commodore PET, which lacked high-resolution graphics, so I made a vector display on my oscilloscope using digital-to-analog converters. I never got vector pool working properly at the time, but still have my original 5¼" floppy disks and recently recovered the data off them using Kryoflux. I thought it would be fun to get my pool code running again on a 6502 core in FPGA.

The 6502 has been studied and reverse-engineered more than any other microprocessor: in 1982, Donald Hanson created a detailed block diagram from original blueprints; more recently, Balázs Beregnyei drew transistor-level schematics by hand from his own die photographs; and the Visual 6502 project extracted vector polygon models of metal and silicon layers from die photographs, and published a transistor-level netlist. Very recently, Eric Schlaepfer designed the MOnSter 6502, a transistor-scale replica, using discrete MOSFETs. This web page is about how I converted the Visual6502 netlist into Verilog.

Verilog generation was a semi-automatic, guided process of sub-circuit recognition and extraction. The definition of sub-circuits and extraction order was manual; however, that done, the process ran without intervention. It was like an audit of the chip, in which every last transistor was accounted for. Repeated sub-circuits only needed defining once and all instances were extracted. Verilog fragments were hand-crafted to implement each class of sub-circuit. Mostly, these are simple assign statements, one line per logic gate. By far the most complex sub-circuit was a bitslice of the internal data paths, of which there are eight instances in the chip.

NMOS to Verilog





assign o[2] = ~i[1];
assign o[5] = ~(i[3] | i[4]);
assign o[8] = ~(i[6] & i[7]);



Xilinx Spartan-3 FPGA storage registers can emulate either D-type flip-flops or transparent latches; however, toolchain support for latches is limited. Data transitions can pass through more than one cascaded latch at a time, making timing closure tricky. For example, zero-page operands pass along the internal data bus (IDB) and internal address bus (ADL) to the address bus latch (ABL) in half a cycle, although consecutive latches usually open on opposite clock phases. Whilst edge-triggered flops are synchronous elements, latches are asynchronous and are best emulated in FPGA using a simple 2-input MUX. Registers are inserted to break combinatorial loops:







assign o[789] = i[456] ? i[123] : i[789];
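As a minimal sketch of that idea, and not an excerpt from the generated core, the fragment below wraps the same MUX expression in a stand-alone module. The names `latch_mux`, `fclk`, `en`, `d` and `q` are hypothetical, and the single fast clock that registers the node on every rising edge is an assumption made for illustration.

```verilog
// Illustrative only: module and signal names are hypothetical.
module latch_mux (
    input  wire fclk, // assumed fast FPGA clock, several ticks per 6502 phase
    input  wire en,   // latch enable: high = transparent, low = hold
    input  wire d,    // data input
    output wire q     // registered copy of the latch node
);
    reg  node = 1'b0;             // register inserted to break the feedback loop
    wire next = en ? d : node;    // same shape as o[789] = i[456] ? i[123] : i[789]
    always @(posedge fclk) node <= next;
    assign q = node;
endmodule

module latch_mux_tb;
    reg  fclk = 1'b0, en = 1'b0, d = 1'b0;
    wire q;
    latch_mux dut (.fclk(fclk), .en(en), .d(d), .q(q));

    task tick; // one fast-clock cycle
        begin #1 fclk = 1'b1; #1 fclk = 1'b0; end
    endtask

    initial begin
        en = 1'b1; d = 1'b1; tick; $display("open: q=%b", q); // open: q=1
        en = 1'b0; d = 1'b0; tick; $display("held: q=%b", q); // held: q=1
    end
endmodule
```

Because the held value is read back from a register rather than from the MUX output itself, the synthesiser sees no combinatorial loop; the cost is one fast-clock tick of delay per emulated latch.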



The auto-generated Verilog is a single `include file containing combinatorial assign statements for logic gates, plus a few instantiations of a combinatorial helper module to cope with data paths. The helper is a custom multiplexer with parameterised input width. The select inputs are "one-or-more-hot" encoded. If there is contention when multiple data inputs are enabled, low inputs win. If no inputs are enabled, the output becomes a capacitive storage node, like the transparent latches. This emulates an NMOS structure comprising uni-directional pass transistors with a common output node:

module DATAPATH_MUX #( parameter N=4 ) (
    output wire out,
    input wire in,
    input wire [N-1:0] s,
    input wire [N-1:0] d);
    assign out = (|s) ? &(d|(~s)) : in;
endmodule
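The three behaviours described above (hold, pass and "low wins") can be checked with a small simulation. This is only a sketch: the helper is repeated so that the example stands alone, and the testbench name `datapath_mux_tb` and its stimulus are hypothetical.

```verilog
// The helper module, repeated from the text so that the example is self-contained.
module DATAPATH_MUX #( parameter N=4 ) (
    output wire out, input wire in,
    input wire [N-1:0] s, input wire [N-1:0] d);
    assign out = (|s) ? &(d|(~s)) : in;
endmodule

// Hypothetical testbench: the name and stimulus are illustrative, not from the core.
module datapath_mux_tb;
    reg        in;
    reg  [1:0] s, d;
    wire       out;
    DATAPATH_MUX #(2) dut (.out(out), .in(in), .s(s), .d(d));

    initial begin
        // No input enabled: the output follows 'in', the stored node value.
        in = 1'b1; s = 2'b00; d = 2'b00; #1 $display("none selected: %b", out); // 1
        // One input enabled: its data bit passes through.
        in = 1'b0; s = 2'b01; d = 2'b01; #1 $display("one selected:  %b", out); // 1
        // Two inputs enabled with different data: the low input wins.
        in = 1'b1; s = 2'b11; d = 2'b01; #1 $display("contention:    %b", out); // 0
    end
endmodule
```

The expression `d|(~s)` forces every unselected input to 1, so the AND reduction is pulled low only by a selected input that is itself low.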

The most complex sub-circuit is a bitslice of the datapath. This includes precharge transistors, bus bars, multiplexers and two bi-directional pass transistors. There are 8 instances of this sub-circuit in the chip. Below, it is encoded as a netlist, followed by the Verilog fragment that implements it. Arbitrary numeric identifiers are assigned to each transistor and node. 100-101 are outputs to the PCL and PCH latches. 200-207 are inputs from IDL, X, Y, A, S, PCL, PCH and the ALU. 300-343 are data path control (dpc) lines. 400-443, 700 and 800-803 are transistors. 500-503 are the special bus (SB), internal data bus (IDB), address low (ADL) bus and address high (ADH) bus respectively:

AddTran(400, 300, 202, 500); // dpc0_YSB
AddTran(402, 302, 201, 500); // dpc2_XSB
AddTran(404, 304, 204, 500); // dpc4_SSB
AddTran(405, 305, 204, 502); // dpc5_SADL
AddTran(419, 319, 207, 500); // dpc19_ADDSB7, dpc20_ADDSB06
AddTran(421, 321, 207, 502); // dpc21_ADDADL
AddTran(424, 324, 203, 500); // dpc24_ACSB
AddTran(425, 325, 500, 501); // dpc25_SBDB
AddTran(426, 326, 203, 501); // dpc26_ACDB
AddTran(427, 327, 500, 503); // dpc27_SBADH
AddTran(428, 328, 503, NODE_vss); // dpc28_0ADH0, dpc29_0ADH17
AddTran(430, 330, 503, 101); // dpc30_ADHPCH
AddTran(431, 331, 206, 101); // dpc31_PCHPCH
AddTran(432, 332, 206, 503); // dpc32_PCHADH
AddTran(433, 333, 206, 501); // dpc33_PCHDB
AddTran(437, 337, 205, 501); // dpc37_PCLDB
AddTran(438, 338, 205, 502); // dpc38_PCLADL
AddTran(439, 339, 205, 100); // dpc39_PCLPCL
AddTran(440, 340, 502, 100); // dpc40_ADLPCL
AddTran(441, 341, 600, 502); // dpc41_DL/ADL
AddTran(442, 342, 600, 503); // dpc42_DL/ADH
AddTran(443, 343, 600, 501); // dpc43_DL/DB
AddTran(700, NODE_cp1, 200, 600);
AddTran(800, NODE_cp2, 500, NODE_vcc);
AddTran(801, NODE_cp2, 501, NODE_vcc);
AddTran(802, NODE_cp2, 502, NODE_vcc);
AddTran(803, NODE_cp2, 503, NODE_vcc);

wire sb_local_500 = i[300]|i[302]|i[304]|i[319]|i[324];
wire idb_local_501 = i[326]|i[333]|i[337]|i[cp1]&i[343]|i[?];
wire adh_local_503 = i[328]|i[332]|i[cp1]&i[342];
wire db2sb_500 = idb_local_501 & i[325];
wire sb2db_501 = sb_local_500 & i[325];
wire adh2sb_500 = adh_local_503 & i[327];
wire sb2adh_503 = sb_local_500 & i[327];
DATAPATH_MUX #(8) mux_sb_500 (.out(o[500]), .in(i[500]),
    .s({i[cp2],i[300],i[302],i[304],i[319],i[324],db2sb_500,adh2sb_500}),
    .d({i[vcc],i[202],i[201],i[204],i[207],i[203],i[501],i[503]}));
DATAPATH_MUX #(7) mux_idb_501 (.out(o[501]), .in(i[501]),
    .s({i[cp2],i[326],i[333],i[337],i[cp1]&i[343],sb2db_501,i[?]}),
    .d({i[vcc],i[203],i[206],i[205],i[200],i[500],i[?]}));
DATAPATH_MUX #(6) mux_adl_502 (.out(o[502]), .in(i[502]),
    .s({i[cp2],i[305],i[321],i[338],i[cp1]&i[341],i[?]}),
    .d({i[vcc],i[204],i[207],i[205],i[200],i[vss]}));
DATAPATH_MUX #(5) mux_adh_503 (.out(o[503]), .in(i[503]),
    .s({i[cp2],i[328],i[332],i[cp1]&i[342],sb2adh_503}),
    .d({i[vcc],i[vss],i[206],i[200],i[500]}));
DATAPATH_MUX #(2) mux_pcl_100 (.out(o[100]), .in(i[100]),
    .s({i[339],i[340]}),
    .d({i[205],i[502]}));
DATAPATH_MUX #(2) mux_pch_101 (.out(o[101]), .in(i[101]),
    .s({i[330],i[331]}),
    .d({i[503],i[206]}));

The helper MUX retains its previous state when no inputs are selected. This emulates capacitive storage, which the 6502 designers exploit to decrement registers. Internal busses are precharged to $FF during Φ2 and then connected to one of the ALU input latches in the immediately following Φ1, during which the nodes are floating. This forces $FF onto the ALU input by swamping it with charge from physically larger nodes. A register (X, Y, SP or PCH) is routed to the other ALU input and decremented by adding $FF (-1).
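The arithmetic behind this trick can be checked in isolation. In 8-bit two's complement, $FF is -1, so adding it and discarding the carry out subtracts one. The sketch below assumes no carry in; the module names `dec_by_ff` and `dec_by_ff_tb` are hypothetical and model only the addition, not the precharge itself.

```verilog
// Illustrative only: models the ALU addition, not the bus precharge.
module dec_by_ff (
    input  wire [7:0] x,   // register value routed to one ALU input
    output wire [7:0] y    // ALU result
);
    wire [7:0] precharged = 8'hFF;  // value left on the bus by the precharge
    assign y = x + precharged;      // 8-bit add, carry out discarded, no carry in assumed
endmodule

module dec_by_ff_tb;
    reg  [7:0] x;
    wire [7:0] y;
    dec_by_ff dut (.x(x), .y(y));

    initial begin
        x = 8'h10; #1 $display("%h", y); // 0f
        x = 8'h00; #1 $display("%h", y); // ff (wraps around)
    end
endmodule
```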

SubGemini

The problem of finding sub-graphs in larger graphs is called Subgraph Isomorphism and is a difficult one to solve. All so-called invariant properties must be exploited: devices have a type, e.g. enhancement or depletion mode MOSFETs in NMOS, P and N-channel in CMOS; nets have an order, which is the number of ports they connect to; and edges have a port type, e.g. input, output, etc. That's all there is to go on. Here is a sub-circuit schematic and its representation within SubGemini, labelled with all the distinguishing features of its edges and vertices:

Notice the net types: internal, external and special. Special nets are used for power supplies and global clocks. External nets are those at the boundaries of a sub-circuit. There is only one internal net (of order 3) in this example. Any matching net in the main circuit will also be of order 3, and will connect to the drains of two enhancement mode MOSFETs and to the source of a third. We could easily make a shortlist of all such candidate vertices in the main circuit. It would then be fairly trivial in this simple case to check beyond the vertices immediately neighbouring the candidates to verify matches of the entire sub-circuit.

The process described in the previous paragraph is close to how SubGemini actually works, in cases as trivial as the above example. The starting point, that order 3 internal net in the sub-circuit, is called the key vertex; and the shortlist is called the candidate vector. Key vertices can be devices or internal nets. Clearly, one could choose any key vertex at random, make a candidate vector shortlist, and use a brute-force recursive walk algorithm to verify sub-circuit matches; however, this would be inefficient.

The recursive walk is a depth-first approach. It explores many wrong paths leading to dead-ends, sometimes descending way down the call stack before back-tracking. SubGemini uses a far more efficient breadth-first technique that effectively searches outwards from every vertex through its surrounding neighbours, one edge at a time in all directions! Initially, all vertices are labelled using their invariant properties of type and order. There follows an iterative process of re-labelling, combining previous labels with connecting edge types and labels of immediate neighbours. All labels, types and orders are simply 32-bit integers.

The remarkable result of this re-labelling process is that vertices rapidly acquire large and nearly unique labels, which are like a hash of the surrounding structure. Re-labelling occurs in parallel in both main and sub-circuits, and matching vertices have equal labels. Of course, equal labels can also arise by coincidence in non-matching vertices; however, this is relatively uncommon. SubGemini uses a two-phase process to first build a candidate vector and then validate the candidates. Both phases involve iterative re-labelling. In SubGemini terminology, grouping vertices with matching labels is called partitioning.

Sub-circuits are bounded by external nets, but matching instances in the main circuit extend seamlessly beyond. At first, labels in the sub-circuit match those in the main circuit; however, after some iterations, labels within matching instances will be affected by vertices outside. Iteration stops just before the innermost labels are corrupted. Each label has a flag to indicate if it is still trustworthy. Devices and internal nets are initially valid; external nets are not. Invalidity propagates along edges with each iteration. Special nets have globally unique labels and are never re-labelled.

Hopefully, the above will provide some insight into how SubGemini works. Of course, there are additional complexities which have not been mentioned. Highly symmetrical circuits lead to ambiguities, which can only be resolved by back-tracking in the last resort. In my application I also cheat by giving node matching hints in a few special cases. Readers interested in learning more can refer to the original paper [1] or my source code, linked at the bottom of this page.

Performance

The core passes Klaus Dormann's test suite and AllSuiteA. Since it emulates the 6502 at transistor level, I would expect it to handle most of the so-called undocumented opcodes correctly. I am aware of the work by Peter Monta in which he used a 6-bit code to model voltages and currents at certain nodes. My core uses a 1-bit (Boolean) state for all nodes and resolves contention with a rule that pull-downs to logic '0' (Vss) always win. It is possible that my core does not correctly emulate some internal conflicts that might arise whilst executing undefined opcodes.

Clock ratio

The above rollover waves are from a simulation of the synthesizable core. The effect of registering the auto-generated logic output is evident in the staggering of edges as they propagate. Registers breaking combinatorial feedback paths are required for synthesis, but not for simulation. Much nicer waves with zero propagation delays were obtained from a (100% combinatorial) behavioural model. This is a very useful tool for studying 6502 internals. Buffers and inverters are not optimised away, so more nodes can be probed. See source code links below.
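To illustrate why the loop-breaking registers matter only for synthesis, here is a minimal, simulation-only sketch of a latch whose hold path feeds the MUX output straight back to its own input. It is not taken from the behavioural model; the names `comb_latch` and `comb_latch_tb` are hypothetical.

```verilog
// Simulation only: the direct feedback below is a combinatorial loop.
module comb_latch (
    input  wire en,  // high = transparent, low = hold
    input  wire d,
    output wire q
);
    assign q = en ? d : q;  // no register: output changes with zero delay
endmodule

module comb_latch_tb;
    reg  en, d;
    wire q;
    comb_latch dut (.en(en), .d(d), .q(q));

    initial begin
        en = 1'b1; d = 1'b1; #1 $display("open: q=%b", q); // open: q=1
        en = 1'b0;           #1 d = 1'b0;
                             #1 $display("held: q=%b", q); // held: q=1
    end
endmodule
```

An event-driven simulator simply leaves `q` unchanged while `en` is low, so the waveform shows no staggering, whereas a synthesis tool would report the feedback as a combinatorial loop.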

Earlier attempt

Pool

The 19" rack behind my head in this 1980 photo is a microprocessor-controlled Home Telephone Exchange.

Source code

Links

Very useful Wiki at the Visual 6502 project

Eric Schlaepfer's MOnSter 6502

Frank Kingswood's AS65 assembler

References