The first version of my open-source OpenCV–compatible FPGA Stereo Correspondence Core is now available!

(have a look at my previous FPGA Stereo Vision Project post for some more context)

It’s written purely in synthesizable Verilog, and uses device-agnostic inference for all FPGA primitives (though the current implementation is more optimized for Xilinx devices). I’m releasing it under a standard 3-clause BSD license.

The design is heavily pipelined. Under realistic conditions (in a highly-utilized, slowest-speed-grade part without any floor-planning), it can run around 150 MHz in Spartan-3E and Spartan-6 parts, and around 300 MHz in Virtex-6 parts. Much higher speeds (50+%) are possible under unrealistic (ideal) conditions.

The design is fully parameterized and highly scalable; some example implementations include:

The core has been verified in simulation using Verilator with SystemC testbenches. Post-synthesis results (from Xilinx’s XST tool) have been verified using a simplified Verilog testbench and Xilinx’s own ISim simulator.

Scope

Now, before everyone runs off and tries to build their own open-source Kinect, I must stress that this isn’t a complete solution just yet; here’s a block diagram of what I have implemented:

If we now refer back to the high-level block diagram that I presented before:

…we can see that this core implements all of the “Stereo Correspondence” block, some or all of the “Post-Processing” block, and (had I actually included it on the original diagram) the “Pre-Filtering” block. While “Image Rectification” is the only significant missing image-pipeline component, there’s still a lot of other system-level infrastructure to develop (external interfaces, buses, etc.) before I can call the project “complete.”

That being said, the correspondence core easily represents the most critical, most resource-intensive and highest-performance component of the entire system. Completing it is a major milestone in the project.

Index

OpenCV compatibility

The correspondence core implements a significant subset of the functionality provided by OpenCV’s block-matching stereo correspondence algorithm ( findStereoCorrespondenceBM ). When using the supported subset, the core produces 100% identical results to that of the OpenCV algorithm (specifically: the non-SSE implementation present in OpenCV 2.2.0; there are actually subtle differences between the unoptimized vs. SSE versions).

Looking at the parameter/state structure for OpenCV’s algorithm ( CvStereoBMState ), we can enumerate how my correspondence core compares:

Pre-filtering:
  preFilterType – CV_STEREO_BM_XSOBEL is fully supported; CV_STEREO_BM_NORMALIZED_RESPONSE is not supported.
  preFilterSize – not applicable for CV_STEREO_BM_XSOBEL.
  preFilterCap – fully supported.

Stereo Correspondence:
  SADWindowSize – fully supported.
  minDisparity – not currently supported (fixed at 0).
  numberOfDisparities – fully supported.

Post-processing:
  textureThreshold – fully supported.
  uniquenessRatio – fully supported.
  speckleRange – not currently supported.
  speckleWindowSize – not currently supported.



Due to performance related implementation details, some combinations of parameters (which would otherwise be valid) are prohibited. The usage section describes the core’s parameters, and all applicable restrictions on them.

Likewise, the core supports additional tuning parameters that allow for creating non-OpenCV-compatible implementations with a smaller logic footprint.

Compatibility example

Here are two images in the ‘Cones’ data set from the Middlebury Stereo Vision Page:

And this is the ground-truth disparity image:

We convert the two input images to gray-scale, and then run them through OpenCV’s findStereoCorrespondenceBM function with these state parameters:

state->preFilterType       = CV_STEREO_BM_XSOBEL;
state->preFilterCap        = 7;
state->SADWindowSize       = 17;
state->minDisparity        = 0;    // not supported by dlsc_stereobm
state->numberOfDisparities = 96;
state->textureThreshold    = 1300;
state->uniquenessRatio     = 25;
state->speckleRange        = 0;    // not supported by dlsc_stereobm
state->speckleWindowSize   = 0;    // ""

This filters the input images to something like this:

(just showing the left; the right is similar)

Subsequently, it creates this unfiltered disparity map:

And finally yields the following filtered disparity map:

(normally you would only care about the final result; I’ve included the intermediate steps so you can see what the image looks like at various stages in the pipeline)

If we also run the same input images through my correspondence core ( dlsc_stereobm_prefiltered , with equivalent configuration parameters), we get:

No need to break out a diff program: they’re completely identical.

(in fact, all of the intermediate steps presented above were actually produced by my reference model, while the OpenCV result was truly from executing the code present in OpenCV 2.2.0 (with SSE disabled))

Using the Stereo Correspondence Core

I’m initially releasing three versions of my correspondence core:

dlsc_stereobm_core – the raw stereo correspondence core; doesn’t include any additional buffering, nor does it include pre-filtering.

dlsc_stereobm_buffered – a wrapper around the core which includes extra input and output buffering (which also has the benefit of providing an asynchronous clock domain crossing, so the rest of the system can run at a slower clock).

dlsc_stereobm_prefiltered – a thin wrapper that includes the buffered wrapper and a pre-filtering core ( dlsc_xsobel_core ).

I’ll primarily be talking about usage of dlsc_stereobm_prefiltered , since it’s the easiest to integrate and has the greatest support for OpenCV features.

Each wrapper level adds additional logic and non-trivial amounts of RAM, so it’s still possible that you may want to use a lower-level core in order to save on these resources. Some of the usage differences between the various versions are discussed later.

Here’s an example instantiation of the pre-filtered wrapper (this exactly matches the example OpenCV configuration seen above):

dlsc_stereobm_prefiltered #(
    .DATA             ( 8 ),    // input image bits-per-pixel
    .DATAF            ( 4 ),    // filtered image bits-per-pixel
    .DATAF_MAX        ( 14 ),   // 2*state->preFilterCap
    .IMG_WIDTH        ( 640 ),  // image width
    .IMG_HEIGHT       ( 534 ),  // image height
    .DISP_BITS        ( 7 ),    // output disparity image is
                                // (DISP_BITS+SUB_BITS) bits-per-pixel
    .DISPARITIES      ( 96 ),   // state->numberOfDisparities
    .SAD_WINDOW       ( 17 ),   // state->SADWindowSize
    .TEXTURE          ( 1300 ), // state->textureThreshold
    .SUB_BITS         ( 4 ),    // must be 4 for OpenCV compatibility
    .SUB_BITS_EXTRA   ( 4 ),    // must be 4 for OpenCV compatibility
    .UNIQUE_MUL       ( 1 ),    // state->uniquenessRatio ==
    .UNIQUE_DIV       ( 4 ),    //   (UNIQUE_MUL*100)/UNIQUE_DIV
    .OUT_LEFT         ( 1 ),    // enable out_left (filtered image output)
    .OUT_RIGHT        ( 1 ),    // enable out_right
    // remaining parameters are only throughput/timing related; not functional
    .MULT_D           ( 8 ),    // process MULT_D disparities per pass
    .MULT_R           ( 2 ),    // process MULT_R rows at a time
    .PIPELINE_BRAM_RD ( 1 ),    // synthesis performance tuning
    .PIPELINE_BRAM_WR ( 0 ),    // ""
    .PIPELINE_FANOUT  ( 0 ),    // ""
    .PIPELINE_LUT4    ( 0 )     // ""
) dlsc_stereobm_prefiltered_inst (
    .core_clk     ( core_clk ),     // high-speed async clock for core
    .clk          ( clk ),          // clock for interfaces
    .rst          ( rst ),          // synchronous reset
    .in_ready     ( in_ready ),     // ready/valid handshake for input
    .in_valid     ( in_valid ),     // ""
    .in_left      ( in_left ),      // left image input
    .in_right     ( in_right ),     // right image input
    .out_ready    ( out_ready ),    // ready/valid handshake for output
    .out_valid    ( out_valid ),    // ""
    .out_disp     ( out_disp ),     // disparity image output
    .out_masked   ( out_masked ),   // disparity outside of valid area
    .out_filtered ( out_filtered ), // disparity filtered by post-proc
    .out_left     ( out_left ),     // filtered left image output
    .out_right    ( out_right )     // filtered right image output
);

Parameters

All of the core’s configuration parameters are set when the core is instantiated via standard Verilog parameters (not a single `define required!). No provision is provided for run-time adjustment of configuration (this core is designed for reconfigurable FPGAs, not general-purpose ASICs).

There are two major sets of configuration parameters. The first set deals with functional details (which can generally be mapped to OpenCV terms), while the second set deals more with performance/throughput-related implementation details (which don’t have an OpenCV equivalent, but can impose additional restrictions on the functional parameters):

Functional parameters:

Pixel size and pre-filtering:
  DATA – width, in bits, of each input pixel.
  DATAF – width, in bits, of each filtered pixel (must be enough to represent DATAF_MAX).
  DATAF_MAX – maximum value for filtered pixels (set to twice OpenCV’s preFilterCap; will default to (2**DATAF)-1 if not explicitly set).

Image size:
  IMG_WIDTH – width, in pixels, of a whole frame.
  IMG_HEIGHT – height, in pixels, of a whole frame.

Disparity search space:
  DISP_BITS – bits required to represent DISPARITIES-1. Output disparity values are DISP_BITS+SUB_BITS bits wide.
  DISPARITIES – number of disparity levels to search (equivalent to OpenCV’s numberOfDisparities; will default to 2**DISP_BITS if not explicitly set).
  SAD_WINDOW – size of the sum-of-absolute-differences window (must be odd; equivalent to OpenCV’s SADWindowSize).

Post-processing:
  TEXTURE – texture filtering threshold (equivalent to OpenCV’s textureThreshold).
  SUB_BITS – number of bits used for sub-pixel interpolation results (0 to disable; must be 4 for OpenCV compatibility).
  SUB_BITS_EXTRA – number of extra internal bits to compute for potentially increased precision when rounding sub-pixel interpolation results (0 is recommended to save a bit of logic, but must be 4 for strict OpenCV compatibility).
  UNIQUE_MUL and UNIQUE_DIV – these control uniqueness filtering, and are approximately equivalent to OpenCV’s uniquenessRatio. The conversion between the two is: ((UNIQUE_MUL*100)/UNIQUE_DIV) == (uniquenessRatio). UNIQUE_DIV must be a power-of-2.

Implementation parameters:

Outputs:
  OUT_LEFT – enable the out_left output.
  OUT_RIGHT – enable the out_right output.

Parallelization:
  MULT_D – the number of disparity levels to compute in parallel. DISPARITIES must be an integer multiple of this.
  MULT_R – the number of image rows to process in parallel. IMG_HEIGHT must be an integer multiple of this. SAD_WINDOW/2 must also be an integer multiple of this.

Synthesis performance tuning:
  PIPELINE_BRAM_RD – adds extra pipelining on block RAM read paths; recommended for most FPGA architectures.
  PIPELINE_BRAM_WR – adds extra pipelining on block RAM write paths; recommended for high-frequency targets (e.g. Virtex-6, and sometimes Spartan-6).
  PIPELINE_FANOUT – adds extra pipelining on some high-fanout paths; recommended for high-frequency targets (e.g. Virtex-6).
  PIPELINE_LUT4 – optimizes the design for FPGA architectures with 4-input LUTs (e.g. Spartan-3). You shouldn’t typically need it on newer devices with 6-input LUTs.



All of the parameters must be integers. Boolean values should use 0 for ‘false’ and 1 for ‘true’.

Some parameters which are more amenable to run-time adjustment (e.g. TEXTURE and UNIQUE_MUL ) may, in the future, be converted to input ports on the core.
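As an aside, the UNIQUE_MUL/UNIQUE_DIV conversion described above is easy to script. Here's a quick Python sketch (the function name is mine, not part of the core; it just searches for an exact power-of-2 divisor):

```python
def uniqueness_to_mul_div(uniqueness_ratio, max_div_log2=8):
    """Find a (UNIQUE_MUL, UNIQUE_DIV) pair satisfying
    (UNIQUE_MUL * 100) / UNIQUE_DIV == uniquenessRatio exactly,
    with UNIQUE_DIV constrained to a power of 2."""
    for div_log2 in range(max_div_log2 + 1):
        div = 1 << div_log2  # UNIQUE_DIV must be a power-of-2
        if (uniqueness_ratio * div) % 100 == 0:
            return (uniqueness_ratio * div) // 100, div
    raise ValueError("no exact power-of-2 UNIQUE_DIV found")

print(uniqueness_to_mul_div(25))  # (1, 4): matches the example configuration
```

Not every uniquenessRatio converts exactly (which is why the parameters are only approximately equivalent to OpenCV's).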

Parameters – determining throughput

To set some of the core’s performance related parameters, you need to know your required throughput.

The required throughput is based (approximately) on two things: your effective pixel clock, and the number of DISPARITIES you want to search. The core must, in general, effectively compute all DISPARITIES in a single pixel clock. To do this, it needs a combination of clock-frequency advantage and parallelization. That is:

MULT_D * MULT_R * (core_clk / pixel_clk) >= DISPARITIES

Note that if you’ve enabled TEXTURE filtering, the core requires an additional processing pass, and the above equation should be amended to:

MULT_D * (MULT_R * (core_clk / pixel_clk) - 1) >= DISPARITIES

A reasonable approximation for the effective pixel clock is to simply multiply the image size by the frame-rate; for example, for the MT9V032 image sensor: 752×480 @ 60 FPS = 21.7 MHz.

In actuality, the core only needs to process approximately (IMG_WIDTH - DISPARITIES + SAD_WINDOW) pixels per row (rather than the full IMG_WIDTH ). Thus, in a well-buffered system (one that can hide the dead-time resulting from the sensor’s horizontal and vertical blanking), this approximation should yield a reasonably conservative design.

For example, with DISPARITIES = 120 and SAD_WINDOW = 17 , the effective pixel clock is actually much closer to (752-120+17)*480*60 = 18.7 MHz (over a 10% margin).

Continuing with the MT9V032 example: suppose we’re targeting a low-cost Spartan-class FPGA, and expect to run at upwards of 150 MHz. We’ll call it 132 MHz (~21.7 MHz * 6), to keep the math clean. With a 6x frequency advantage, that leaves us with a deficit of 120 / 6 = 20 to make up for with parallelization. Setting MULT_D = 10 and MULT_R = 2 would satisfy that ( 10 * 2 = 20 ).
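If you want to script this bookkeeping, here's a small Python sketch (the function name and signature are mine; it just evaluates the two inequalities above):

```python
def throughput_ok(mult_d, mult_r, core_clk_hz, pixel_clk_hz,
                  disparities, texture_enabled=False):
    """Evaluate the throughput requirement from above:
      without TEXTURE: MULT_D * MULT_R * (core/pixel)       >= DISPARITIES
      with TEXTURE:    MULT_D * (MULT_R * (core/pixel) - 1) >= DISPARITIES"""
    ratio = core_clk_hz / pixel_clk_hz
    if texture_enabled:
        return mult_d * (mult_r * ratio - 1) >= disparities
    return mult_d * mult_r * ratio >= disparities

# MT9V032 example: 132 MHz core clock, ~21.7 MHz effective pixel clock,
# 120 disparities, MULT_D = 10, MULT_R = 2
print(throughput_ok(10, 2, 132e6, 21.7e6, 120))  # True
```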

Working in reverse, we can estimate how many cycles the core will take to process an image:

The core requires (DISPARITIES/MULT_D) passes to process MULT_R rows. If TEXTURE is enabled, one additional pass is required.

Each pass takes approximately (IMG_WIDTH-DISPARITIES+SAD_WINDOW) cycles.

Thus, processing one whole frame requires approximately this many cycles:

(IMG_HEIGHT/MULT_R)*((DISPARITIES/MULT_D)+(TEXTURE?1:0))*(IMG_WIDTH-DISPARITIES+SAD_WINDOW)

For the above MT9V032 example, this works out to:

(480/2)*(120/10)*(752-120+17) = 1,869,120 cycles / 132 MHz ≈ 14.2 ms

Which is just a bit shy of the MT9V032’s default frame-valid-time of 15.23 ms (indicating that we may actually be okay without mitigating vertical blanking through buffering), and a healthy margin short of the 16.7 ms implied by 60 FPS operation.
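The frame-time estimate above reduces to a one-liner; here's a Python sketch of it (the function is mine, directly transcribing the formula):

```python
def frame_cycles(img_width, img_height, disparities, sad_window,
                 mult_d, mult_r, texture_enabled):
    """Approximate cycles per frame, per the formula above:
    (IMG_HEIGHT/MULT_R) * ((DISPARITIES/MULT_D) + (TEXTURE?1:0))
                        * (IMG_WIDTH - DISPARITIES + SAD_WINDOW)"""
    passes = disparities // mult_d + (1 if texture_enabled else 0)
    row_cycles = img_width - disparities + sad_window
    return (img_height // mult_r) * passes * row_cycles

cycles = frame_cycles(752, 480, 120, 17, 10, 2, texture_enabled=False)
print(cycles)                # 1869120
print(cycles / 132e6 * 1e3)  # ~14.2 (ms at a 132 MHz core clock)
```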

Parameters – resource impact

All of these parameters have some impact on FPGA logic usage, block RAM usage, and (to a much lesser extent) operational frequency. Performing a quick trial synthesis of the core with your desired parameters is really the best way to accurately gauge resource usage. The synthesis section includes a bunch of specific examples.

That being said, here are some vague generalities to help in making trade-offs regarding parameters:

DATA has a small impact on logic and RAM usage, since its effects are confined to the small pre-filtering block.

DATAF has a huge impact on logic and RAM usage, since it affects large swaths of the correspondence pipeline. You should use as small of a value here as you can get away with (this will be driven by the DATAF_MAX parameter, which, in OpenCV terms, is 2*preFilterCap ). With a preFilterCap of 7 (output range of [0,14]), as used in the above example, only 4 bits are required for DATAF .

IMG_WIDTH and IMG_HEIGHT have essentially zero direct impact on logic usage (but significant throughput implications). IMG_WIDTH (but not IMG_HEIGHT ) has a large impact on RAM usage (all of the core’s internal buffers scale with IMG_WIDTH ).

Power-of-2 values for IMG_WIDTH will, in general, most efficiently use RAM. Unfortunately, many image sensors are just a bit above a power-of-2. While all of the front-end’s buffers are exactly IMG_WIDTH deep, the disparity block has a large set of buffers as well, which are exactly (IMG_WIDTH - (DISPARITIES-1) - (SAD_WINDOW-1)) deep. So, even if you can’t have a power-of-2 IMG_WIDTH , you may be able to find a combination of IMG_WIDTH , DISPARITIES , and SAD_WINDOW that yields an efficiently-sized disparity buffer.
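The disparity buffer depth is simple to compute when evaluating candidate configurations; a Python sketch (helper names are mine):

```python
def disparity_buffer_depth(img_width, disparities, sad_window):
    """Depth of the disparity block's internal buffers, per the text above:
    IMG_WIDTH - (DISPARITIES-1) - (SAD_WINDOW-1)."""
    return img_width - (disparities - 1) - (sad_window - 1)

def is_power_of_2(n):
    # efficient RAM usage generally wants power-of-2 depths
    return n > 0 and (n & (n - 1)) == 0

# MT9V032-style 752-wide rows with 120 disparities and a 17x17 window:
depth = disparity_buffer_depth(752, 120, 17)
print(depth, is_power_of_2(depth))  # 617 False -> not an efficient depth
```

Sweeping DISPARITIES and SAD_WINDOW through this function is a quick way to hunt for an efficiently-sized disparity buffer.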

DISP_BITS (and by extension, DISPARITIES ) have only a small impact on logic and RAM usage (but significant throughput implications).

SAD_WINDOW has a big impact on logic usage, and a bigger impact on RAM usage. It controls the size of the SAD adder trees, and the number of image rows that must be internally buffered.

TEXTURE has essentially zero impact on logic and RAM usage, since it re-uses the SAD pipeline to do its work (this, of course, has throughput implications). If you’re already using UNIQUE_MUL , you may find that TEXTURE is redundant (in general, untextured regions should fail the uniqueness check as well).

The other post-processing options can be surprisingly costly in terms of logic and (especially) RAM usage. When either SUB_BITS or UNIQUE_MUL is selected, the core requires extra RAM to track additional SAD values. If you enable UNIQUE_MUL , you may as well enable SUB_BITS as well, since it is only a small incremental cost (the opposite case is somewhat less apparent).

The exact values of UNIQUE_MUL and UNIQUE_DIV , once enabled, don’t play a big role in logic usage.

SUB_BITS_EXTRA is really only required if you want 100% OpenCV compatibility. If left at 0 (but with SUB_BITS still at 4), you can achieve results that are within +-1 LSbit (+-1/16th of a disparity) of the OpenCV result. You can save a couple percent on FPGA logic by leaving SUB_BITS_EXTRA at 0.

MULT_D has a large impact on logic usage (since it results in duplication of large parts of the stereo correspondence pipeline), and no impact on RAM usage. High MULT_D values (>>10) can lead to fanout issues on the row buffer outputs. PIPELINE_FANOUT can help here.

MULT_R is, in theory, a relatively cheap way of gaining additional throughput. If the post-processing options are disabled, this is generally true. MULT_R is best used when the SAD pipeline is large (i.e. large DATAF and/or large SAD_WINDOW ), and when post-processing is not needed.

With MULT_R , a small amount of logic is needed to extend the SAD tree to an extra row, and some extra RAM is needed to buffer that row. If post-processing is enabled, however, use of MULT_R can lead to significant increases in RAM and logic usage. A MULT_R of 1 or 2 is typically recommended.

(if you’re using the unbuffered dlsc_stereobm_core module, MULT_R also has significant system-level implications, since it requires supplying/consuming multiple rows to/from the core in parallel)

OUT_LEFT and OUT_RIGHT have little impact on logic, but enabling them requires some additional RAM for buffering. If you don’t have a need for the pre-filtered image data after it goes through the core, then you should set these to 0.

The four pipelining options have a small impact on logic usage (mostly additional registers). There is no harm in enabling all of them (and this should yield the highest frequency solution, assuming a device that isn’t highly utilized), but you may be able to save some resources by only using the ones that are required to meet timing (use my recommendations as a starting point, and then run your own synthesis tests).

Ports

The actual port-level interface to the buffered/pre-filtered wrapper is quite simple:

System:
  core_clk – high-speed clock for the internal stereo correspondence core. This clock can be totally asynchronous to the rest of the system.
  clk – the interface/pixel clock. All inputs and outputs from the buffered wrapper are synchronous to this clock.
  rst – synchronous reset input. This synchronously resets all of the interface and core logic. A single-cycle reset pulse is sufficient.

Input:
  in_ready – handshake output indicating that the core can accept a new pair of input pixels.
  in_valid – handshake input indicating that a valid pair of pixels is currently being supplied to the core.
  in_left – left pixel input (width: DATA ).
  in_right – right pixel input (width: DATA ).

Output:
  out_ready – handshake input indicating that the core can supply a new set of output data.
  out_valid – handshake output indicating that the core is currently supplying a valid set of output data.
  out_disp – computed disparity value (width: DISP_BITS + SUB_BITS ).
  out_masked – pixel was outside usable region (disparity is invalid, and out_disp has been zeroed).
  out_filtered – disparity value failed the uniqueness ratio or texture threshold check (but it’s still presented, in case you want it).
  out_left – left pixel output (width: DATAF ); will be 0 if OUT_LEFT isn’t set.
  out_right – right pixel output (width: DATAF ); will be 0 if OUT_RIGHT isn’t set.



It’s designed to be as user-friendly as possible. All of your interface logic can run at a relatively slow speed (e.g. at the pixel clock) for ease of timing closure, while the actual stereo correspondence core runs as fast as you need it to (via the separate core_clk input). All those tricky asynchronous boundary crossings are handled within the wrapper.

Ready/Valid handshaking

A brief elaboration on ready/valid handshaking, for the uninitiated: ready is driven by the consumer/sink/slave block. valid is driven by the producer/source/master (along with all qualified data). When ready and valid are both asserted on a clock edge, data is transferred (this is the only time a transfer can happen).

The consumer may deassert ready to prevent data transfer (e.g. if its internal buffers are full).

After data is transferred, the producer may supply another piece of data and leave valid asserted, or it can deassert valid to indicate that no more data is immediately available. Once data becomes available again, the producer can supply the new data and assert valid again.

ARM’s AMBA-AXI bus is a good example of a high-performance interface relying on a ready/valid handshake.
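For the uninitiated, the semantics are easy to see in a toy cycle-by-cycle model (illustrative Python only; the names are mine, and this is obviously not how you'd verify real RTL):

```python
def simulate_ready_valid(producer_data, ready_pattern, cycles):
    """Tiny model of a ready/valid handshake: a transfer happens only on
    cycles where both ready and valid are asserted."""
    transferred = []
    idx = 0
    for cyc in range(cycles):
        valid = idx < len(producer_data)                 # producer has data
        ready = ready_pattern[cyc % len(ready_pattern)]  # consumer backpressure
        if ready and valid:                              # transfer condition
            transferred.append(producer_data[idx])
            idx += 1
    return transferred

# consumer accepts only every other cycle; no data is lost or duplicated
print(simulate_ready_valid([1, 2, 3, 4], [True, False], 10))  # [1, 2, 3, 4]
```

The key property: backpressure (deasserting ready) only delays transfers; it never drops or duplicates data.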

The unbuffered core

You’ve already seen one of the wrapper modules ( dlsc_stereobm_prefiltered ). There also exists another nearly-identical wrapper which omits any pre-filtering functionality: dlsc_stereobm_buffered . If you don’t want to use xsobel pre-filtering, but would still like to benefit from buffering, you should use this module. Its parameters and port-list are nearly identical, so I won’t go into detail.

The 3rd module isn’t a wrapper at all – it’s the actual stereo correspondence core: dlsc_stereobm_core . Its interface is a little bit different (especially on the output side); here’s an instantiation:

dlsc_stereobm_core #(
    .DATA             ( 4 ),    // input/output image bits-per-pixel
    .DATA_MAX         ( 14 ),   // 2*state->preFilterCap
    .IMG_WIDTH        ( 640 ),  // image width
    .IMG_HEIGHT       ( 534 ),  // image height
    .DISP_BITS        ( 7 ),    // output disparity image is
                                // (DISP_BITS+SUB_BITS) bits-per-pixel
    .DISPARITIES      ( 96 ),   // state->numberOfDisparities
    .SAD_WINDOW       ( 17 ),   // state->SADWindowSize
    .TEXTURE          ( 1300 ), // state->textureThreshold
    .SUB_BITS         ( 4 ),    // must be 4 for OpenCV compatibility
    .SUB_BITS_EXTRA   ( 4 ),    // must be 4 for OpenCV compatibility
    .UNIQUE_MUL       ( 1 ),    // state->uniquenessRatio ==
    .UNIQUE_DIV       ( 4 ),    //   (UNIQUE_MUL*100)/UNIQUE_DIV
    // remaining parameters are only throughput/timing related; not functional
    .MULT_D           ( 8 ),    // process MULT_D disparities per pass
    .MULT_R           ( 2 ),    // process MULT_R rows at a time
    .PIPELINE_BRAM_RD ( 1 ),    // synthesis performance tuning
    .PIPELINE_BRAM_WR ( 0 ),    // ""
    .PIPELINE_FANOUT  ( 0 ),    // ""
    .PIPELINE_LUT4    ( 0 )     // ""
) dlsc_stereobm_core_inst (
    .clk               ( clk ),               // clock for interfaces and core
    .rst               ( rst ),               // synchronous reset
    .in_ready          ( in_ready ),          // ready/valid handshake for input
    .in_valid          ( in_valid ),          // ""
    .in_left           ( in_left ),           // left image input
    .in_right          ( in_right ),          // right image input
    .out_busy          ( out_disp_busy ),     // feedback to core requesting it
                                              // temporarily halt output (50~100 more
                                              // values will be sent *after* busy is asserted)
    .out_disp_valid    ( out_disp_valid ),    // qualifier for out_disp_ signals
    .out_disp_data     ( out_disp_data ),     // disparity image output
    .out_disp_masked   ( out_disp_masked ),   // disparity outside of valid area
    .out_disp_filtered ( out_disp_filtered ), // disparity filtered by post-proc
    .out_img_valid     ( out_img_valid ),     // qualifier for out_img_ signals
    .out_img_left      ( out_img_left ),      // left image output
    .out_img_right     ( out_img_right )      // right image output
);

There are a few major differences between the core and its various buffered wrappers:

There’s only one clock input now. clk clocks both the core and all of the interfaces signals.

The outputs no longer use a two-way ready/valid handshake (the input still does, though). Instead, the output is split into two groups of signals: the disparity output (prefixed with out_disp_ ) and the image output (prefixed with out_img_ ). Each group is qualified with a unidirectional _valid signal.

In operation, you’ll typically see the image output lead the disparity output by 50~100 pixels (the time required for those pixels to traverse the stereo correspondence pipeline). When the disparity output is masked, this delay will be significantly less (as masked pixels don’t traverse the pipeline).

Lacking a ready/valid handshake, the only way to throttle the core’s output is via the out_busy feedback input. When this input is asserted, the frontend in the core will stop sending pixels down the pipeline. Pixels already in the pipeline, however, will continue to be output. Due to the pipeline’s depth, a significant number (50~100) of pixels will exit the disparity output even after out_busy is asserted; downstream logic must be able to tolerate this.

If MULT_R > 1 , then all of the core’s ports (excluding clocks, resets and handshake/qualifier signals) become wider to accommodate transfer of multiple rows in parallel.

Without buffering, the core’s inputs and outputs are very “bursty” (brief periods of significant activity; idle the rest of the time). The core takes DISPARITIES / MULT_D passes to process each set of MULT_R rows. It only transfers data on the last pass. Any throttling of the core’s interfaces (via deasserting in_valid or asserting out_busy ) will result in reduced pipeline utilization, so you must make sure your interface logic can cope with these bursty transfers.

In most cases, you’ll probably encounter fewer issues by just using one of the buffered wrappers.

Synthesis and Place & Route

Synthesizing the core is straight-forward. For typical designs, it achieves timing closure with minimal effort and few changes to synthesis options.

The design has been tested in Xilinx ISE 13.1 and Altera Quartus 11.0.

A complete list of files (Verilog modules and includes) needed by the pre-filtered core:

alu/rtl/dlsc_absdiff.v
alu/rtl/dlsc_adder_tree.v
alu/rtl/dlsc_compex.v
alu/rtl/dlsc_divu.v
alu/rtl/dlsc_min_tree.v
alu/rtl/dlsc_multu.v
common/rtl/dlsc_clog2.vh
common/rtl/dlsc_synthesis.vh
mem/rtl/dlsc_fifo_shiftreg.v
mem/rtl/dlsc_pipedelay.v
mem/rtl/dlsc_pipedelay_clken.v
mem/rtl/dlsc_pipedelay_rst.v
mem/rtl/dlsc_pipedelay_valid.v
mem/rtl/dlsc_pipereg.v
mem/rtl/dlsc_ram_dp.v
mem/rtl/dlsc_ram_dp_slice.v
mem/rtl/dlsc_shiftreg.v
rvh/rtl/dlsc_rowbuffer.v
rvh/rtl/dlsc_rowbuffer_combiner.v
rvh/rtl/dlsc_rowbuffer_splitter.v
rvh/rtl/dlsc_rvh_decoupler.v
stereo/rtl/dlsc_stereobm_backend.v
stereo/rtl/dlsc_stereobm_buffered.v
stereo/rtl/dlsc_stereobm_core.v
stereo/rtl/dlsc_stereobm_disparity.v
stereo/rtl/dlsc_stereobm_disparity_slice.v
stereo/rtl/dlsc_stereobm_frontend.v
stereo/rtl/dlsc_stereobm_frontend_control.v
stereo/rtl/dlsc_stereobm_multipipe.v
stereo/rtl/dlsc_stereobm_pipe.v
stereo/rtl/dlsc_stereobm_pipe_accumulator.v
stereo/rtl/dlsc_stereobm_pipe_accumulator_slice.v
stereo/rtl/dlsc_stereobm_pipe_adder.v
stereo/rtl/dlsc_stereobm_pipe_adder_slice.v
stereo/rtl/dlsc_stereobm_postprocess.v
stereo/rtl/dlsc_stereobm_postprocess_subpixel.v
stereo/rtl/dlsc_stereobm_postprocess_uniqueness.v
stereo/rtl/dlsc_stereobm_prefiltered.v
stereo/rtl/dlsc_xsobel_core.v
sync/rtl/dlsc_domaincross.v
sync/rtl/dlsc_domaincross_slice.v
sync/rtl/dlsc_rstsync.v
sync/rtl/dlsc_syncflop.v
sync/rtl/dlsc_syncflop_slice.v

Two of those files (the .vh files in common/rtl/) are `include files; you’ll need to ensure your synthesis tool’s include path can find them (most simulators use “+incdir”; XST uses the “-vlgincdir” option; I’ve yet to find documentation on what the Altera equivalent is).

You should also set two global `defines: SYNTHESIS and XILINX (or ALTERA ). Those aren’t strictly required, but they will improve timing/performance. They enable additional Verilog metacomment synthesis directives embedded throughout the design (on Xilinx devices, anyway; I have not performed this level of optimization for Altera devices yet, but it’s on the to-do list). All of the metacomments are defined in a central place ( dlsc_synthesis.vh ), and referenced via `defines (to allow for swapping compatible Altera directives for Xilinx directives; hence the vendor-specific define).

For Xilinx, you should ensure that timing-driven mapping is enabled (enabled by default for newer 6-series devices; optional for older ones). If it’s not enabled, you may run into problems with slow, high-fanout paths (e.g. on the reset net). The core contains register duplication directives to help with fanout issues, but they won’t work without timing-driven mapping.

If your synthesis tool doesn’t support constant functions, but does support $clog2 , you should also define USE_CLOG2 (this controls what method is used by dlsc_clog2.vh to implement a clog2 function). Both ISE/XST and Quartus work fine with the default.

Both of the clock nets should be constrained somewhere in your design. Here’s an example UCF file I’ve been using for synthesis testing (with the wrapper as the top-level module):

NET "clk" TNM_NET = clk;
TIMESPEC TS_clk = PERIOD "clk" 50 MHz HIGH 50%;
NET "core_clk" TNM_NET = core_clk;
TIMESPEC TS_core_clk = PERIOD "core_clk" 150 MHz HIGH 50%;

The asynchronous boundary crossings inside the modules in sync/rtl/ may require special false-path constraints. For Xilinx targets, this should already be handled by the aforementioned Verilog metacomments (provided you’ve `defined SYNTHESIS and XILINX ). For other targets (Altera), I haven’t yet taken care of this.

Example results

For all of the examples presented here, dlsc_stereobm_prefiltered was synthesized as the top-level module. The resource utilization numbers are for a completely placed and routed design with clean timing results. Timing-driven mapping was always used (for Xilinx); other synthesis and place&route options were left at their defaults. ISE 13.1 was used for all Xilinx examples. Quartus 11.0 was used for the lone Altera example.

These examples don’t necessarily represent optimal instantiations; there may be other combinations of parameters (especially of MULT_D and MULT_R) that yield more efficient implementations. Running a trial synthesis and place&route operation is easy and relatively quick, so I’d encourage you to experiment with other configurations.

Examples include:

320×240 QVGA @ 30 FPS in a Spartan-3E 250

We’ll start with a small example, and move up from there: Quarter-VGA (320×240) at just 30 FPS – less than a 2.5 MHz effective pixel clock.

We want to search 48 disparities (15% of the width) with a SAD window of 11×11.

2.5 MHz is slow enough that we can easily handle it without any parallelization: 48 disparities * 2.5 MHz = 120 MHz .

This may be low resolution, but we still want most of the bells-and-whistles: texture filtering, uniqueness-ratio checking and sub-pixel interpolation. We have no interest in using the xsobel filtered image data after the pipeline.

Thus, we instantiate with these parameters:

dlsc_stereobm_prefiltered #(
    .DATA             ( 8 ),
    .DATAF            ( 4 ),
    .DATAF_MAX        ( 14 ),
    .IMG_WIDTH        ( 320 ),
    .IMG_HEIGHT       ( 240 ),
    .DISP_BITS        ( 6 ),
    .DISPARITIES      ( 48 ),
    .SAD_WINDOW       ( 11 ),
    .TEXTURE          ( 500 ),
    .SUB_BITS         ( 4 ),
    .SUB_BITS_EXTRA   ( 1 ),
    .UNIQUE_MUL       ( 1 ),
    .UNIQUE_DIV       ( 4 ),
    .OUT_LEFT         ( 0 ),
    .OUT_RIGHT        ( 0 ),
    .MULT_D           ( 1 ),
    .MULT_R           ( 1 ),
    .PIPELINE_BRAM_RD ( 1 ),
    .PIPELINE_BRAM_WR ( 0 ),
    .PIPELINE_FANOUT  ( 0 ),
    .PIPELINE_LUT4    ( 1 )
)

Targeting the 2nd-smallest Spartan-3E (an XC3S250E-4VQ100), we get:

Resource        Used   Available  Utilization
Flip-flops      3,011  4,896      61%
LUTs            1,528  4,896      31%
Slices          1,902  2,448      77%
RAMB16s         10     12         83%
MULT18X18SIOs   1      12         8%

An easy fit with lots of logic to spare. RAM is a bit tight, however. On the lower end of things, the core tends to be block-RAM limited, rather than logic limited.

320×240 QVGA @ 120 FPS in a Spartan-3E 250

So, obviously, we need to pack some more logic into the design. Quadrupling the frame-rate ought to do it: 320×240 @ 120 FPS, for an effective pixel clock of around 9.2 MHz.

We’ll up the core clock to 160 MHz and set MULT_D = 3 to achieve the required throughput. Remember, with TEXTURE enabled, we’re trying to satisfy the equation:

MULT_D * (MULT_R * (core_clk / pixel_clk) - 1) >= DISPARITIES

..which we do: 3 * (1 * (160/9.2) - 1) = 49.2 >= 48

The only difference in the instantiation is changing MULT_D from 1 to 3.

Targeting the same Spartan-3E 250, we get:

Resource        Used   Available  Utilization
Flip-flops      3,935  4,896      80%
LUTs            2,394  4,896      48%
Slices          2,325  2,448      94%
RAMB16s         10     12         83%
MULT18X18SIOs   1      12         8%

Excellent. The core is very low latency (on the order of SAD_WINDOW rows of delay; an image rectification front-end will contribute more), so it might make sense to take a high-framerate configuration like this and use it in a hard-realtime control system (perhaps for a quadrotor UAV..).

And there’s still some logic to spare. The RAM problem, of course, persists; once we try to add some additional functions (like image rectification), we may have to upgrade to a slightly larger device (e.g. a Spartan-3E 500).

640×480 VGA @ 30 FPS in a Spartan-3E 500

We’ve now upgraded to a 500E; why stop at QVGA?

The source is now 640×480 @ 30 FPS (but with the same 9.2 MHz effective pixel clock). Now, however, we want to search 96 disparities (still 15%) with a larger SAD window of 15×15.

We’ll set the core clock to 150 MHz (~9.2 * 16). With a 16x frequency advantage, we still need 96 / 16 = 6 parallel pipelines; set MULT_D = 6 and leave MULT_R = 1 . Resultant parameters are:

dlsc_stereobm_prefiltered #(
    .DATA             ( 8 ),
    .DATAF            ( 4 ),
    .DATAF_MAX        ( 14 ),
    .IMG_WIDTH        ( 640 ),
    .IMG_HEIGHT       ( 480 ),
    .DISP_BITS        ( 7 ),
    .DISPARITIES      ( 96 ),
    .SAD_WINDOW       ( 15 ),
    .TEXTURE          ( 0 ),
    .SUB_BITS         ( 4 ),
    .SUB_BITS_EXTRA   ( 4 ),
    .UNIQUE_MUL       ( 1 ),
    .UNIQUE_DIV       ( 4 ),
    .OUT_LEFT         ( 0 ),
    .OUT_RIGHT        ( 0 ),
    .MULT_D           ( 6 ),
    .MULT_R           ( 1 ),
    .PIPELINE_BRAM_RD ( 1 ),
    .PIPELINE_BRAM_WR ( 0 ),
    .PIPELINE_FANOUT  ( 0 ),
    .PIPELINE_LUT4    ( 1 )
)

This fits nicely into a small Spartan-3E 500 (XC3S500E-4VQ100):

Resource        Used   Available  Utilization
Flip-flops      5,856  9,312      62%
LUTs            4,074  9,312      43%
Slices          3,780  4,656      81%
RAMB16s         16     20         80%
MULT18X18SIOs   1      20         5%

We probably even have enough space leftover to fit a complete system. Nice.

If we wind up too short on RAM, we can drop the resolution by 20% to 512×480; this saves a lot of memory (powers-of-2 and all that..):

Resource        Used   Available  Utilization
Flip-flops      5,803  9,312      62%
LUTs            4,066  9,312      43%
Slices          3,838  4,656      82%
RAMB16s         10     20         50%
MULT18X18SIOs   1      20         5%

800×480 @ 60 FPS in Spartan-6 LX25

Higher resolution is nice; so is frame-rate. Maybe we want both: 800×480 @ 60 FPS. (similar to the output of an MT9V032 image sensor, as seen in the previous throughput example).

That’s around a 23 MHz pixel clock. We’ll now be searching 120 disparities (15%) with a SAD window of 17×17.

Our pixel clock is now nearly 10x that of the original 30 FPS QVGA example, and our search space has increased by 150% (that’s around 25x the overall compute requirement). Parallelization is a must. We’ll target a newer Spartan-6 device, and figure on an easy-to-route 138 MHz core clock (6 * 23 MHz). That leaves us with a 20x throughput deficit (120 disparities / 6 = 20). MULT_D = 10 and MULT_R = 2 will fix that.

Again: most of the bells-and-whistles, but we’re willing to drop texture filtering (relying only on uniqueness-ratio checking to catch poor results). Resultant parameters are:

dlsc_stereobm_prefiltered #(
    .DATA             ( 8 ),
    .DATAF            ( 4 ),
    .DATAF_MAX        ( 14 ),
    .IMG_WIDTH        ( 800 ),
    .IMG_HEIGHT       ( 480 ),
    .DISP_BITS        ( 7 ),
    .DISPARITIES      ( 120 ),
    .SAD_WINDOW       ( 17 ),
    .TEXTURE          ( 0 ),
    .SUB_BITS         ( 4 ),
    .SUB_BITS_EXTRA   ( 4 ),
    .UNIQUE_MUL       ( 1 ),
    .UNIQUE_DIV       ( 4 ),
    .OUT_LEFT         ( 0 ),
    .OUT_RIGHT        ( 0 ),
    .MULT_D           ( 10 ),
    .MULT_R           ( 2 ),
    .PIPELINE_BRAM_RD ( 1 ),
    .PIPELINE_BRAM_WR ( 0 ),
    .PIPELINE_FANOUT  ( 0 ),
    .PIPELINE_LUT4    ( 0 )
)

Targeting a Spartan-6 LX25 (XC6SLX25-2FTG256), we get:

Resource     Used    Available  Utilization
Flip-flops   10,404  30,064     34%
LUTs         7,235   15,032     48%
Slices       2,615   3,758      69%
RAMB16BWERs  19      52         36%
RAMB8BWERs   3       104        2%
DSP48A1s     0       38         0%

Xilinx’s synthesis tool seems to have a lot of trouble estimating how much logic this design takes:

Found area constraint ratio of 100 (+ 5) on block dlsc_stereobm_prefiltered, actual ratio is 169.
Optimizing block <dlsc_stereobm_prefiltered> to meet ratio 100 (+ 5) of 3758 slices :
WARNING:Xst:2254 - Area constraint could not be met for block <dlsc_stereobm_prefiltered>, final ratio is 165.

70% slice usage is relatively high, but it’s a far cry from 170%! I haven’t yet tracked down the source of this estimate; my current theory is that it isn’t taking into account the savings from mapping shift-registers into LUTs (the core uses a considerable number of them for pipeline delay matching). Thankfully, this warning doesn’t appear to impact the final results.

The first prototype of my complete stereo vision system will eventually wind up on a Xilinx SP605 board – which houses a Spartan-6 LX45 (the next step up from the LX25). Given that this 60 FPS WVGA example represents my design goal for the system, I should have plenty of FPGA logic leftover for exploring other features (maybe a multi-baseline setup with 3 cameras..).

Going in the opposite direction, this configuration will actually fit in an even smaller Spartan-6 LX16 device (XC6SLX16-2FTG256):

Resource     Used    Available  Utilization
Flip-flops   10,404  18,224     57%
LUTs         7,300   9,112      80%
Slices       2,267   2,278      99%
RAMB16BWERs  19      32         59%
RAMB8BWERs   3       64         4%
DSP48A1s     0       32         0%

But you’d have a heck of a time putting any other logic in there with it!

800×480 @ 60 FPS in Spartan-3E 1200

Here are the results for the same design (but with PIPELINE_LUT4 set) in an older Spartan-3E 1200 (XC3S1200E-4FT256):

Resource       Used    Available  Utilization
Flip-flops     12,318  17,344     71%
LUTs           9,168   17,344     52%
Slices         7,555   8,672      87%
RAMB16s        22      28         78%
MULT18X18SIOs  2       28         7%

800×480 @ 60 FPS in Cyclone IV EP4CE22

And once more, but this time, so as not to be too Xilinx-biased, we’ll run it in an Altera Cyclone IV E part (EP4CE22F17C8):

Resource        Used     Available  Utilization
Registers       12,851   22,320     58%
Logic Elements  14,894   22,320     67%
Memory bits     286,088  608,256    47%
Multipliers     0        132        0%

The register usage is very similar to that of the Spartan-3E, but the logic usage is significantly higher. The Cyclone IV’s LEs should be very comparable to the Spartan-3E’s LUTs, so this is a bit perplexing. It’s possible that this is due to a less efficient (or non-existent) shift-register implementation in the Cyclone IV LEs (Altera’s low-end devices have historically lacked the ability to use LEs as small RAMs – in contrast to Xilinx’s distributed RAM and SRL16s).

Again, I haven’t performed any Altera-specific optimization yet. Despite this, the core makes it through Altera’s implementation flow without any modifications, and achieves good performance – that’s the benefit of using inference to create device-agnostic designs!

1920×1080 @ 30 FPS in Spartan-6 LX75

VGA (wide or otherwise) is a bit old-fashioned; let’s try something with higher.. definition: 1920×1080 @ 30 FPS (1080p30).

We’ll spec a search space of 300 disparities with a SAD window of 25×25.

1920×1080 @ 30 FPS is around a 60 MHz effective pixel clock. We’re still in a Spartan-6, so we’ll limit ourselves to a 150 MHz core clock (2.5 * 60). That leaves us with 300 / 2.5 = 120 disparities to process in parallel. That’s quite a lot, but I think we can manage it (in fact, I know we can, since I’ve already written the results). MULT_D = 30 and MULT_R = 4 works out nicely.

Just to double-check some of the parameter restrictions: SAD_WINDOW/2 = 25/2 = 12 which is a multiple of MULT_R = 4 ; IMG_HEIGHT = 1080 is a multiple as well. DISPARITIES = 300 is a multiple of MULT_D = 30 . All is well.

We expect the device to be pretty full, so we’ll skip strict sub-pixel compatibility with OpenCV. We’ll also omit texture filtering, since it would be a huge waste of resources (29 of those 30 parallel pipes would be idle on the final texture pass; and we’d have to up the clock frequency to compensate; this may be fixed in the future).

This yields the parameters:

dlsc_stereobm_prefiltered #(
    .DATA             ( 8 ),
    .DATAF            ( 4 ),
    .DATAF_MAX        ( 14 ),
    .IMG_WIDTH        ( 1920 ),
    .IMG_HEIGHT       ( 1080 ),
    .DISP_BITS        ( 9 ),
    .DISPARITIES      ( 300 ),
    .SAD_WINDOW       ( 25 ),
    .TEXTURE          ( 0 ),
    .SUB_BITS         ( 4 ),
    .SUB_BITS_EXTRA   ( 0 ),
    .UNIQUE_MUL       ( 1 ),
    .UNIQUE_DIV       ( 4 ),
    .OUT_LEFT         ( 0 ),
    .OUT_RIGHT        ( 0 ),
    .MULT_D           ( 30 ),
    .MULT_R           ( 4 ),
    .PIPELINE_BRAM_RD ( 1 ),
    .PIPELINE_BRAM_WR ( 0 ),
    .PIPELINE_FANOUT  ( 1 ),
    .PIPELINE_LUT4    ( 0 )
)

Targeting a Spartan-6 LX75 (XC6SLX75-3FGG484), we get:

Resource     Used    Available  Utilization
Flip-flops   45,230  93,296     48%
LUTs         29,315  46,648     62%
Slices       10,843  11,662     92%
RAMB16BWERs  74      172        43%
RAMB8BWERs   3       344        1%
DSP48A1s     0       132        0%

That’s pretty heavily utilized. As a result, we had to upgrade to a -3 part to achieve timing closure (thankfully, the price premium on faster Spartan-6’s isn’t anywhere near as hefty as with Virtex devices). It may be prudent to upgrade to a larger (but slower speed grade) device. (I’d have used an LX100 or LX150 here, but Xilinx’s free ISE Webpack license precludes such things).

Interestingly, this design is around 6.8x the compute requirement of the previous 60 FPS WVGA example (2.7x for pixel rate times 2.5x for search-space), but we’re managing to fit in a device that’s only 3-4.5x the size. Scaling is rarely linear.

What if we, for whatever reason, decided that we didn’t need any post-processing? Setting SUB_BITS = 0 and UNIQUE_MUL = 0 ( TEXTURE is already disabled) and resynthesizing, we get:

Resource     Used    Available  Utilization
Flip-flops   34,677  93,296     37%
LUTs         21,984  46,648     47%
Slices       8,821   11,662     75%
RAMB16BWERs  47      172        27%
RAMB8BWERs   3       344        1%
DSP48A1s     0       132        0%

That’s a big savings, but not one that you’re likely to want to take advantage of (un-post-processed disparity maps leave something to be desired).

Higher-end devices

I won’t be presenting any specific examples targeting higher-end devices (e.g. a Virtex-6), as this core (by itself) really doesn’t make good use of such a potent device (the core has no use for DSP blocks, nor for gigabit transceivers).

It’s certainly possible to contrive an example that requires the resources of a large Virtex-6 (say, Quad-HD 3840×2160 @ 30 FPS – a 250 MHz pixel clock), but I wouldn’t recommend using this core in that scenario; the brute-force block-matching algorithm employed for this module (and by OpenCV) is well-suited to relatively low-resolution inputs, but becomes less-and-less practical as resolutions and search-spaces are increased.

A more efficient algorithm would not need to exhaustively search every possible disparity. One very simple approach (if one didn’t want to develop an entirely new algorithm) would be to apply the block-matching algorithm to a sub-sampled version of the source, and use the results of that as a starting point for an optimization algorithm that runs on the full-resolution images. That’s a bit beyond the immediate scope of this project.

Now, if you’re already using a Virtex-6 for other image-processing applications, and just want to add some stereo processing: I think you’ll find that a ~300 MHz stereo vision core can process a lot of pixels for not a lot of logic usage!

Inside the Core

The core itself is broken into 5-7 major pieces (depending on which wrapper, if any, you’re using):

Pre-filtering

dlsc_xsobel_core is an independent pipeline block included by the pre-filtered wrapper. It performs the same pre-filtering as OpenCV’s prefilterXSobel function, which is invoked when you use the CV_STEREO_BM_XSOBEL pre-filtering option (note: it matches OpenCV only when SSE2 is disabled; OpenCV generates slightly different results with SSE2 enabled).

Since the xsobel filter requires 3 rows of image data to produce a single row of output, the xsobel core includes enough RAM to buffer 2 rows of incoming data. The 3rd row comes straight from the input without buffering.

The actual filtering operation is fully pipelined and can (almost) handle 1 pixel per clock cycle. Due to the filter requiring a 3×3 window around a given output pixel, actual throughput is slightly lower: the xsobel core spends IMG_WIDTH + 1 cycles on each row, and must run for IMG_HEIGHT + 1 rows to complete an entire frame.

Front-end

dlsc_stereobm_frontend is the first pipeline component in the stereo correspondence core. It’s responsible for buffering enough incoming image rows to create an entire sum-of-absolute-differences window, and for sending needed pixels down the rest of the pipeline.

The front-end buffers (SAD_WINDOW + MULT_R - 1) rows each for the left and right image inputs. When the row buffers are full, the front-end begins sending pixels down the correspondence pipeline. It simultaneously sends the corresponding image data to the back-end for output.

It’s designed with throughput in mind, and can keep the pipeline 100% utilized once the initial buffering phase is completed (assuming no input or output throttling, of course). Currently, the front-end is able to overlap the initial buffering phase of the next frame with the processing of the final rows for the current frame (it will only do this if the input can immediately supply data for the next frame). This can further improve pipeline utilization.

The front-end makes (DISPARITIES / MULT_D) + (TEXTURE ? 1 : 0) passes over a given set of MULT_R rows. On each pass, it sends the same left-image data down the pipeline. The right-image data is different for each pass, to enable computing the SAD values for different disparity levels.

On the final pass for a given set of rows, the front-end shuffles all of the buffered rows down and loads a new set of input rows onto the top of the buffers. This is performed in conjunction with sending the final pass’ data down the pipeline; no pipeline cycles are wasted for loading/shuffling data.

Multi-pipe

dlsc_stereobm_multipipe is the heart of the block-matching stereo correspondence algorithm. Well, actually, dlsc_stereobm_pipe is; the multi-pipe merely instantiates MULT_D instances of the SAD pipeline, in order to enable processing multiple disparity levels in parallel.

The multi-pipe processes MULT_D adjacent disparities simultaneously from a single stream of left/right input image data. The left image data is sent to all SAD pipelines, while the right image data is cascaded through each pipeline to the adjoining pipeline (with a single cycle delay on each hop). This cascading causes each pipeline to be operating on a different disparity level (1 different than its neighbor).

Each SAD pipeline comprises three note-worthy components: an absolute-differences block, an adder tree, and a window accumulator. In conjunction, these blocks enable the pipeline to implement a sum-of-absolute-differences function over a SAD_WINDOW x SAD_WINDOW window with an output rate of 1 pixel per cycle.

The absolute-differences block (as the name implies) merely computes the absolute difference between each pair of left/right input pixels.

The adder tree takes the resultant column of SAD_WINDOW absolute differences and sums them into a single value. This value is still only a single column out of the overall SAD window.

The SAD window accumulator takes SAD columns and (again, as the name implies) accumulates them into a complete SAD window. In operation, the window slides laterally along the image at 1 pixel per cycle. Rather than re-compute the entire window each cycle, the accumulator keeps track of the previous SAD_WINDOW column sums. By accumulating new column sums, and subtracting old ones that are falling outside the window, it creates the complete SAD window with minimal effort.

When processing multiple rows in parallel ( MULT_R > 1 ), the SAD pipeline is able to share significant resources between each row.

The outputs of each SAD pipeline in the multi-pipe are sent directly to the next stage in the overall pipeline, without first finding the “best” of the MULT_D SAD/disparity values.

The core of the SAD pipeline closely resembles the diagram that I presented way back in my original post:

(as it would be configured when SAD_WINDOW = 5 and MULT_R = 1 )

The left and right row buffers live in the previously described front-end; the SAD tree and accumulator live in this block; and the disparity buffers live in the disparity block:

Disparity

dlsc_stereobm_disparity has become somewhat of a “catch-all” for reconciling SAD pipeline outputs. Originally, before I added support for several of OpenCV’s post-processing operations (texture filtering, uniqueness filtering and sub-pixel interpolation), the disparity block was only responsible for comparing the results from the current pass against the results from a previous pass and finding the best result. Now it does rather more.

The pipeline takes multiple passes to process a given row, and each pass only covers a fraction of the overall disparity values. For this reason, the disparity block is responsible for maintaining state across multiple passes for an entire row. Only on the final pass is the disparity block able to produce a complete output.

Using the MULT_D SAD/disparity pairs from the multi-pipe, and the saved “best” values from previous passes, the disparity block finds:

The best overall SAD/disparity.

The 2nd-best SAD/disparity (excluding ones immediately adjacent to the current best disparity); this is used for uniqueness-ratio checking.

The SAD values for the disparities immediately adjacent to the current best disparity; this is used for sub-pixel interpolation.

The logic to do this is somewhat involved, requiring two many-input comparator trees and a bunch of control/masking logic.

The disparity block also performs the “filtering” part of the texture-filtering operation (the SAD pipeline is re-used to compute the “texture” value); this is merely a comparison operation.

On the final pass, the disparity block sends all of these results down the pipeline.

Post-processing

dlsc_stereobm_postprocess implements two optional post-processing operations: sub-pixel interpolation and uniqueness-ratio filtering.

Sub-pixel interpolation uses the SAD values of the two disparities adjacent to the winning disparity in order to compute a more accurate final disparity. In the OpenCV implementation, this requires a small (8-bit result) divider; this divider is implemented in a pipelined fashion in the core (comprising 8 subtractors and some muxing logic). If you forgo strict OpenCV compatibility and set SUB_BITS_EXTRA to 0, this shrinks to 4 bits (4 subtractors).

Uniqueness-ratio filtering is simple: if the condition (best_sad + (best_sad * UNIQUE_MUL)/UNIQUE_DIV) < 2nd_best_sad isn’t satisfied, then the disparity value is filtered out. For small power-of-2 UNIQUE_MUL values, adders will typically be inferred; for larger values, hardware multipliers are likely to be inferred.

The results of the uniqueness-ratio filtering are merged (OR’ed) with the texture filtering results before being sent to the final pipeline stage.

Back-end

dlsc_stereobm_backend used to perform a lot more, before I moved output buffering functions into an external module. These days, the back-end is primarily responsible for generating blank/invalid disparity values for pixels that fall outside the valid region (that is, pixels that will never traverse the pipeline).

So as not to overwhelm the output, the back-end monitors image data coming from the front-end and only generates invalid disparity values for pixels that have already been sent by the front-end. Valid disparity values can only come from the pipeline; they pass through the back-end without modification.

Asynchronous buffering

dlsc_rowbuffer modules are responsible for implementing the asynchronous input and output buffering present in the buffered and pre-filtered wrappers.

At their core, they resemble a standard asynchronous FIFO. But there’s a twist: in order to accommodate conversion between single-row I/O and multi-row I/O ( MULT_R > 1 ), they contain additional logic.

dlsc_rowbuffer_splitter takes in a single row at a time, and outputs MULT_R rows in parallel. To accomplish this, the FIFO allows the input to write multiple times to the same entry; only on the final write (to the final row) does the FIFO consider a value “pushed.” At that point, the output is able to see the whole set of rows, and can begin consuming them.

dlsc_rowbuffer_combiner works similarly; it takes in MULT_R rows in parallel, and outputs a single row at a time. The FIFO allows the output to peek at a given entry multiple times; only on the final read (from the final row) does the FIFO consider a value “popped”. At that point, the input is able to overwrite the old value. The combiner also generates an almost_full feedback signal that is used to throttle the stereo pipeline’s output when the buffer approaches full.

All of the communication between the read/write ports of the rowbuffer must cross asynchronous clock domains. This is accomplished with a careful cross-domain handshake that ensures changes are propagated atomically and without risk of metastability (implemented in my dlsc_domaincross module). Gray-coded addresses are not used, so as to simplify the control logic and to more readily enable non-power-of-2 buffer depths.

Verification

Simulation has played an absolutely critical role in the development and verification of the stereo correspondence core. For a module of this size, it’s really the only practical strategy (an iterative write/synthesize/program/check-the-blinky-LEDs-for-correctness/repeat sort of process would have taken a very long time indeed).

Verification has been somewhat of a two step process: the first step was developing a C++ reference model that was functionally similar to the OpenCV implementation (but written for algorithmic clarity, rather than performance) and confirming that it did, in fact, match OpenCV. The second step then used this reference model to verify the actual Verilog implementation.

(the C++ code for the reference model can, if you’re curious, be found in stereo/tb/dlsc_stereobm_models.cpp )

The core was originally written (though not necessarily designed) in a bottom-up fashion, with unit-tests being written in conjunction with new Verilog modules. The top-level tests now serve as the primary testbench for the core.

The tests are just as parameterized as the Verilog modules they verify, so it’s easy to have verification coverage for a large number of possible core configurations (indeed: the initial addition of this bulk parameterized verification found a number of failing configuration corner-cases that were subsequently fixed).

SystemC

SystemC is my testbench language of choice (it’s primarily “just” a templated class/macro library that grafts HDL-like semantics onto native C++). Being C++, it’s much more testbench-friendly than plain-ol’-Verilog is.

Using SystemC for the testbench (and limiting Verilog code to just synthesizable constructs) opens up the option of using Verilator as the simulator. Verilator “synthesizes” Verilog into optimized C++ code (and optionally wraps that up in a SystemC module), which can be compiled right along with your testbench code to yield a relatively high-performance simulation executable (rivaling that of some very expensive commercial simulators). It’s on the order of 100x faster than the leading full-featured open-source Verilog simulator, Icarus Verilog. (Verilator is not full-featured, as it only supports the synthesizable subset of Verilog)

In the current incarnation of my verification environment, everything is handled through a system of Makefiles. To invoke, for example, the regression tests for dlsc_stereobm_prefiltered , you’d cd into stereo/tb and run something like: make -f dlsc_stereobm_prefiltered_tb.makefile -j8 sims

This launches several simulations in parallel, so the output won’t be terribly meaningful until it completes. After completion, a summary can be viewed with: make -f dlsc_stereobm_prefiltered_tb.makefile summary (this just greps the resultant log files for the final pass/fail report).

Similarly, you can get accumulated code coverage results using: make -f dlsc_stereobm_prefiltered_tb.makefile coverage_all (they’ll be well over 95% after a regression run). This is just block/line coverage (a necessary, but not really sufficient indicator of overall verification quality); I have not (thus far) written any functional coverage points for the core.

Most of my testbenches have a 1:1 mapping between .makefile and .sp (the actual SystemC/SystemPerl testbench) files. The top-level tests for the core are not quite this way; instead, a single testbench ( dlsc_stereobm_tb.sp ) is shared between them.

(there are some older testbenches around that haven’t been updated in a while; if a testbench doesn’t have a corresponding makefile, I probably haven’t gotten around to updating the testbench yet)

Verilog and Post-synthesis Simulation

In addition to the SystemC testbenches, I’ve also written a simplified pure-Verilog testbench for the pre-filtered core ( dlsc_stereobm_tbv.v ). Currently, my environment supports Verilog testbenches using the Icarus Verilog simulator. The Makefile-driven invocation is very similar.

If your verification environment doesn’t include SystemC, you may find the Verilog testbench to be a better starting point.

Since the C++ reference model can’t easily be run from within the Verilog testbench, the testbench relies on an external program to generate stimulus and expected-results files in a format compatible with Verilog’s $readmemh . dlsc_stereobm_models_program.cpp implements an executable wrapper around the C++ model.

The main reason for developing a Verilog testbench was for running post-synthesis and post-place&route simulations. Verilator (and, it turns out, Icarus as well) are unable to handle the complex non-synthesizable constructs that Xilinx necessarily uses in their simulation primitives. Thus, a simpler Verilog testbench is needed for executing simulations through Xilinx’s own ISim simulator.

I haven’t yet automated the process of invoking ISim, so post-synthesis simulations are currently a very manual process. The limited simulations that I have run indicate that the core works just fine after synthesis by Xilinx’s XST tool.

The reasoning behind wanting to run post-synthesis simulations is two-fold: to confirm that the synthesis tools are performing their job correctly, and to confirm that any inferred FPGA hard IP behaves the same as the inferable model (e.g. dual-port RAMs with special read/write conflict-resolution behavior).

Lacking a complete system with which to test in real hardware, post-synthesis simulation is a good way to confirm that a design will work correctly in the real world. And, while you may not be able to run as many test vectors against a slow simulation, simulation offers the added benefit of performing internal checks that hardware cannot (e.g. checking correct block RAM usage).

Environment setup

Like many HDL verification environments, this one ain’t entirely easy to set up. Getting it to work under Windows would be extremely difficult; I personally use Ubuntu 10.10 64-bit.

The environment has a variety of esoteric (outside the world of ASICs/FPGAs) tool dependencies (all free and mostly open-source):

Veripool – home of Wilson Snyder’s wholly remarkable collection of open-source verification tools, including a bunch I rely on:

Verilator – required for simulating Verilog with SystemC testbenches.

SystemPerl – pre-processor needed to convert my SystemPerl testbenches to SystemC, and to provide code coverage and waveform tracing functionality.

Verilog-Perl – provides Verilog parsing functionality required by one of my utilities.

SystemC – required if using my SystemC/SystemPerl testbenches. Somewhat of a hassle to build on 64-bit Ubuntu.

Icarus Verilog – only required if you want to run simulations with Verilog testbenches.

OpenCV – needed by my testbenches for loading/processing test images.

There are likely to be other smaller dependencies lurking behind these major ones.

The verification environment also relies on a few environment variables being set for it to function:

SYSTEMC – set to the root of your SystemC kit.

SYSTEMC_ARCH – set to the architecture your SystemC library was compiled for (e.g. linux or linux64 ).

VERILATOR_ROOT – set to the root of your Verilator installation.

SYSTEMPERL – set to the root of your SystemPerl kit (source tree).

DLSC_MAKEFILE_TOP – must specify an absolute path to common/mk/dlsc_common_top.makefile within your checkout of the dls_cores repository. It can’t have any spaces in it (partly due to Make’s inability to cope with them, and compounded by my reliance on generated absolute paths in certain portions of my makefiles).
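As a concrete (hypothetical) example, a setup script might look like the following; every path below is a placeholder for your own installation locations:

```shell
# Example environment setup for the verification flow. Every path below
# is a placeholder; point them at your own installations.
export SYSTEMC=/opt/systemc-2.2.0          # root of your SystemC kit
export SYSTEMC_ARCH=linux64                # or "linux" for 32-bit builds
export VERILATOR_ROOT=/opt/verilator       # root of Verilator installation
export SYSTEMPERL=/opt/SystemPerl          # SystemPerl source tree
# Absolute path, and it must not contain spaces:
export DLSC_MAKEFILE_TOP=/home/user/dls_cores/common/mk/dlsc_common_top.makefile
```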

Future development

The core, as it is, is a solid Version 0.9 – maybe a 1.0 Beta (or if you’re a fan of date-based versioning: Version 20110605). But there’s always more work to be done (and this isn’t even my day job!).

First and foremost: a complete system – something that is complete enough to actually run on real hardware – is a major upcoming milestone. I’ll be working towards that before I spend a lot more resources improving the correspondence core itself. The image-rectification block is next on my to-do list.

(again, have a look at my previous FPGA Stereo Vision Project post for more information about the overall stereo vision system)

In a somewhat-particular order, here’s a list of some possible future work on the stereo correspondence core (excluding work on the rest of the system):

Automate post-synthesis and post-place&route simulations.

Make sure all non-trivial modules (not just the bigger ones) have up-to-date unit-tests.

Simulate and optimize design power consumption (e.g. using Xilinx Power Analyzer).

Add support for minDisparity (better compatibility with OpenCV).

Make TEXTURE and UNIQUE_MUL parameters be run-time adjustable.

Reduce throughput cost of texture filtering (avoid wasting most of the parallel pipelines on the texture pass).

Add support for producing a complete disparity map (get rid of the masked left-margin; this is already mostly supported by my reference model).

Add support for multiple-baseline stereo vision setups (more than 2 cameras).

Add support for multi-channel setups (color RGB or YUV instead of just monochrome).

Implement “block” processing for reduced memory footprint and less bursty interfaces.

Improve support for Altera devices (perform timing optimizations; add Altera-specific Verilog metacomments).

Investigate more sophisticated correspondence algorithms (likely outside the scope of this particular block-matching correspondence core).

Downloads

All of my open-source Verilog FPGA IP cores are available in a Mercurial repository hosted by bitbucket: dls_cores. (at the moment, “all” largely means “this stereo correspondence core”; though there are some work-in-progress blocks lurking there too). You can download a ZIP snapshot of it, as well (that snapshot won’t, however, track future updates).

The entire repository (except where otherwise noted) is made freely available under a 3-clause BSD license.

Refer to the file list in the synthesis section for a list of all the relevant parts of the repository.