Overview

The DNPCIe_10G_K7_LL is a PCIe-based FPGA board designed to minimize input-to-output processing latency on 10 GbE packets. The primary application is ultra-low-latency, high-throughput trading without CPU intervention. Every variable that affects input-to-output latency has been analyzed and minimized. Raw 10 GbE packets can be analyzed and acted upon without interrupts or an operating system adding delay to the process. This configurable hardware computing platform can achieve the theoretical minimum Ethernet packet processing latency.

The FPGA – Xilinx Kintex-7

We use a single FPGA from the Xilinx Kintex-7 family in the FFG676 package. This package supports 400 I/O, the majority of which are used. Most are dedicated to off-chip memory peripherals, including a single QDR II+ SSRAM for low-latency, high-speed look-up and a DDR3 Mini-DIMM for performance-oriented bulk storage. The Kintex-7 FPGA contains high-speed transceivers capable of 10 GbE without the need for an external PHY. Four of these transceivers are used for 4 lanes of GEN2-capable PCIe. Two of the transceivers are connected to 10 GbE SFP+ sockets.

Two possible FPGAs can be stuffed: the 7K410T or the 7K325T. Both FPGAs come in a variety of speed grades (-1, -2/-2L, -3), with -3 being the fastest. The -1 speed grade is not rated for 10 GbE transceiver operation, so it isn't applicable to this application. Table 1 depicts the resources of the two FPGAs with the Xilinx marketing exaggerations dispassionately amputated. These are both large but low-cost FPGAs. The 7K410T is capable of handling ~3 million ASIC gates of logic, and the 7K325T ~2.3 million gates. Features of the Kintex-7 FPGAs include efficient, dual-register 6-input look-up table (LUT) logic, 36 Kb (2 x 18 Kb) block RAMs, and second-generation DSP48E1 slices (which include 25 x 18 multipliers). Floating point functions can be implemented using these DSP slices.

Two Channels of 10 GbE

The Kintex-7 FPGAs have transceivers capable of 10 GbE. The physical interface is handled using SFP+ modules. This allows you to bypass a MAC if necessary and process raw Ethernet packets. The DNPCIe_10G_K7_LL has two 10 GbE channels and can support 10GBASE-ER, 10GBASE-SR, and 10GBASE-KR.
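To give a sense of the time budget involved, here is a back-of-envelope sketch of 10 GbE line-rate arithmetic. It uses the 156.25 MHz internal Ethernet clock quoted in the QDR II+ section of this document; the 64-bit datapath width is an assumption made for illustration and is not a stated board specification. The frame overhead figures are the standard Ethernet minimums.

```python
# Back-of-envelope 10 GbE timing. DATAPATH_BITS is an assumption, not a board spec.
CORE_CLOCK_HZ = 156_250_000   # internal Ethernet clock frequency quoted in this document
DATAPATH_BITS = 64            # assumed internal datapath width

MIN_FRAME_BYTES = 64          # minimum Ethernet frame, FCS included
PREAMBLE_SFD_BYTES = 8        # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12     # minimum gap between frames

# Data rate the internal datapath sustains at one word per clock.
line_rate_bps = CORE_CLOCK_HZ * DATAPATH_BITS

# Bits a minimum-size frame occupies on the wire, overhead included.
wire_bits_per_min_frame = 8 * (
    MIN_FRAME_BYTES + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES
)

max_frames_per_second = line_rate_bps / wire_bits_per_min_frame
ns_per_min_frame = 1e9 * wire_bits_per_min_frame / line_rate_bps
clock_period_ns = 1e9 / CORE_CLOCK_HZ

print(f"line rate:             {line_rate_bps / 1e9:.2f} Gb/s")
print(f"max frames per second: {max_frames_per_second / 1e6:.2f} M")
print(f"time per min frame:    {ns_per_min_frame:.1f} ns")
print(f"core clock period:     {clock_period_ns:.1f} ns")
```

Under these assumptions, back-to-back minimum-size frames arrive every 67.2 ns per port, which is 10.5 periods of the 156.25 MHz clock.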

QDR II+ SSRAM - Memory with the lowest latency

We use a single quad data rate static RAM (QDR II+ SSRAM) in the 4M x 18 size (72 Mbit). This type of memory has separate input and output data paths, enabling maximum read/write data bandwidth with minimum latency. The maximum tested frequency of this memory is 400 MHz. To minimize processing latency, we suspect it will be best to clock this QDR II+ SSRAM at 312.50 MHz, exactly twice the internal Ethernet controller frequency of 156.25 MHz. The Kintex-7 FPGAs are capable of generating internal 2x clocks that are phase synchronous, eliminating the latencies associated with the tricky re-synchronization of data moving between different clock frequencies. The internal controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the QDR II+ SSRAM can be exploited, including concurrent read and write operations and four-word bursts. The only real limitation is the amount of time and effort spent customizing the memory controller.
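The clock relationship described above can be checked with a little arithmetic. The sketch below is illustrative only: it takes the 156.25 MHz Ethernet clock, the 2x multiplier, and the 4M x 18 organization from the text, and uses the fact that each QDR II+ port transfers data on both clock edges. The variable names are ours.

```python
from fractions import Fraction

ETH_CLOCK_HZ = 156_250_000       # internal Ethernet controller clock (from the text)
QDR_MULTIPLIER = 2               # suggested 2x phase-synchronous clock
QDR_DATA_BITS = 18               # width of each data port (4M x 18 part)
QDR_DEPTH_WORDS = 4 * 2**20      # 4M words
BURST_WORDS = 4                  # four-word burst

qdr_clock_hz = ETH_CLOCK_HZ * QDR_MULTIPLIER

# Exact clock periods in nanoseconds.
eth_period_ns = Fraction(10**9, ETH_CLOCK_HZ)
qdr_period_ns = Fraction(10**9, qdr_clock_hz)

# Capacity in Mbit, matching the 72 Mbit figure.
capacity_mbit = QDR_DEPTH_WORDS * QDR_DATA_BITS // 2**20

# Each port (read and write are separate) moves two words per clock.
per_port_bps = qdr_clock_hz * 2 * QDR_DATA_BITS

# A four-word burst takes two QDR clocks, i.e. one Ethernet clock period.
burst_time_ns = qdr_period_ns * BURST_WORDS / 2

print(f"QDR clock:        {qdr_clock_hz / 1e6} MHz")
print(f"Ethernet period:  {float(eth_period_ns)} ns")
print(f"QDR period:       {float(qdr_period_ns)} ns")
print(f"capacity:         {capacity_mbit} Mbit")
print(f"per-port rate:    {per_port_bps / 1e9} Gb/s")
print(f"burst duration:   {float(burst_time_ns)} ns")
```

With the 2x clock, one four-word burst on either port occupies exactly one 6.4 ns Ethernet clock period, which is one reason the integer ratio is attractive.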

DDR3 DRAM - A large amount of local, bulk memory

A single PC3-10600/PC3-12800 DDR3 Mini-uDIMM socket enables up to 4 GB of memory for bulk storage and lookup. Assuming a 4 GB DIMM, the memory configuration is 512M x 72. Using a -2 or -3 speed grade FPGA, this interface is tested at the maximum FPGA I/O frequency: 666.5 MHz (1333 Mb/s with DDR, PC3-10600) or 800 MHz (1600 Mb/s with DDR, PC3-12800). You are welcome to use this memory as 64 bits of data with 8 bits of error correction (ECC), or as a 72-bit memory without correction.

To minimize data synchronization across clock boundaries, it probably makes sense to clock this DDR3 interface at a 3x multiple of the base Ethernet frequency of 156.25 MHz, which is 468.75 MHz. A 3x phase-synchronous clock can be easily generated internal to the FPGA, allowing zero-latency synchronous data transfers between the Ethernet packet receiving logic and the DDR3 memory controller. The DDR3 controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the DDR3 DRAM can be exploited and optimized. Up to 8 banks can be open at once. Timing variables such as CAS latency and precharge can be tailored to the minimum given your operating frequency and the timing specification of the exact DDR3 memory utilized. As with the QDR II+ SSRAM, the only real limitation is the amount of time and effort spent customizing the DDR3 memory controller to your needs.
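As a rough illustration of what the 3x clock choice means for bandwidth, the sketch below works out the peak transfer rate at 468.75 MHz and compares it with the rated module speeds mentioned earlier. It assumes the 64-bit data plus ECC configuration and a 4 GB DIMM; peak figures ignore refresh, bank conflicts, and controller overhead, so sustained throughput will be lower.

```python
ETH_CLOCK_HZ = 156_250_000       # base Ethernet frequency (from the text)
DDR3_MULTIPLIER = 3              # suggested 3x phase-synchronous clock
DATA_BITS = 64                   # payload width when 8 of the 72 bits carry ECC
DIMM_WORDS = 512 * 2**20         # 512M words for a 4 GB DIMM

ddr3_clock_hz = ETH_CLOCK_HZ * DDR3_MULTIPLIER

# DDR moves one word on each clock edge.
transfers_per_second = ddr3_clock_hz * 2

# Peak payload bandwidth, ECC bits excluded.
peak_bytes_per_second = transfers_per_second * DATA_BITS // 8

# Capacity seen by the application, ECC bits excluded.
capacity_bytes = DIMM_WORDS * DATA_BITS // 8

# Rated transfer rates of the supported modules, in transfers per second.
rated_transfers = {"PC3-10600": 1_333_000_000, "PC3-12800": 1_600_000_000}

print(f"DDR3 clock:      {ddr3_clock_hz / 1e6} MHz")
print(f"transfer rate:   {transfers_per_second / 1e6} MT/s")
print(f"peak bandwidth:  {peak_bytes_per_second / 1e9} GB/s")
print(f"capacity:        {capacity_bytes / 2**30} GiB")
for name, rate in rated_transfers.items():
    print(f"fraction of {name} rate: {transfers_per_second / rate:.2f}")
```

The trade-off is visible in the last two lines: running at 3x the Ethernet clock gives up part of the module's rated transfer rate in exchange for synchronous, resynchronization-free transfers.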

PCIe – Customizable 4-lane, GEN2 PCI Express

PCIe is connected directly to the FPGA via 4 lanes of GTX transceivers. Note that the board has a 16-lane mechanical finger for stability. The interface is fully GEN2 capable. We ship PCIe IP that is a full-function, fixed, 4-lane master/target. To gain access to the PCIe interface, this IP must be integrated with your application. The Dini Group PCIe IP provides a flexible interface that allows the user access to multiple DMA engines, scratchpad memories, interrupts, and other endpoint-related functions to maximize performance while utilizing minimal FPGA resources. Drivers with C source for several operating systems are included at no charge.
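For planning host transfers, the raw capacity of a 4-lane GEN2 link can be estimated as follows. This is a sketch using the standard GEN2 signaling rate (5 GT/s per lane) and 8b/10b line encoding; it does not account for packet headers, flow control, or the behavior of any particular DMA engine, so achievable throughput will be lower.

```python
LANES = 4                            # lanes wired to the FPGA (from the text)
GEN2_TRANSFERS_PER_S = 5_000_000_000 # PCIe GEN2 signaling rate per lane
ENCODED_BITS = 10                    # 8b/10b: every 10 line bits...
PAYLOAD_BITS = 8                     # ...carry 8 bits of data

raw_bits_per_second = LANES * GEN2_TRANSFERS_PER_S

# Remove the 8b/10b encoding overhead to get usable bits per direction.
usable_bits_per_second = raw_bits_per_second * PAYLOAD_BITS // ENCODED_BITS
usable_bytes_per_second = usable_bits_per_second // 8

print(f"raw line rate:   {raw_bits_per_second / 1e9} Gb/s per direction")
print(f"after 8b/10b:    {usable_bits_per_second / 1e9} Gb/s per direction")
print(f"                 {usable_bytes_per_second / 1e9} GB/s per direction")
```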

How Everything Works...

With direct data feeds such as NASDAQ ITCH and OUCH, the DNPCIe_10G_K7_LL contains all of the basic functions required to minimize the amount of time it takes to receive Ethernet packets, process them, and respond deterministically. By using the FPGA to process Ethernet packets, the processor and operating system are removed from the critical path, and traditional sources of latency such as interrupts and context switching no longer hinder performance, not even by a single clock cycle. For algorithms requiring processing, FPGA resources can be hard coded to perform the task, including real-time Monte Carlo analysis and floating point. This makes the DNPCIe_10G_K7_LL particularly suitable for compliance checking, high-frequency trading, low-latency trading, derivative pricing, and risk management.

Specs of FPGAs Available on the DNPCIe_10G_K7_LL