Smith-Waterman algorithm

The SW algorithm belongs to a class of algorithms known as dynamic programming. Dynamic programming is used when a large search space can be structured into a succession of stages such that the initial stage contains trivial solutions to subproblems [9]. Typically, this involves structuring the problem to an iterative calculation of cells in a scoring matrix. The following is the commonly used scheme to compute the score of a single cell, score_x, in the score matrix:

score_x = max {score_nw + match_bonus,

score_nw + mismatch_penalty,

score_n - opening_gap_penalty - extension_gap_penality,

score_w - opening_gap_penalty - extension_gap_penality,}

score_nw, score_n and score_w are the scores of the cells to the upper-left (NW), above (N) and left (W) of cell X, respectively (Figure 1). For simplicity, in our case, the match_bonus was 1 if the additional letters to the alignment are equal; the mismatch penalty was 1 if letters are not equal; the opening_gap_penalty was 1; the extension_gap_penalty was 0.1 for each additional gap. Thus, the score of each cell in the 2D matrix (except for the upper left corner) is calculated by three of its neighbouring cells.

Figure 1 Basic structure of the SW matrix. Each cell records the score of the SW matrix, which depends on the search and target sequences. NW, N, and W are cells to the northwest, north, and west of the cell of interest X. Full size image

Software implementation

A pure software implementation of the SW algorithm was developed in the C language to benchmark against FPGA-based implementations. A single_cell_module (SCM) was programmed containing the following I/O parameters: score_nw, score_n, score_w, flag_nw, flag_n, flag_w, flag_gap and result_score. The input parameters score_nw, score_n and score_w are scores of the NW, N and W neighboring cells, respectively. The input parameters flag_nw, flag_n, and flag_w indicate the direction of the gap (00 2 if no gap; 01 2 , gap from the target sequence; 102, gap from the search sequence) of the NW, N and W neighboring cells, respectively. Since the direction of the gap is known from the neighboring cells by the flags, we can determine if the incremental gap penalty of the cell of interest is an opening or extension gap penalty.

Thus, we can perform an affine gap penalty. The output parameters (score_gap and result_score) give the direction of the gap and the final score of the cell of interest, respectively.

The program first loads the target and search sequences into local memory from two text files stored in the flash memory. Then, their sequence lengths are determined and the scoring and gap matrices are created with dimensions of the above sequence lengths. Next, the score of each cell in the SW scoring matrix is calculated using the SCM. Lastly, the completed SW scoring matrix outputs the highest score in the matrix.

Custom instruction (CI) for SCM using an FPGA

Since calculating the SCM in the SW scoring matrix is repeated, this routine is a good candidate for FPGA-based hardware acceleration. The reconfigurable logic elements in FPGA can be optimally configured to run specific functions through the implementation of custom microprocessor instructions, which are assigned logic elements that perform user-defined operations. Custom instructions, in particular, allow the passing of multiple inputs and outputs in a single clock cycle while the pure software implementation using a conventional microprocessor is limited by the instruction set.

The SCM in the pure software implementation was converted to an equivalent FPGA-based custom instruction (hereafter, called 1×SCM) written in the Verilog hardware description language. Since the format for the custom instruction provided by our FPGA board (Altera Stratix) only permits two 32-bit inputs (Input_A and Input B, Figure 2) and our 1×SCM requires 6 inputs (3 scores and 3 flags), the inputs are partitioned and rearranged to be all read in a single clock cycle. Recall that the inputs for the SCM in the pure software implementation are score_nw, score_n, score_w, flag_nw, flag_n, and flag_w. Using bit masking and shifting bit operations, all input scores and their flags are passed to the 1×SCM of cell of interest in one clock cycle (Figure 2). The CI produces the cell score and flags quickly because it makes use of custom hardware, rather than using the standard instruction set of the Nios II as in the software version. The maximum field propagation delay of the 1×SCM was estimated to be 21 ns. Thus, the clock speed of this computation could be no faster than 47.6 MHz. Using the 1×SCM, we computed the cell score and gap flag calculations using a single instruction rather than several.

Figure 2 1×SCM I/O instruction and arrangement. A. Bit partition of custom instruction for the 1×SCM. The input_A and input_B are two 32-bit data containing the scores and flags from the three neighbouring cells (north, northwest and west). The output is one 32-bit data containing the final scores and the direction of alignment gap. The grey areas indicate the unused data bits. B. Schematic design of the inputs and outputs from one 1×SCM. Full size image

Lastly, we added the CI for the scoring of a single cell to the instruction set of the Nios II soft microprocessor on the FPGA, so that it can be called in a C program. The flow of computation is identical to that of the software implementation, except that instead of calling a function which describes the SCM, we call the CI.

A grid design of SCMs using an FPGA

To further improve the computation speed, we combined 64 instances of 1×SCM into an 8 by 8 grid module (hereafter, called 64×SCM) (Figure 3A), the maximum size allowed by our FPGA board. We programmed the FPGA such that within the grid, the score update of each 1×SCM is not synchronized to a clock, but rather triggered by the changes of scores in neighbouring cells in the W, NW and N direction (Figure 3A). This asynchronous data processing method allows scores to propagate throughout the grid as fast as the field propagation speed allows in the FPGA logic gates, hence drastically improving the computation speed. This implementation can be thought of having all 64 1×SCMs processing at the same time, while the score updating propagates in the grid. The SW matrix is divided into as many grids as needed, which are then calculated with 64×SCM one by one. Because all logic circuits are connected inside the 64×SCM, it takes only one clock cycle to compute the entire 8 by 8 grid. The maximum field propagation delay of the 64×SCM was estimated to be 324 ns. Thus, the clock speed of this computation could be no faster than 3.1 MHz.

Figure 3 64×SCM signal and computation propagation. A. One 64×SCM module aligns an 8-character partial search sequence to an 8-character target sequence. The arrows show the propagation directions of the signals. Because the 64×SCM is unclocked, there is no pre-determined path of propagation. B. When the search and/or target sequences are greater than 8, the scoring matrix is partitioned into many 8 by 8 segments, each to be computed by the 64×SCM. Full size image

As input, this module requires segments of the search and target sequences with a length of up to 8 characters (the length and width of 64×SCM). Also, it requires the scores and gap flags stored from prior 64×SCM calculations in the NW, N and W direction. A second module was created to calculate the maximum score of the 64×SCM by a cascade of max-finders that first finds the maximal score of each column and then finds the maximum of the columns to determine the overall maximum (Figure 4). In order to process sequences longer than the dimension of 64×SCM (in this case, 8), a controller module was programmed to reuse the 64×SCM. This module included a SRAM (static random access memory) block to store scores and gap flags from previous 64×SCM calculations as well as a finite state machine (FSM) to control loading and storing values to the SRAM (Figure 5). Lastly, to access this hardware from a C program, a final interface module defines a set of registers to hold the sequences, lengths, various flags and the final score.

Figure 4 The max-finder implementations in 64×SCM. A. The 2-input max-finder circuit implementation. Both inputs and the output are 16-bit data representing the SW matrix score. B. Construction of an 8-input max-finder using 2-input max-finders. C. Max score computation of a 64×SCM. The scores of each column of cells in 64×SCM are inputted to a custom designed 8-input max-finder, the outputs of the 8 columns are then compared against each other using another 8-input max-finder. The output of the last comparator gives the highest score of the 64×SCM. Full size image

Figure 5 State diagram of the finite state machine (FSM). The states of the moore-type FSM are in the rounded rectangles. The output at each state are defined by the following vector - (we = write enable for SRAM blocks, rm = reset 64×SCM matrix, ena_seq = enable sequences to be loaded, ena_sf = enable scores and flags to be loaded). To clear all scores and flags from the matrix, the FSM is set to the 'Reset' state. Next, the FSM remains in the 'Wait for Sequence Load' state until two sequences of length 8 or less have been loaded by the C program. Once this loading is completed, the C program will assert the done_load signal. At this point, the FSM releases the matrix's reset signal which causes the sequences, scores and flags to propagate through the matrix. After a set delay determined by the critical path of the circuit, the FSM asserts the done_sw signal, and enables the values just calculated to be written into the RAM. Theses scores and flags will be read from the RAM for the next block. The FSM then returns to the 'Wait for Sequence Load' state, and waits for the next length of sequences to come from the C program. This loop is repeated until the entire Smith-Waterman matrix has been calculated and the score of the optimal alignment has been determined. Finally, the results are printed to a command window on the computer. The FSM can be reset by writing to a status register, allowing the matrix to be used for another set of sequences. Full size image

The flow of computation of this hardware controlled from a C program is as follows. First, the search and target sequences are loaded from flash memory and copied to local memory. Once this is done, an on-chip timer is started. Second, score and gap matrices are initiated and the values reset. Third, the search and target sequences are encoded by a custom instruction and loaded into the 64×SCM with their lengths. DNA bases are encoded into two bits (A = 00, T = 01, G = 10, C = 11). Lastly, the result propagates through the grid and completes in a time determined by the field propagation delay. If the sequences are longer than 8 characters, steps 3 and 4 are repeated for the next grid (Figure 3B). Once all grids have finished, the timer is stopped and the running time is displayed on the screen.

Testing

We tested and compared the performance of the three implementations (pure software, 1×SCM, and 64×SCM) for aligning two DNA sequences with identical lengths ranging from 1 to 1024 base-pairs. We performed the same input for each implementation and measured the time to complete the computations (Table 1). Alignment of the each sequence length was performed three times to produce the statistical variance, which was less than 0.5%. The scoring matrices from the three implementations were compared to ensure identical alignment results. The performance of the implementation was found to be independent to the sequence-similarity between the two DNA sequence queries (data not shown).

Table 1 Computation time comparison Full size table

The 1×SCM implementation produced a maximal 2-fold speed improvement over the pure software implementation running on the same FPGA with an Altera Nios II softprocessor, while the 64×SCM implementation produced a maximal of 160-fold improvement over the pure software implementation (Figure 6). When the sequence length was smaller or equal to the size of the 64×SCM implementation, the computation time did not increase as the length of the sequence increased (Figure 7A). In comparison to the pure software and 1×SCM implementations, the computational time increased proportionally to square of the sequence length. When the sequence length was larger than the size of the 64×SCM implementation, the slopes of the log (computation time) vs. log (sequence length) graphs of all three implementations approach the same value (Figure 7A) and the speed per cell approaches a constant value (Figure 7B). Thus, in this case, the computation time of all three implementations increased proportionally to square of the sequence length. This is expected as the field propagation of the 64×SCM implementation is restricted to the 8 by 8 grid. However, if we increase the size of the grid to cover the average size of sequences comparisons (for example, one thousand base-pairs), we could significantly improve the computation time for the majority of sequence alignment queries.

Figure 6 Speed improvements. Computation speed improvements (in folds) of the three implementations. Black, 64×SCM over pure software; grey, 64×SCM over 1×SCM; white, 1×SCM over pure software. Full size image