DNA Sequence Alignment Algorithms: Needleman-Wunsch and Smith-Waterman In the field of bioinformatics, one way to measure similarities between two (or more) sequences of DNA is to perform "a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences." Think of it this way: you've got two random strands of DNA - how do you know where one starts and one begins? How do you know if they come from the same organism? A closely related pair? You might use sequence alignment to see how the two strands might line up in relation to each other - subsequences may indicate similar functionality, or conservation through evolution.



In "normal" programming terms, you've got a couple of strings and want to find out how you might align them so they they look as much like one another as possible.



There are plenty of ways to achieve that goal. Since we haven't done much programming on here lately, I thought it would be nice to focus on two very similar algorithms that do so:



The idea behind these two algorithms is that we have a scoring scheme we want to maximize as successive "matches" occur. One popular



The particular scoring matrix you use will be determined by the goals you want to acheive. For our purposes, a simple matrix or two will suffice:



@substitution_matrix = [[ " " , "a" , "c" , "g" , "t" ] , [ "a" , 1 , 0 , 0 , 0 ] , [ "c" , 0 , 1 , 0 , 0 ] , [ "g" , 0 , 0 , 1 , 0 ] , [ "t" , 0 , 0 , 0 , 1 ]] @substitution_matrix2 = [[ " " , "s" , "e" , "n" , "d" , "a" ] , [ "s" , 4 , 0 , 1 , 0 , 1 ] , [ "e" , 0 , 5 , 0 , 2 , - 1 ] , [ "n" , 1 , 0 , 6 , 1 , - 2 ] , [ "d" , 0 , 2 , 1 , 6 , - 2 ] , [ "a" , 1 , - 1 , - 2 , - 2 , 4 ]]

The first @substitution_matrix is fairly simplistic - give one point for each match, and ignore any mismatches or gaps introduced. In @substitution_matrix2 what score should be given if "s" is aligned with "a"? (One.) What if "d" is aligned with another "d"? (Six.) The substitution matrix is simply a table telling you how to score particular characters when they are in the same position in two different strings.



After you've determined a scoring scheme, the algorithm starts scoring each pairwise alignment, adding to or subtracting from the overall score to determine which alignment should be returned. It uses



It feels strange to implement this as a class, but I did it to make it clear how trivially easy it is to derive Smith-Waterman (SW) from Needleman-Wunsch (NW). One design that jumps out at me would be to have a SequenceAligner where you can choose which algorithm as a method to run - then SW could use a NW algorithm where min_score is passed as a parameter to the method. Perhaps you can think of something even better.



Anyway, here's the Ruby class that implements the Needleman-Wunsch algorithm.



class NeedlemanWunsch @min_score = nil def initialize ( a , b , substitution_matrix , gap_penalty ) @a = a @b = b # convert to array if a/b were strings @a = a . split ( "" ) if a . class == String @b = b . split ( "" ) if b . class == String @sm = substitution_matrix @gp = gap_penalty end def get_best_alignment construct_score_matrix return extract_best_alignment_from_ score_matrix end def construct_score_matrix return if @score_matrix != nil #return if we've already calculated it initialize_score_matrix traverse_score_matrix do | i , j | if i == 0 && j == 0 @score_matrix [ 0 ][ 0 ] = 0 elsif i == 0 #if this is a gap penalty square @score_matrix [ 0 ][ j ] = j * @gp elsif j == 0 #if this is a gap penalty square @score_matrix [ i ][ 0 ] = i * @gp else up = @score_matrix [ i - 1 ][ j ] + @gp left = @score_matrix [ i ][ j - 1 ] + @gp #@a and @b are off by 1 because we added cells for gaps in the matrix diag = @score_matrix [ i - 1 ][ j - 1 ] + s ( @a [ i - 1 ] , @b [ j - 1 ] ) max , how = diag , "D" max , how = up , "U" if up > max max , how = left , "L" if left > max @score_matrix [ i ][ j ] = max @score_matrix [ i ][ j ] = @min_score if @min_score != nil and max < @min_score @traceback_matrix [ i ][ j ] = how end end end def extract_best_alignment_from_ score_matrix i = @score_matrix . length - 1 j = @score_matrix [ 0 ]. length - 1 left = Array . new top = Array . new while i > 0 && j > 0 if @traceback_matrix [ i ][ j ] == "D" left . push ( @a [ i - 1 ] ) top . push ( @b [ j - 1 ] ) i -= 1 j -= 1 elsif @traceback_matrix [ i ][ j ] == "L" left . push "-" top . push @b [ j - 1 ] j -= 1 elsif @traceback_matrix [ i ][ j ] == "U" left . push @a [ i - 1 ] top . push "-" i -= 1 else puts "something strange happened" #this shouldn't happen end end return left . join . upcase . reverse , top . join . upcase . reverse end def print_score_visualization construct_score_matrix print_as_table ( @score_matrix ) end def print_traceback_matrix construct_score_matrix print_as_table ( @traceback_ matrix ) end def print_as_table ( the_matrix ) puts puts "a=" + @a . to_s puts "b=" + @b . to_s puts print " " @b . each_index { | elem | print " " + @b [ elem ]. to_s } puts "" traverse_score_matrix do | i , j | if j == 0 and i > 0 print @a [ i - 1 ] elsif j == 0 print " " end print " " + the_matrix [ i ][ j ]. to_s puts "" if j == the_matrix [ i ]. length - 1 end end def traverse_score_matrix @score_matrix . each_index do | i | @score_matrix [ i ]. each_index do | j | yield ( i , j ) end end end def initialize_score_matrix @score_matrix = Array . new ( @a . length + 1 ) @traceback_matrix = Array . new ( @a . length + 1 ) @score_matrix . each_index do | i | @score_matrix [ i ] = Array . new ( @b . length + 1 ) @traceback_matrix [ i ] = Array . new ( @b . length + 1 ) @traceback_matrix [ 0 ]. each_ index { | j | @traceback_matrix [ 0 ][ j ] = "L" if j! = 0 } end @traceback_matrix . each_index { | k | @traceback_matrix [ k ][ 0 ] = "U" if k! = 0 } @traceback_matrix [ 0 ][ 0 ] = "f" end def s ( a , b ) #check the score for bases a. b being aligned for i in 0 . . ( @sm . length - 1 ) break if a . downcase == @sm [ i ][ 0 ]. downcase end for j in 0 . . ( @sm . length - 1 ) break if b . downcase == @sm [ 0 ][ j ]. downcase end return @sm [ i ][ j ] end end

Needleman-Wunsch follows that path, and finds the best global alignment possible. Smith-Waterman truncates all negative scores to 0, with the idea being that as the alignment score gets smaller, the local alignment has come to an end. Thus, it's best to view it as a matrix, perhaps with some coloring to help you visualize the local alignments.



All we really need to get Smith-Waterman from our implementation of Needleman-Wunsch above is this:



class SmithWaterman < NeedlemanWunsch def initialize ( a , b , substitution_matrix , gap_penalty ) @min_score = 0 super ( a , b , substitution_matrix , gap_penalty ) end end

However, it would be nice to be able to get a visualization matrix. This matrix should be able to use windows of pairs instead of each and every pair, since there can be thousands or millions or billions of base pairs we're aligning. Let's add a couple of methods to that effect:



#modify array class to include extract_submatrix method class Array def extract_submatrix ( row_range , col_range ) self [ row_range ]. transpose [ col_ range ]. transpose end end require 'needleman-wunsch' class SmithWaterman < NeedlemanWunsch def initialize ( a , b , substitution_matrix , gap_penalty ) @min_score = 0 super ( a , b , substitution_matrix , gap_penalty ) end def print_score_visualization ( wind ow_size = nil ) return super () if window_size == nil construct_score_matrix #score_matrix base indexes si = 1 #windowed_matrix indexes wi = 0 windowed_matrix = initialize_windowed_matrix ( win dow_size ) #compute the windows while ( si < @score_matrix . length ) sj = 1 wj = 0 imax = si + window_size - 1 imax = @score_matrix . length - 1 if imax >= @score_matrix . length while ( sj < @score_matrix [ 0 ]. length ) jmax = sj + window_size - 1 jmax = @score_matrix [ 0 ]. length - 1 if jmax >= @score_matrix [ 0 ]. length current_window = @score_matrix . extract_ submatrix ( si . .imax , sj . .jmax ) current_window_score = 0 current_window . flatten . each { | elem | current_window_score += elem } begin windowed_matrix [ wi ][ wj ] = current_window_score rescue end wj += 1 sj += window_size end wi += 1 si += window_size end #find max score of windowed_matrix max_score = 0 windowed_matrix . flatten . each { | elem | max_score = elem if elem > max_score } max_score += 1 #so the max normalized score will be 9 and line up properly #normalize the windowed matrix to have scores 0-9 relative to percent of max_score windowed_matrix . each_index do | i | windowed_matrix [ i ]. each_index do | j | begin normalized_score = windowed_matrix [ i ][ j ]. to_f / max_score * 10 windowed_matrix [ i ][ j ] = normalized_score . to_i rescue end end end #print the windowed matrix windowed_matrix . each_index do | i | windowed_matrix [ i ]. each_index do | j | print windowed_matrix [ i ][ j ]. to_s end puts end end def initialize_windowed_matrix ( win dow_size ) windowed_matrix = Array . new ((( @a . length + 1 ) . to_f ) / window_size ) windowed_matrix . each_index do | i | windowed_matrix [ i ] = Array . new ((( @b . length + 1 ) . to_f ) / window_size ) end return windowed_matrix end end

And now we'll try it out. First, we take two sequences and perform a







Then, we can take our own visualization, do a search and replace to colorize the results by score, and have a look:







Lo and behold, they look quite similar!



I understand the algorithms are a bit complex and particularly well explained, so I invite questions about them in particular. As always, comments and (constructive) criticisms are encouraged as well.



Hey! Why don't you make your life easier and subscribe to the full post or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate wisdom that I'm offering a no-strings-attached, lifetime money back guarantee! In the field of bioinformatics, one way to measure similarities between two (or more) sequences ofis to perform sequence alignment Think of it this way: you've got two random strands of DNA - how do you know where one starts and one begins? How do you know if they come from the same organism? A closely related pair? You might use sequence alignment to see how the two strands might line up in relation to each other - subsequences may indicate similar functionality, or conservation through evolution.In "normal" programming terms, you've got a couple of strings and want to find out how you might align them so they they look as much like one another as possible.There are plenty of ways to achieve that goal. Since we haven't done much programming on here lately, I thought it would be nice to focus on two very similar algorithms that do so: Needleman-Wunsch and Smith-Waterman The idea behind these two algorithms is that we have a scoring scheme we want to maximize as successive "matches" occur. One popular substitution matrix for scoring protein alignment is BLOSUM 62 (and here's a good PDF describing how BLOSUM came about ).The particular scoring matrix you use will be determined by the goals you want to acheive. For our purposes, a simple matrix or two will suffice:The firstis fairly simplistic - give one point for each match, and ignore any mismatches or gaps introduced. Inwhat score should be given if "s" is aligned with "a"? (One.) What if "d" is aligned with another "d"? (Six.) The substitution matrix is simply a table telling you how to score particular characters when they are in the same position in two different strings.After you've determined a scoring scheme, the algorithm starts scoring each pairwise alignment, adding to or subtracting from the overall score to determine which alignment should be returned. It uses dynamic programming , storing calculations in a table to avoid re-computation, which allows it to reverse course after creating the table to find and return the best alignment.It feels strange to implement this as a class, but I did it to make it clear how trivially easy it is to derive Smith-Waterman (SW) from Needleman-Wunsch (NW). One design that jumps out at me would be to have awhere you can choose which algorithm as a method to run - then SW could use a NW algorithm whereis passed as a parameter to the method. Perhaps you can think of something even better.Anyway, here's the Ruby class that implements the Needleman-Wunsch algorithm.Needleman-Wunsch follows that path, and finds the best global alignment possible. Smith-Waterman truncates all negative scores to 0, with the idea being that as the alignment score gets smaller, the local alignment has come to an end. Thus, it's best to view it as a matrix, perhaps with some coloring to help you visualize the local alignments.All we really need to get Smith-Waterman from our implementation of Needleman-Wunsch above is this:However, it would be nice to be able to get a visualization matrix. This matrix should be able to use windows of pairs instead of each and every pair, since there can be thousands or millions or billions of base pairs we're aligning. Let's add a couple of methods to that effect:And now we'll try it out. First, we take two sequences and perform a DNA dotplot analysis on them:Then, we can take our own visualization, do a search and replace to colorize the results by score, and have a look:Lo and behold, they look quite similar!I understand the algorithms are a bit complex and particularly well explained, so I invite questions about them in particular. As always, comments and (constructive) criticisms are encouraged as well. Comments Leave a comment



thanks for posting this, i remember coding this in Matlab for a course I took from Ron Elber in grad school but it was a lot easier to find this way and easy enough to run from matlab via a "system" command if need be :)



Posted by Daniel Korenblum on Aug 26, 2013 at 11:47 AM UTC - 5 hrs

Leave a comment



Leave this field empty Your Name

Email (not displayed, more info? Privacy Information x Your email address is not displayed. It is used only to respond to you if needed, and send comments if you subscribe to this comment thread. It is stored in a cookie if you choose to "Remember my details". Your email address is not displayed. It is used only to respond to you if needed, and send comments if you subscribe to this comment thread. It is stored in a cookie if you choose to "Remember my details".

Website



Comment:



Subcribe to this comment thread

Remember my details

