*Note, if you want to skip the background / alignment calculations and go straight to where the code begins, just click here.

Dynamic Programming and DNA

Dynamic programming has many uses, including identifying the similarity between two different strands of DNA or RNA, protein alignment, and in various other applications in bioinformatics (in addition to many other fields). For anyone less familiar, dynamic programming is a coding paradigm that solves recursive problems by breaking them down into sub-problems using some type of data structure to store the sub-problem results. In this way, recursive problems (like the Fibonacci sequence for example) can be programmed much more efficiently because dynamic programming allows you to avoid duplicate (and hence, wasteful) calculations in your code. Click here to read more about dynamic programming.

Let’s dive into the main topic of this post by implementing an algorithm to measure similarity between two strands of DNA.

Motivation

Measuring the similarity between two different sequences of DNA is very useful because it can help tell us how closely related (or not) those sequences of DNA and their sources are (e.g. comparing the DNA of two different species, or two different genes). This helps to give insight into the structure and function of a newly identified gene or DNA sequence.

Background

For our similarity example, we will measure the global alignment between two strands of DNA, and then show how this can be done more generally with dynamic programming. “Global” means we will use the entirety of each sequence of DNA. “Alignment” means we are trying to align the sequences of DNA such that some score is maximized based upon how many matches / mismatches / gaps we have between the strands. By aligning the DNA sequences pairwise, we can create gaps in between nucleotides to create better alignments. We’ll get more into that in a moment. There are other ways of assessing DNA similarity, and I may cover some of them in a future article, but for now we’ll stick with global alignment.

DNA is made of four types of nucleotides – adenine, guanine, cytosine, and thymine. These are typically abbreviated by the letters A, G, C, and T, respectively. Programmatically, DNA can be represented as a string of characters, where each character must be one of A, G, C, or T. Suppose, then, that we have the two sequences of DNA as seen below.

Sequence 1 ==> G T C C A T A C A Sequence 2 ==> T C A T A T C A G

How to measure the similarity between DNA strands

We need a metric to use for computing the global alignment between DNA strands. We’re going to use a scoring method (see below) in conjunction with the Needlman-Wunsch algorithm to figure out the optimal global alignment. An example of performing non-optimal global alignment on our sequences is the following:

Sequence 1 ==> G T C C A T A C A Sequence 2 ==> T C A T A T C A G

In the above case we are merely lining up the sequences of DNA pairwise, and highlighting the matches between the sequences. In this way, we have two matches (hightlighted) and seven mismatches.

Let’s define the score we are trying to maximize as the following:



Add 1 for each match between the sequences

Subtract 1 for each mismatch

Subtract 1 for each gap i.e. insertion / deletion (shown by “-“)



The exact scoring metric (adding one for match, penalizing one for mismatch / gap) may vary between different scoring systems, but we’ll keep with our defined scoring methodology for the purposes of this post.

By this measure, our global alignment above gives us a score of 2 matches minus 7 mismatches = -5. We can improve this score if we take into account that each particular strand may have insertions or deletions i.e. we may shift the nucleotides (or characters) in each particular sequence to form a better alignment, provided we don’t actually reorder either sequence. In other words, we can shift nucleotides in a sequence as long as the order of the nucelotides doesn’t change. If that isn’t clear, the example below shows how we can achieve the maximum global alignment score by a modified alignment.

Sequence 1 ==> G T C C A T A - C A - Sequence 2 ==> - T C - A T A T C A G

The highlighted letters (i.e. nucelotides) show where there is a match between the DNA strands. Here we have seven matches, and four mismatches.

Our score is 7 – 4 = 3. This is much better than our previous score of -5! Note how the order of the nucleotides (i.e. characters in the “DNA string”) did not change in either sequence; we merely created gaps which represent insertions / deletions. The reason doing this makes sense biologically is that insertions or deletions of nucleotides are common forms of mutations. For example, two strands of DNA may be related in function, but mutations caused extra nucleotides to be inserted (or deleted) from one or both sequences.

Now, to come up with a more generalized way of finding a maximum score solution, let’s construct a table like below so that one of the sequences runs across the header, while the other runs across the rows.

– G T C C A T A C A – 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 T -1 C -2 A -3 T -4 A -5 T -6 C -7 A -8 G -9

Each particular cell’s value will represent the max score achieved by pairing each strand of DNA up until that many rows and columns. For example, the -3 in the top row (disregarding the header row) is the value -3 because that is the scored achieved by pairing a gap (“-“) with the first three nucelotides of the header DNA sequence, GTC.

Likewise, we get each of the values -1, …, -9 by pairing each successive subsequence of the header column to the gap, “-”

-, G ==> -1

-, GT ==> -2

-, GTC ==> -3,

-, GTCC ==> -4

-, GTCCA ==> -5

-, GTCCAT ==> -6

-, GTCCATA ==> -7

-, GTCCATAC ==> -8

-, GTCCATACA ==> -9

In each subsequence, every nucleotide is compared to “-“. Since no nucleotide in any of the subsequences matches “-“, the score of each matched to “-” is just -1 times the length of the subsequence.

Filling in another row and column gives us this:

G T C C A T A C A 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 T -1 -1 0 -1 -2 -3 -4 -5 -6 -7 C -2 -2 A -3 -3 T -4 -4 A -5 -5 T -6 -6 C -7 -7 A -8 -8 G -9 -7

The calculations for the first entries of the new row can be seen below:

G ====> G ====> -1 = -1 T T G T ====> G T ====> -1 + 1 = 0 T - T G T C ====> G T C ====> -1 + 1 - 1 = -1 T - T - G T C C ====> G T C C ====> -1 + 1 - 1 - 1 = -2 T - T - - G T C C A ====> G T C C A ====> -1 + 1 - 1 - 1 - 1 = -3 T - T - - - G T C C A T ====> G T C C A T ====> -1 + 1 - 1 - 1 - 1 - 1 = -4 T - T - - - - G T C C A T A ====> G T C C A T A ====> -1 + 1 - 1 - 1 - 1 - 1 - 1 = -5 T - T - - - - - G T C C A T A C ====> G T C C A T A C ====> -1 + 1 - 1 - 1 - 1 - 1 - 1 - 1 = -6 T - T - - - - - - G T C C A T A C A ====> G T C C A T A C A ====> -1 + 1 - 1 - 1 - 1 - 1 - 1 - 1 - 1 = -7 T - T - - - - - - -

Finishing the rest of the table

Now we finish the rest of the table:

G T C C A T A C A – 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 T -1 -1 0 -1 -2 -3 -4 -5 -6 -7 C -2 -2 -1 1 0 -1 -2 -3 -4 -5 A -3 -3 -2 0 0 1 -1 -2 -3 -4 T -4 -4 -3 -1 -1 0 2 1 0 -1 A -5 -5 -4 -2 -1 0 1 2 1 0 T -6 -6 -5 -3 -3 0 1 2 1 1 C -7 -7 -6 -4 -2 -2 0 1 3 2 A -8 -8 -7 -5 -3 0 -1 1 2 4 G -9 -7 -8 -6 -4 -1 -1 1 1 3



The diagonal entries are calculated like this:

G ====> G ====> -1 = -1 T T G T ====> G T - ====> -1 + 1 - 1 = -1 T C - T - G T C ====> G T C - ====> -1 + 1 + 1 - 1 = 0 T C A - T C A G T C C ====> G T C C - ====> -1 + 1 + 1 - 1 - 1 = -1 T C A T - T C A T G T C C A ====> G T C C - A ====> -1 + 1 + 1 - 1 - 1 + 1 = 0 T C A T A - T C A T A G T C C A T ====> G T C C - A T ====> -1 + 1 + 1 - 1 - 1 + 1 + 1 = 1 T C A T A T - T C A T A T G T C C A T A ====> G T C C A T A - - ====> -1 + 1 + 1 - 1 + 1 + 1 + 1 - 1 - 1 = 1 T C A T A T C - T C - A T A T C G T C C A T A C ====> G T C C A T A - C - ====> -1 + 1 + 1 - 1 + 1 + 1 + 1 - 1 + 1 - 1 = 2 T C A T A T C A - T C - A T A T C A G T C C A T A C A ====> G T C C A T A - C A - ====> -1 + 1 + 1 - 1 + 1 + 1 + 1 - 1 + 1 + 1 - 1 = 3 T C A T A T C A G - T C - A T A T C A G

Python implementation

Now – how do we actually implement this in Python? First, let’s initialize our table, which we’ll represent as a NumPy array. We’ll also define a function GET_SCORE that will test if two input nucleotides are equal or not – returning a reward (default 1) if they are and returning a penalty (default -1) if they are not.

import numpy as np X = "GTCCATACA" Y = "TCATATCAG" def GET_SCORE(n1, n2, penalty = -1, reward = 1): if n1 == n2: return reward else: return penalty # initialize score matrix score_matrix = np.ndarray((len(X) + 1, len(Y) + 1)) for i in range(len(X) + 1): score_matrix[i, 0] = penalty * i for j in range(len(Y) + 1): score_matrix[0, j] = penalty * j

Calculating the value of each cell

Next, we need to set the values of each of the remaining cells. The value of each cell is recursively based upon 1) the cells to the immediate left, above, and diagonally left; 2) whether there’s a match or gap between the nucleotides being compared (one from each sequence).

More specifically, the value of the cell in the i,j position of score_matrix (i.e. score_matrix[i, j]) is calculated by taking the max of the following:



score_matrix[i – 1, j – 1] + GET_SCORE(X[i – 1], Y[j – 1])

score_matrix[i, j – 1] + penalty

score_matrix[i – 1, j] + penalty

The indexes can be a bit confusing here, so let’s pause for a moment.

Remember our matrix is (len(X) + 1) x (len(Y) + 1)

X and Y represent the two respective DNA sequences.