The following problem appeared as a project in the coursera course Algorithmic Thinking (by RICE university), a part of Fundamentals of Computing specialization. The following description of the problem is taken directly from the project description.

In this article, a central problem in evolutionary and molecular biology, namely pairwise sequence alignment, will be discussed. In particular, let’s reason about the structure of the problem, turn it into an optimization problem, and investigate two attempts at solving it. Then conclude with assessing the significance of alignments computed. The dynamic programming algorithmic techniques will be used for solving the problems. The following figures show the DP algorithms to be used for sequence alignments.

Let’s compute the global alignment between the sequences X = ACCT and Y = TACGGT with the following scoring matrix:



The following animation shows the global pairwise alignment between the sequences.





Problems



The following figure shows the amino acid sequences that form the eyeless proteins in the human and fruit fly genomes, respectively. The scoring matrix PAM50 is shown below, which is defined over the alphabet {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,B,Z,X,-} which represents all possible amino acids and gaps (the “dashes” in the alignment).

Next, let’s compute the local alignments of the sequences of HumanEyelessProtein and FruitflyEyelessProtein using the PAM50 scoring matrix and enter the score and local alignments for these two sequences below. The next figure shows the heatmap of the local alignment matrix computed.

The local alignment score computed is 875 and the following figure shows the local alignments of the two sequences.

To continue our investigation, we next consider the similarity of the two sequences in the local alignment computed in the last question to a third sequence, a “consensus” sequence of the PAX domain: that is, the sequence of amino acids in the PAX domain in any organism. In this problem, we will compare each of the two sequences of the local alignment computed earlier to this consensus sequence to determine whether they correspond to the PAX domain. The following image shows the consensus sequence.

Let’s first compute the global alignment of this dash-less sequence with the ConsensusPAXDomain sequence. Then let’s compare corresponding elements of these two globally-aligned sequences (local vs. consensus) and compute the percentage of elements in these two sequences that agree. To reiterate, we need to compute the global alignments of local human vs. consensus PAX domain as well as local fruitfly vs. consensus PAX domain. The following images show the heatmaps of the global alignment matrices.

Is it likely that the level of similarity exhibited by the answers could have been due to chance? For the local human vs. consensus PAX domain alignment, the chance of 97 corresponding elements matching between two 133 element sequences whose elements are chosen at random from a 23 character alphabet is less than 10(−100)10(−100). So, it’s very highly unlikely that the level of similarity exhibited by the two original sequences HumanEyelessProtein and FruitflyEyelessProtein could have been due to chance. Hypothesis testing One weakness of our last approach was that we assumed that the probability of any particular amino acid appearing at a particular location in a protein was equal. In the next two questions, we will consider a more mathematical approach to answering the above question that avoids this assumption. In particular, we will take an approach known as statistical hypothesis testing to determine whether the local alignments computed are statistically significant (that is, that the probability that they could have arisen by chance is extremely small). Write a function generate_null_distribution(seq_x, seq_y, scoring_matrix, num_trials) that takes as input two sequences seq_x and seq_y, a scoring matrix scoring_matrix M, and a number of trials num_trials. This function should return a dictionary scoring_distribution that represents an un-normalized distribution generated by performing the following process num_trials times: Generate a random permutation rand_y of the sequence seq_y using random.shuffle(). Compute the maximum value score for the local alignment of seq_x and rand_y using the score matrix scoring_matrix . Increment the entry score in the dictionary scoring_distribution by one. Use the function generate_null_distribution to create a distribution with trials using the protein sequences HumanEyelessProtein and FruitflyEyelessProtein (using the PAM50 scoring matrix). Then, create a bar plot of the normalized version of this distribution.