Pairwise Alignment Introduction

What is Pairwise Alignment?

Pairwise alignment is the process of aligning two DNA, RNA or protein sequences such that the regions of similarity are maximized. This is often performed to find functional, structural or evolutionary commonalities.

In most cases, scientists use two protein sequences to quantitatively find relatedness (aka homology ). With this, they are able to identify common domains and motifs, and sequence ancestry.

Domains and Motifs Domains are parts of a DNA or amino acid strand that code for a physiochemically similar feature as found in other sequences and proteins. Domains refer to specific functionalities. For example, you could have a ATP-binding domain or polar domain. Motifs are similar, but reference the structural characteristics rather than functional regions. Motifs are often found in domains, although that's not always the case.

Protein vs. DNA sequence alignment

Protein amino acid sequences are preferred over DNA sequences for a list of reasons.

Protein residues are more informative - a change in DNA (especially the 3 rd position) does not necessarily change the AA.

position) does not necessarily change the AA. The larger number of amino acids than nucleic acids makes it easier to find significance.

Some amino acids share related biochemical properties, which can be accounted for when scoring multiple pairwise alignments.

Protein sequence comparisons can link back to over a billion years ago, whereas DNA sequence comparisons can only go back up to 600 mya. Thus, protein sequences are far better for evolutionary studies.

However, there are some obvious instances when DNA alignments are needed.

When confirming the identity of cDNA (forensic sequencing).

When studying noncoding regions of DNA. These regions evolve at a faster rate than coding DNA, while mitochondrial noncoding DNA evolves even faster.

When studying DNA mutations.

When researching on very similar organisms such as Neanderthals and modern humans.

Biochemistry 101 Review

Before we move on, let's take a quick review on some elementary biochemistry and notations.

Nucleotide Codes

We're all familiar with the four nucleotide bases - however, there are other symbols used for more ambiguous nucleotides.

Symbol Meaning Explanation A A Adenine C C Cytosine G G Guanine T T Thymine R A or G puRine Y C or T pYrimidine M A or C aMino K G or T Keto S C or G Strong interaction (3 bonds) W A or T Weak interaction (2 bonds) H A, C or T (not G) H is after G B C, G, or T (not A) B is after A V A, C or G (not T) V is after T and U D A, G or T (not C) D is after C N A, C, G or T aNything

A CG bond is stronger than an AT bond due to it having one more hydrogen bond. Source: Wikipedia

Amino Acid Residue Codes

Amino acids can be represented with one or three letters. Take some time to review these.

1-letterA 3-lettersA Amino Acid AA AlaA Alanine CA CysA Cysteine DA AspA Aspartic Acid EA GluA Glutamic Acid FA PheA Phenylaline GA GlyA Glycine HA HisA Histidine IA IleA Isoleucine KA LysA Lysine LA LeuA Leucine MA MetA Methionine NA AsnA Asparagine OA PylA Pyrrolysine PA ProA Proline QA GlnA Glutamine RA ArgA Arginine SA SerA Serine TA ThrA Threonine UA SecA Selenocysteine VA ValA Valine WA TrpA Tryptophan XA XaaA Undetermined YA TyrA Tyrosine ZA GlnA Glutamic acid or glutamine

Amino Acids license plate game A good tip to memorizing these is to play the amino acids license plate game! Keep a printout of the following table. When you and your cool friends are out for a drive, try to translate each license plate letter into amino acids. Sounds nerdy, but very effective in learning. Bonus points for knowing the properties and/or structures!

Amino acids grouping

There are several ways to group amino acids, depending on their functionalities and biochemical properties.

Amino acids and their biochemical properties. From Wikipedia.

With nonpolar (hydrophobic) side chains: alanine, valine, leucine, isoleucine, proline, methionine, phenylaline, tryptophan

With uncharged polar side chains: tyrosine, asparagine, glutamine, glycine, serine, threnine, cystein

With positively charged side chains: histidine, lysine, arginine

With negatively charged side chains: aspartic acid, glutamic acid

Homology - a qualitative measure

Richard Owen, an English biologist who lived from 1804 to 1892, introduced the term homology , stating that it is "the same organ in different animals under every variety of form and function."

Today, we use the term homology is used to characterize biological species that share a common evolutionary ancestory.

Strictly a qualitative measure! Note that homology is a binary qualitative measure - either two sequences are homologous or not. Homology cannot be quantified. Thus, it is incorrect to claim that "two sequences are 55% homologous" - we use percent identity or similarity to quote numbers, which we will talk about in the next lesson.

Orthologous vs. Paralogous

There are two subclasses of homology - orthologous and paralogous.

Paralogous is when gene duplication occurs, but both copies descend side-by-side during the history of the organism (para = in parallel). This phenomenon occurs within the species. For example, human alpha-1 globin is paralogous to alpha-2 globin because they resulted from a gene duplication that arose from a single organism. Paralogous genes are assumed to carry common functions.

When speciation occurs, and a gene is inherited in both species, these sequences are said to be orthologous (ortho = exact). For example, the human and rat myoglobins are orthologous - the sequence that codes for this protein comes from a common ancestral gene.

Further explanation

In simplified terms, orthology is the homology between species, while parology is the homology within species. If you see the figure below from Jensen et al., the ancestral gene has two copies of a particular gene (A and B). Relative to the ancestral genome, this is considered paralogs since they are within its own species.

The difference between orthology and parology.

However, once speciation occurs, two copies of genes A and B are created. Since the duplication of A and B genes occurred before speciation, A and B genes are still considered paralogs. However, genes A1 and A2 are orthologs, as are B1 and B2.

In the second case, gene duplication occurs after speciation. Thus, A2 and B2 are orthologs of A1, while A2 and B2 are paralogs of each other.

References

R.A. Jensen, Orthologs and paralogs - we need to get it right, Genome Biology, 2(8), 2001

Richard Owen Biography. UC Berkeley.

Identity and Similarity - a quantitative measure

To assess the similarity between two proteins, we first perform pairwise alignments. Pairwise alignment algorithms find the optimal alignment between two sequences including gaps. There are several algorithms that perform this including BLAST, FASTA and LALIGN.

After an alignment is made, we can extract two quantitative parameters from each pairwise comparison - identity and similarity.

Identity

Identity defines the percentage of amino acids (or nucleotides) with a direct match in the alignment.

What about some residues that aren't quite exact, but very similar? As you may recall from biochemistry 101, many residues are similar biochemically, structurally or functionally. To account for this, we can use the similarity quantifier.

Similarity (aka Positives)

When one amino acid is mutated to a similar residue such that the physiochemical properties are preserved, a conservative substitution is said to have occurred. For example, a change from arginine to lysine maintains the +1 positive charge. This is far more likely to be acceptable since the two residues are similar in property and won't compromise the translated protein.

Thus, percent similarity of two sequences is the sum of both identical and similar matches (residues that have undergone conservative substitution). Similarity measurements are dependent on the criteria of how two amino acid residues are to each other.

Similarity = Positives Keep in mind that on a BLAST search, similarity is also known as Positives !

What it looks like in BLAST

In a BLAST search, part of your results will come out like so:

...ARFSGTWYAMAK...: .||||.:| ...QKVAGTWYSLAM...

From this diagram, we can see periods (.), colons (:) and a vertical pipe (|). The periods mean the residues are somewhat similar, while colon mean they are very similar. A vertical pipe signifies a direct match.

Another notations commonly encountered is using a + sign instead of :, and letter for the matching residue instead of |. For example.

Calculations

Let's look at some a quick example to see how identity and similarity are calculated.

Say Sequence A has 320 AA, while Sequence B has 450 AA. Using BLAST to perform a pairwise alignment, we see that 100 amino acids are identical. % identity is

Identity = 100 / 320 = 31.25%.

We always use the smaller sequence length as the denominator.

Additionally, we see that there are 23 amino acids that are different by conservation substitution, meaning that their chemical properties are maintained.

To calculate similarity (a.k.a. positives),

Similarity = (100 + 23) / 320 = 38.44%

Thus, our sequences are 31.25% identical and 38.44% similar. Similarity is always greater than identity. Can you see why?

A quick look at BLAST

BLAST (Basic Local Alignment Search Tool) serves two purposes:

Align two sequences and look for homology Search a sequence in a database to find similar and related sequences.

Without diving into too much details about BLAST (which we will cover in a later series), let's perform a simple query to get a feel for how to use it.

There are several types of BLAST that depend on what your query sequence is (DNA or protein) and what you want to match it to. For this run, let's stick with blastp, in which you enter a protein sequence and it matches to a similar protein sequence from a database.

Using BLAST

The first thing to do is to go to the NCBI page for BLAST. From here, click protein blast (blastp), which is located under "basic BLAST."

You should get a window that looks like this:

BLAST website for protein BLAST (blastp).

1) Running a query against a database

We can search entire databases with a query. The query can be inputted with an accession number, gi (think of these as ID's for a specific protein sequence) or FASTA format.

What is FASTA? FASTA format simply has the first line beginning with a > that describes the sequence. Any following lines are the protein sequence itself. For example: >gi|293651548|ref|NP_001170841.1| nuclear factor NF-kappa-B p100 subunit isoform b [Mus musculus] MDNCYDPGLDGIPEYDDFEFSPSIVEPKDPAPETADGPYLVIVEQPKQRGFRFRYGCEGPSHGGLPGASS EKGRKTYPTVKICNYEGPAKIEVDLVTHSDPPRAHAHSLVGKQCSELGVCAVSVGPKDMTAQFNNLGVLH VTKKNMMEIMIQKLQRQRLRSKPQGLTEAERRELEQEAKELKKVMDLSIVRLRFSAFLRASDGSFSLPLK PVISQPIHDSKSPGASNLKISRMDKTAGSVRGGDEVYLLCDKVQKDDIEVRFYEDDENGWQAFGDFSPTD VHKQYAIVFRTPPYHKMKIERPVTVFLQLKRKRGGDVSDSKQFTYYPLVEDKEEVQRKRRKALPTFSQPF GGGSHMGGGSGGSAGGYGGAGGGGSLGFFSSSLAYNPYQSGAAPMGCYPGGGGGAQMAGSRRDTDAGEGA EEPRTPPEAPQGEPQALDTLQRAREYNARLFGLAQRSARALLDYGVTADARALLAGQRHLLMAQDENGDT PLHLAIIHGQTGVIEQIAHVIYHAQYLGVINLTNHLHQTPLHLAVITGQTRVVSFLLQVGADPTLLDRHG DSALHLALRAGAAAPELLQALLRSGAHAVPQILHMPDFEGLYPVHLAVHARSPECLDLLVDCGAEVEAPE RQGGRTALHLATEMEELGLVTHLVTKLHANVNARTFAGNTPLHLAAGLGSPTLTRLLLKAGADIHAENEE PLCPLPSPSTSGSDSDSEGPERDTQRNFRGHTPLDLTCSTKVKTLLLNAAQNTTEPPLAPPSPAGPGLSL GDAALQNLEQLLDGPEAQGSWAELAERLGLRSLVDTYRKTPSPSGSLLRSYKLAGGDLVGLLEALSDMGL HEGVRLLKGPETRDKLPSTAEVKEDSAYGSQSVEQEAEKLCPPPEPPGGLCHGHPQPQVH

Try inputting the above FASTA sequence.

Do not check the "Align two or more sequences" options.

Select the "non-redundant protein sequences (nr)" for the database.

For the organism name, use "human (taxid:9606)".

You'll notice that there are different types of BLAST you can perform - PSI-BLAST, PHI-BLAST and DELTA-BLAST. We'll cover these advanced BLAST variations in a later lesson.

There is also another window down at the bottom for Algorithm parameters, where you can fiddle with the scoring matrix, different gap penalties and more. But for now, click the big BLAST button to run your sequence!

A quick BLAST run.

After waiting for your query to be processed...Great! You just ran a BLAST search! Looks like you just found yourself the human ortholog of a mouse protein.

Scroll down to the bottom to the Descriptions panel, and you can see all the matches that are similar to your query.

Results of the best scoring matches on top.

You can scroll further down to see the actual alignments with the Identities and Similarities (called Positives) scores next to them.

Scroll down to see the top-scoring alignment with its identity and similarity (positives) scores.

2) Running a pairwise comparison

The other use of BLAST is for pairwise comparisons. This means you aren't querying a database, but just inputting two sequences and seeing how well they match up. To switch to pairwise comparison mode, click the "Align two or more sequences" option.

For the two sequences here, let's use gi|293651548 and gi|158256336.

A simple pairwise alignment with two proteins, given by their GI's.

Click the big BLAST button once again and wait for your query to be processed. Then scroll down and check your results.

In the Descriptions section there is just one alignment...but why are there multiple in the Alignments section? This is simply because there are several ways that BLAST can align your sequences. The top-scoring alignments are found on the top, while lower-scoring ones are at the bottom. For the most part, you'll want to look at the top-most result.

Results for a pairwise alignment run.

Wondering how the scoring system goes? We'll see that in the next few pages!

Dayhoff Model & accepted point mutations (PAMs)

We can see that BLAST is able to align two sequences - but how does it pick the best sequence? To answer this, we have to look at scoring matrices , which assign a score to each gap or residue alignment.

If we had the following alignment, what score would it have? How would it assign residues are similar vs. identical?

...ARFSGTWYAMAK... : .||||.:| ...QKVAGTWYSLAM...

Considerations for a scoring matrix

One key point to notice is that the substitution for one amino acid can be more physiochemically accepted than another. For example, arginine mutating into lysine isn't that bad since both have electrically charged side chains. However, if arginine mutated to glutamic acid, the charge would be changed from +1 to -1. Such a drastic change may render the protein useless!

Lysine and Arginine both have a positively charged side chain (+1).

We also must account that some amino acids are more commonly available from DNA. For example, serine has four different codon sequences, while tryptophan only has one, making it statistically more probably for serines to show up.

There are two amino acid substitution matrices that help score alignments. The first is the PAM substitution matrix, which is based on the rate of divergence between species. BLOSUM, an alternative to PAM, is based on the conservation of domains in proteins.

Let's first discuss PAMs.

Accepted Point Mutations (PAM)

In 1978, Dayhoff and her group came up with the Accepted Point Mutations (pronounced PAM since it's easier to pronounce than APM). In short, a PAM is the muation of an amino acid residue that is accepted by natural selection.

To see which amino acids are accepted in protein evolution, Dayhoff et al. examined 1572 changes in 71 groups of closely related proteins (shared at least 85% identity), and observed all amino acid substitutions. Thus, this experiment was based solely on observation among closely related species.

Relative Mutability

The Dayhoff group calculated the relative mutability ( R ij ) per amino acid.

R ij = M ij / f i

Here, M ij is the probability of residue j changing to i in a given evolutionary interval. The denominator f i represents the frequency of residue i occurring by chance.

Thus, the mutability index is an odds ratio, which is an indicator of how much more authentic the mutation is than it occuring by chance.

With this, Dayhoff et al., generated a table listing the relative mutabilities of amino acids, normalized to integers:

Asn 134 Ser 120 Asp 106 ... ... Leu 40 Cys 20 Trp 18

They found that Asn, Ser, and Asp were most likely to mutate, while Leu, Cys and Trp were least likely.

Normalized frequencies of AA

Dayhoff also tallied up the frequencies of AA's across all proteins. If all amino acids were equally probable in protein sequences, the would all have frequencies of 1.00/20 = 0.05 , but they don't.

Gly 0.089 Ala 0.087 Leu 0.085 ... ... Tyr 0.030 Met 0.015 Trp 0.010

Mutation rates vary

As you can see, mutation rates vary across the amino acids, and Dayhoff found an empirical way to normalize for variations in AA composition and mutation rate. From Dayhoff's experiment, we can see the characteristics of mutable and less mutable residues.

More mutable residues:

are much easier to replace. Asparagine is part of the uncharged polar amino acids group, which contains six other amino acids with similar properties

Less mutable residues:

serve important functions. If it has a particular charge, the protein will most likely need that charge to stay there.

are difficult to replace. Take a look at Tryptophan's unique structure below and you'll see why it'd be so difficult to replace it!

Tryptophan structure

PAM matrices

With these results, Dayhoff built a 20 x 20 mutation probability matrix that tallied a score per each amino acid change. This was known as the PAM1 matrix, which showed the probabilities of proteins undergoing 1% change (1 accepted point mutation per 100 amino acid residues). The PAM1 matrix is based on the alignment of protein sequences that shared at least 85% identity.

PAM1 mutation probabilty matrix from the Dayhoff group. All values multipled by 10,000.

One important assumption made in the generation of PAM matrices are that each change in amino acid is assumed to be independent of previous mutational events at that site. This type of assumption shows that PAM is a Markov model .

So why is this assumption important? From this, we can generate more PAM matrices to be used for sequences that are separated by longer periods of evolutionary history. For example, a PAM250 matrix is simply a PAM1 matrix multipled by itself 250 times. This matrix applies to alignments that share about 20% amino acid identity, which represents about 2500 million years of evolution.

A PAM250 matrix. Each column has been adjusted so that the columns sum to 100.

Probabilty to Log-Odds Scoring

When BLAST scores an alignment, it doesn't use the probability matrix as seen above. It converts the matrix elements into integer numbers and produces a log-odds scoring matrix.

s i,j = 10 * log (q i,j / p i )

Here, s i,j is score for aligning any two residues. Simply put, if the value is high, that means it aligns well. The q i,j is the amount a certain residue i would mutate to residue j . The denominator is the probability of the residue mutation occurring by chance.

Let's take the residue M (methionine) and calculate its mutation to L (leucine). Both of amino acids have a hydrophobic side chain, so they should align well - we expect a positive score. We will use the PAM250 Mutation Matrix

s M,M = 10 * log (q M,M / p M )

s M,M = 10 * log (0.06 / 0.015)

s M,M = 6

This is the value found in the log-odds matrix for PAM250. With this matrix, we can now score our alignments! But wait - what about gaps? Let's see how to handle those in the next page!

References

Dayhoff et al. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, 345–352. National Biomedical Research Foundation, Silver Spring, MD, 1978.

Handling Gap penalties

We have already discussed amino acid substitutions, but what about the other two types mutations, insertions and deletions? Together, these insertions and deletions are known as gaps , and its scoring system varies.

Gap penalties for local vs. global alignment

Handling gap penalties vary depending on what you're looking for. If you are searching for similar domains, you shouldn't penalize much for any gaps that occur at the ends of your sequences (local alignment). However, if you need a more exact match (global alignment), then you would probably want any sequences that have long end gaps.

Opening and Extension gap penalties

Gap penalties may be broken down into two parts:

An opening of a gap. The extension of a gap.

An opening gap penalty is applied at the appearance of any gaps. For gaps that are longer than one residue, we can apply an extension gap penalty , in which penalization occurs for the addition of each residue length.

With this, there are several methodologies we may apply to a gap.

Types of gap penalties

Constant

This is when there are no opening gap penalties, and a fixed negative score is given to every gap.

AB--C----DEF AB--C----DEF ABGHCFGHIDEF

Here, we'd have a score of -2n , where n is the score we take for each gap.

Linear

This depends on the gap length, so a penalty score is assigned for every gap residue.

AB--C----DEF AB--C----DEF ABGHCFGHIDEF

Here, we'd have a score of -6l , where l is the score we take for residue gap.

Affine

The affine type takes into account the gap opening penalty, as well as each length. This means that on top of a linear penalty type, there is another penalty score added that stands for gap opening.

AB--C----DEF AB--C----DEF ABGHCFGHIDEF

Here, we would have two opening gap penalties of -2n and residue gaps of -6l where n is the penalty per opening gap and l is the penalty per residue in each gap.

BLAST uses the affine type by default. The opening gap penalty is -11, while each additional residue gap is -1. You may change these settings in the "Algorithm parameters" section, just below the BLAST button.

Changing gap penalties in BLAST

One thing to notice here - what is the default scoring matrix set on? It's not any PAM matrix, but it's rather BLOSUM62. Let's see how BLOSUMs are constructed, and the difference between them and PAMs in the next lesson.

BLOSUM - BLOcks SUbstitution Matrix

An alternative to the PAM matrix is BLOSUM (BLocks SUbstitution Matrix) , which was derived by Henikoff and Henikoff in 1992. NCBI uses BLOSUM62 as its the default matrix for protein BLAST.

Derivation of BLOSUM matrices

BLOSUM matrices are derived from comparisons of blocks of sequences from the Blocks database.

What are blocks and what is the blocks database?

A block is an ungapped multiple alignments of highly conserved, short regions. Here is what a sample block looks like:

Conserved blocks in alignment

The blocks database contains multiple alignments of conserved regions in protein families.

Derivation a BLOSUM matrix

The Henikoffs developed a database of "blocks" based on sequences with shared motifs. More than 2,000 blocks of aligned sequence segments were analyzed from more than 500 groups of related proteins. Within each block, they counted the relative frequencies of amino acids and their substitution probabilities

Why did they use Blocks?

The Henikoffs used blocks due to several reasons:

Need to have multiple alignments and it's easier to align with similar sequences.

They didn't want to complicate calculations with insertions/deletions.

Wanted to focus on conserved regions for computing the scoring matrix.

What does a BLOSUM matrix tell us?

A BLOSUM tells us the likelihood of occurrence of each pairwise substitution, and we can use these values to score a pairwise comparison.

Each scoring matrix is constructed based on how identical the ungapped multiple sequence alignments are. For example, BLOSUM62 is derived from blocks containing at most 62% identity in the ungapped sequence aligments.

Calculating a BLOSUM

Here we'll show you how to calculate a BLOSUM.

Removing redundancies

Before we start constructing a matrix BLOSUM r, we have to eliminate the sequences that are more than r% identical. This solves us from the bias we get from databases over-representing certain classes of proteins. To do this, we have two options:

Remove sequences from the block. Replace the similar sequences with a new sequence that represents the cluster.

ACD DCE DCE DCE BCE BCD ACB

Since most databases today have an over-representation of proteins, the extraneous DCE sequences should be eliminated in order to make our database more representative.

Calculating possible pairwise combinations

Thus, after elminating redundancies, we look at the first vertical column in our block:

A D B B A

Let's find out how many possible pairwise combinations we can see for each possible pair.

For the AA pair, we have 2 possible combinations, for AB or BA we have 4. For AD we have 2. We continue these calculations until the occurrence of all possible pairs are found.

Pair Column 1 score Column 2 score Column 3 score Total AA 1 0 0 1 AB or BA 4 0 0 4 AD or DA 2 0 0 2 BB 1 0 0 1 BD or DB 2 0 2 4 CC 0 10 0 10 DD 0 0 1 1 DE or ED 0 0 4 4 EE 0 0 1 1

Note that the total sum is 26, which we can use to normalize our matrix.

A B C D E A 1 B 4 1 C 0 0 10 D 2 4 0 1 E 0 0 0 4 1

Calculate scores

To obtain integer values for our scoring matrix, we need to find the score per cell. We can do this with the following equation:

s ij = log 2 (q ij /e ij )

Where q ij is observed frequency and e ij is the expected frequency.

q ij = c ij / T

c ij is the cell value as calculated above.

To calculate the Total T :

T = w * n(n-1) / 2

Where w is the number of columns and n is the number of sequences. With T , we can calculate q ij , which is the rate of change of residue i to residue j .

In our case, T = 30 , so let's divide all our cells by 30 .

A B C D E A 0.0333 B 0.133 0.0333 C 0 0 0.333 D 0.0667 0.133 0 0.0333 E 0 0 0 0.133 0.0333

Now p i can be found with the following equation:

p i = q ii + ∑(q ij /2)

p A = ( 1 + 6/2 ) / 30 = 0.133

p B = ( 1 + 8/2 ) / 30 = 0.167

p C = 10 / 30

p D = ( 1 + 10/2 ) /30 = 0.200

p E = ( 1 + 4/2 ) / 30 = 0.0133



The expected frequencies:

e ii = p i 2

e ij = 2p i p j (i ≠ j)

A B C D E A 0.0178 B 0.0444 0.0278 C ? ? 0.111 D 0.0533 0.0667 ? 0.04 E ? ? ? 0.04 0.01

Notice how I didn't calculate cell values that had a value of 0 - you'll see that we don't need these values in the actual scoring matrix.

Now we have all we need! Just plug in values from the two matrices above into the equation below to obtain our scoring matrix.

s ij = log 2 (q ij /e ij )

To obtain scores, we multiple s ij by two and round.

s ij = round (2 * log 2 (q ij /e ij ))

A B C D E A 0.9 B 1.58 0.0278 C 0 0 0.111 D 0.0533 0.0667 0 0.04 E 0 0 0 0.04 0.01

References

BLOSUM vs. PAM

As we have seen, there are two different log-odds scoring matrices available. When performing pairwise sequence alignments, you must specify the exact matrix to use - this depends on the suspected degree of identity between the query and its matches.

Let's do a quick summary of these two matrices and see how they differ.

PAM matrices summary

PAM (Accepted Point Mutations) matrices are:

Extrapolated using comparisons among closely related proteins based on an evolutionary model.

Based on data from alignments of closely related protein families.

Because they use the Markov model, the substitution probabilties for highly related proteins can be extrapolated to probabilties of distantly related proteins.

Some limitations of PAM matrices include the somewhat inaccuracy of the assumption model and each position being context dependent.

BLOSUM summary

Based on empirical observations of more distantly related protein alignments.

Not based on an evolutionary model - just empirically derived.

Sequence comparisons cover a broad range of divergences rather than just one subfamily of proteins.

Based on observed alignments from conserved regions of multiple sequence alignments (blocks). These blocks are from large protein databases, which give a large sampling site.

Limitations are that you are limited to a subset of conserved domains.

Global Alignment Needleman-Wunsch

The purpose of global alignment (aka optimal matching algorithm) is to align two sequences from start to end, and make as many matches as possible. Algorithms do this by inserting gaps within the letters of each sequence to maximize the number of matches. This technique is often used when comparing sequences of similar length.

Algorithm

The algorithm for global alignment was first published by Saul Needleman and Christian Wunsch (1970), then modified over the next few years by other scientists. It aligns sequences from beginning to end and finds the best alignment that maximizes the overall score. The score is calculated based on matches, mismatches, and gaps.

Dynamic Programming

The algorithm is based on an algorithmic technique known as dynamic programming , where the optimal alignment is reduced into a series of smaller problem. The solution to these smaller problems are then used to calculate the solution to the larger (or overall) problem.

Example

The best way to go through the algorithm is to show an example, so let's compare two sequences - IDENTITY and SIMILARITY.

1) Set up a 2d matrix

The first step is to set up a 2d matrix with the first sequence on the y-axis, and the second sequence on the x-axis.

2) Decide on a scoring system

Now we need to fill this grid with numerical values, which we will later use to score the matches. Let's come up with a formula to find the score per cell, which we can do with pre-determined values.

In the first option, if a match is present (cell will be colored green), s(x i , y i ) will be given a score of +1. If it is a mismatch, then the score will be -2. In the second option, a gap will be inserted in the first sequence. Thus, we take the cell left of the current cell and -2 from it. In the second option, a gap will be inserted in the second sequence. We take the cell on top of the current cell and -2 from it.

The maximum of these values will be taken and placed as the current cell's score.

3) Fill out the primary values

Start with filling a 0 at the top-left corner. Each cell along the row will be -2 since the only value it looks at is the cell to the left. Each cell along the first column will be -2 from its top, since that's the only value it has to look.

After coloring each matching letter green, you will have a table that looks like this:

4) Fill the rest of the table

We can see that for green cells (match), we take the maximum value of the left, top-left (diagonal), or top, and +1 to it. For any mismatches or gap insertions, we -2 from the largest value of the left, top-left (diagonal) and top cells.

A completed matrix gives the final score of the alignment on the bottom-right most cell. In this case, it's -7.

5) Trace the optimal path

Before we find trace the optimal path, let's go over some conventions.

Our purpose is to find the optimal path from the top left going to the bottom right. A diagonal line denotes that the letter is aligned (can be match or mismatch). A perfect match will have a diagonal going from top left to bottom right. Mismatches can still generate this diagonal, but their scores will be different.

Horizontal and vertical lines denote the presence of a gap in the first and second sequence, respectively. Gaps may be of any length, and may occur within (internal) or at the ends (terminal) of sequences.

For our example, we start from the bottom-right cell, then move our way up to the top-left.

With this, we get the alignment:

There are other alignments that give the same score. Can you trace them out?

6) Varying parameters

Depending on what kind of alignment you're looking for, you can vary the parameters. For example, some sequence alignment algorithms vary gap penalties in two subcategories. An opening gap would be given a more severe penalty than each continuing gap . In our example case, all gaps were treated equally with a penalty of -2.

Adding a harsher mismatch penalty can force more gaps to be created. However, this is non-ideal as it's good to have near-identical sequences with few deletions.

Additionally, global alignments usually use an end gap penalty. This is the distinguishing factor between global and local alignment, which we'll see next.

Local Alignment Smith-Waterman

Local alignments like global alignments, but they generate "islands" of areas that have the greatest similarity. This is helpful when the query and sequence are dissimilar, but are suspected to contain domains or small regions of similarity. The BLAST algorithm uses local alignment.

Local alignments differ from global alignments in a few ways:

No penalty for starting at an internal position.

Does not necessarily extend to ends of sequences.

No negative values on the matrix are allowed - zeros are used instead.

The power of end gap penalties

The Smith-Waterman algorithm is very much like the Needleman-Wunsch algorithm used in global alignments - the hallmark difference is in the scoring methodology.

Unlike global alignment, local alignments have no end gap penalties, allowing small interior alignments to rank higher when scored.

Let's take a quick look at the effects of end gap penalties. The following sequence is aligned globally, with high end gap penalties.

M - N A L S D R T M G S D R T T E T 6 -12 1 0 -3 1 0 -1 3 = -5

Now in this next sequence, we have a local alignment. Notice how the small region in the middle aligns quite nicely.

M N A L S D R T - - - - - M G S D R T T E T 0 0 -1 -4 2 4 6 3 0 0 0 = 10

Without the end gap penalty, the Smith-Waterman alignment algorithm is able to find the best locally matching sequence.

Example

Let's compare two sequences - CGTTCTA and AACGTTGG .

1) Set up a 2d matrix

Set up a 2d matrix, as we did earlier in the Needleman-Wunsch example.

Setting up our matrix.

2) Decide on a scoring system

We need separate scores for matches, mismatches and gaps.

Smith Waterman Formula

Any cell that would have a negative value are given 0 instead.

3) Fill out primary values

We want to start with the first row and column and gives those a value of 0 . Then we want to mark the cells that indicates a match.

Setting up our matrix

4) Fill out rest of table

Now we fill the rest of our table out. Make sure to keep track of where each cell value came from, as we need this to trace back our optimal alignment.

Filling out rest of the values.

Note that a mismatch or a match can only come from the cell diagonally up to the left of the current cell. Additionally, gaps may only come from the top or left of the current cell.

5) Trace the optimal path

Now all we need to do is retrace our steps. First, find the cell with the highest score.

Tracing back the optimal alignment.

Now we trace back until we get to a cell with 0 . Thus, our optimal local alignment becomes:

--CGTTCTA AACGTTGG-

Conclusion: Global vs. Local alignments

Thus, we may say that for global alignments, where the sequences are connected along the entire length of their sequences, there is a higher % identity with many small interior gaps. For local alignments, which focus on the best matching regions, there is a lower % identity, but fewer interior gaps and longer end gaps.