Pairwise Sequence Alignment

This tutorial discusses computing pairwise optimal string alignments.  Sequence alignment and scoring is important to modern biological research.  Genomic areas such as gene identification, annotation, and the identification of gene regulation depend on the availability of accurate and efficient sequence alignment algorithms. Sequence alignment algorithms attempt to capture gene mutation, selection, and drift.  According to standard (simplified) models of gene mutation, there are more than 10700 different alignments of between two 103 character sequences. [2]  Fortunately, dynamic programming techniques allow both the computation and space complexity to be reduced to 106 for our two 103 sequences and in general to O(mn), where m and n are the respective lengths of the sequences.  In practice certain techniques and heuristics reduce the computation and space complexity even further (without guaranteeing correctness).  String alignment with dynamic programming that guarantees the computation of an optimal solution but is a stochastic process and there may be multiple equally strong alignments.

A dynamic visualization of the alignment process is available here.

Alignment Computation Visualization

Similarity Algorithms 

Pairwise alignment is most often described in terms of edit distance or similarity.  These are opposite and interchangeable notions.  In edit distance the score increases with the number of changes required to transform one sequence into the other, so a smaller score indicates stronger similarity between the sequences. Similarity follows the idea that the more matched characters between two strings (without changing the order) the larger the positive value for an alignment score.  Here we focus on similarity algorithms.  In the most straightforward similarity scoring model, LCS , each match between aligned string characters contributes 1 to the final similarity score and there is no penalty for inserted spaces or mismatches, e.g. S1 ATTG and S2 CTTG would have a final score V(n, m) = 3.

Global Alignment (Needleman-Wunch)

Definition: V(i, j) is defined as the value of the optimal alignment of prefixes S1[1..i] and S2[1..j]. [1] 

The algorithm progresses by aligning increasing length prefixes.  In order to remember prefix alignments we keep scores in matrix format. 

We use notation M(X, X) as a valid matrix cell position M(0, 0) to M(m, n),  n = length of S1 and m = length of S2.

We say that scoring function s(S1(i), S2(j)) equals something positive, a reward, if S1(i) and S2(j) are the same non-gapped (not '_') character and something non-positive, a punishment, otherwise.
Scoring parameters may be table driven, for example when considering Watson-Crick pairings a G-G match may be worth more than an A-A match or an A-C mismatch could be penalized more than a A-T mismatch. Setting scoring parameters is a focus of ongoing research. There is an interaction between laboratory observations and parameter tuning that draws from diverse fields such as biology, computer science, statistics, mathematics, and systems modeling. Common amino acid scoring matrix values as in PAM and BLOSUM depend on evolutionary distance. [1]  In our alignment discussion scoring parameters will be constant integer values, e.g. match = 2, mismatch = -1, and indel = -2.

Scoring

Example: S1 =  GAATTCAGTTA, S2 = GGATCGA, match = 2, mismatch = -1, indel = -2

An Optimal Alignment:
G A A T T C A G T T A
G G A T _ CG _  _ A 

Note that in global alignments both sequences are found in the final alignment in their entirety.


Initialization Base Case:   

Global Similarity Recurrence:


             V(i, j) = maximum {  V(i - 1, j - 1) + s(S1(i), S2(j)),
                                                V(i - 1, j) + s(S1(i), _),
                                                V(i, j - 1) + s(_, S2(j)) }

The intuition behind this recurrence is as follows.  When comparing two sequences S1 and S2, consider the next characters to be aligned S1[i + 1] and S2[j + 1] respectively, so that S1 is composed of a prefix S1[1..i] followed by  S1[i + 1] and S2 is composed of a prefix S2[1..j] followed by S2[j + 1] . If the characters S1[i + 1] and S2[j + 1] are the same, then it never hurts to match them and continue matching the two sequences.  If the characters are not the same, then it is not possible for both characters to be matched in the alignment.  Assuming that one of the characters will not be matched, V(i, j) is equal to the maximum value calculated for the three cases: where S1[i + 1] and S2[j + 1] are aligned but not matched, or a space '_' is inserted between S1[i] and S1[i + 1] and the character S1[i + 1] (now S1[i + 2]) is available to be matched in S2, or a space is inserted between S2[j] and S2[j + 1]  leaving S2[j + 1] (now S2[j + 2]) is available to be matched against S1. Since only one of these cases is possible for a given value V(i, j) we don't have to worry about aligning a space against a space, as mentioned, only one of these characters may be matched if a space is inserted.

Traceback:

The global alignment is computed by following the traceback pointers from M(m, n) to M(0, 0).


Longest Common Subsequence

One important case of global similarity is the longest common subsequence (LCS) problem. In Computer Science a subsequence is informally described as a subset of characters in a sequence arranged in their original order. So ATTG, AT, AG, and ATG are examples of subsequences of ATTTG but GT and TAG are not. The subsequences need not be contiguous characters in the original string while in a substring they must. In defining scoring parameters the LCS model of similarity guarantees a maximal value of alignment in terms of the number of characters that are aligned between the two sequences.

Scoring Parameters:

In LCS the scoring parameters are defined as:
match = 1
mismatch = 0
indel = 0

The algorithm for LCS is possibly the simplest variant of the Needlman-Wunch global alignment. The simple integer values used for LCS scoring make V(i, j) simple and intuitive. The value V(n, m) will equal the value of M(m, n) which will also be equal to the number of characters in the longest common subsequence.

Example: S1 =  GAATTCAGTTA, S2 = GGATCGA, match = 1, mismatch = 0, indel = 0

An Optimal Alignment:
G A A T T C A G T T A
G G A  _ T  _C  _ G _ _ A

V(n, m) = 6, M(m, n) = 6, LCS = GATCGA

In LCS, a global alignment, both sequences are found in the alignment in their entirety.


End-space Free Variant

The ends-space free variant of similarity computation is a way to remove the penalty for mismatched overlapping prefixes and/or suffixes of sequences S1 or S2. It is not a global alignment as the mismatched overlapping ends will not appear in the final alignment.  This algorithm is useful for identifying a region in a long sequence that is the similar to a shorter sequence.  For example, say that a long sequence is known to include a gene homologue.  Then the ends-space free similarity computation will identify the region of greatest similarity while not penalizing for the unrelated prefix and suffix.  Ends-space free similarity alignment is a first step towards local alignment.

Initialization base case:

 V(X, 0) = V(0, X) = 0

Traceback:

The global alignment is computed by following traceback pointers from MAX [M(X, n), M(m, X)] to M(0, 0). 

The prefix penalty is removed by setting M(0, X) and M(X, 0) = 0 in the dynamic programming matrix fill initialization. The suffix penalty is removed by starting traceback at the maximum value in row n or column m.

Example: S1 =  GAATTCAGTTA, S2 = GGATCGA, match = 2, mismatch = -1, indel = -2

An Optimal Alignment:
G A A T T CA
G G A T _ C G A

In end-space free variant either suffix and/or prefix may be disregarded in the final alignment without penalty. In the example note that suffix S1[8..n] = GTTA  is not found in the alignment. 


Local Alignment (Smith-Waterman)

Local alignment seeks to find a region , perhaps in the interior, of two sequences that have high similarity. The global alignment models described above compute maximal alignment over the entire sequence length. The second special similarity case, end-space free variant does not penalize contiguous prefix or suffix mismatches in S1 or S2 when computing V(i, j). Local alignment algorithm develops this theme further by not penalizing dissimilar contiguous sequences in both prefixes and both suffixes of S1 and S2.  Local alignment prevents the loss of meaningful alignment data in the case where a short sequence is aligned against a larger sequence (as in a genomic database search). An LCS type global alignment would likely give a V(n, m) equal to the length of the shorter sequence but no meaningful inferences could be gained.  In practice numerous exact matching and heuristics are used in popular alignment tools such as FASTA and BLAST.

Initialization base case:

V(X, 0) = V(0, X) = 0 

Local Alignment Recurrence:


             V(i, j) = maximum {  0,
                                               V(i - 1, j - 1) + s(S1(i), S2(j)),
                                               V(i - 1, j) + s(S1(i), _),
                                               V(i, j - 1) + s(_, S2(j)) }

 

Traceback:

The local alignment is computed by following traceback pointers from MAX [M(X, X)] to V(i, j) = 0.

By starting traceback at the MAX [M(X, X)] value dissimilar suffixes in both sequences are disregarded. The prefixes of both sequences are not penalized by, first, allowing 0 to be an option during the calculation V(i, j) and, second, by terminating the alignment when a valid traceback path encounters the value 0.

Example: S1 = TGAATTCAGTTA, S2 = CGGATCGA, match = 2, mismatch = -4, indel = -1

 A Local Alignment:
G A A T T C A G
G A _  T _ CG

In local alignment both suffix(es) and/or prefix(es) may be disregarded in the final alignment without penalty. In the example note that prefixes S1[1] = T, S2[1..2] = CG; and suffixes S1[9..n] = TTA, S2[m] = A are not found in the final alignment.


Alignment with Gaps

 Alignment with gaps (contiguous spaces '_') allows us to assign an incremental gap extension cost.  In genomic applications especially, it has long been held that one long gap is roughly as likely as another. State of the art gap scoring assigns a gap cost that is proportional to the log of the gap length. Thus, starting a gap is much more expensive than extending it.  Algorithms for alignment with gaps are computationally more expensive that algorithms for alignments without gaps.  To extend alignment algorithms to consider gap extension, the straightforward approach is to consider all gap lengths, instead of the single character gap considered by algorithms for alignments without gaps.  The resulting algorithm considers O(m+n) possible scores for each matrix position, resulting in a O(n3) time complexity. Faster algorithms are known when gap cost have special properties.

Initialization Base Case:

The initialization may be V(X, 0) = V(0, X) = 0 if we free the prefix end space or may be dependant on a given weighting system such that:

    V(i, 0) = w(i), V(0, j) = w(j), E(i, 0) = w(i), F(0, j) = w(j), and G(0, 0) = 0

Recurrence for Arbitrary Gap Weights:

The recurrence is significantly different from the other similarity models encountered. V(i, j) is still defined as the value of the optimal alignment of prefixes S1[1..i] and S2[1..j], but the algorithm values E(i, j) and F(i, j) are looking for a maximal value in column or row that would indicate the start of a gap in S1 or S2 (should E or F be larger than the value of the diagonal plus the match/mismatch value).

Traceback:

Traceback proceeds as with the other similarity cases. Its starting cell is not defined as in local alignment or the ends-space free variant. Traceback may proceed from M(m, n) to M(0, 0), or if it is desired to free one or both sequence ends traceback may begin at MAX (M(X, m), M(n, X)) and terminate as in ends-free or local cases.

Example: S1 =  ZXXABCDYYE, S2 = XXDEFQYY, match = 1, mismatch = -1, indel = -1 (global), w = 1 (Gaps)

An Optimal Alignment using Needleman-Wunch (global alignment):
Z X X A B C D Y Y E
_ X X D E  F Q Y Y _

An Optimal Alignment computed with Gaps using constant gap scoring (w = 1):
Z X X A B C D _  _ _ Y Y E
_ X X _  _ _  D E F Q Y Y _


References:
[1] Gusfield, Dan. Algorithms on String, Trees, and Sequences. Cambridge University Press 1997
[2] Waterman, Michael. Introduction to Computational Biology. Chapman & Hall 1995
[3] Corman, Thomas et al. Introduction to Algorithms 2ed. MIT Press 2001