Alignment: Metric Quantifying Multiple
Sequence Alignment
Ken D. Nguyen and Yi Pan
Department of Computer Science
Georgia State University
PO Box 3994
Atlanta, Georgia 30303
Email: knguyen@cs.gsu.edu and pan@cs.gsu.edu
Abstract-Aligning multiple homologous protein sequences (MSA) helps biologists identify the relationships between species and possibly predict the structure and functionality of the protein. However, optimally aligning multiple sequences was proven to be intractable by Wang and Jiang [1]. For the last two decades, researchers have taken different heuristic approaches to this problem without a consistent and reliable scoring method. In this paper, we develop a scoring metric, Hierarchical Expected matching Probability (HEP), that measures the probability of residue mutations and the biological correctness of MSA results. Both theoretical and manually selected test sequences show that our quantitative metric is more reliable, consistent, and biologically meaningful than many commonly used scoring metrics.

I. INTRODUCTION

In a protein sequence, conservation regions (i.e., sequence motifs) are groups of biologically significant amino acids that are conserved throughout evolution. Correctly detecting the sequence motifs helps identify the evolutionary relationships between species or predict the structure and functionality of a protein. This task is critical since proteins play various important roles in the body, such as antibodies and enzymes. A protein may be associated with a particular disease; therefore, understanding the structure and functionality of the protein helps in finding a cure for that disease. The most common way to detect sequence motifs is to arrange homologous sequences, that is, to perform multiple sequence alignment (MSA), so that similar conserved regions of these sequences are aligned into the same columns. However, optimally aligning the conserved regions across all the sequences was proven to be intractable by Wang and Jiang [1].

When aligning the sequences, a gap (-) can be inserted into the original sequences to maximize the number of columns of similar residue types. The MSA problem can be generalized as follows: let $W = \{a_1, \ldots, a_i, \ldots, a_n\}$, $i \geq 1$, be a set of amino acid symbols. A sequence $s_i$ is a string of $n$ amino acid symbols. An MSA is a transformation of $k$ sequences $\{s_1, s_2, \ldots, s_k\}$ into $\{s'_1, s'_2, \ldots, s'_k\}$, where $s'_i$ is $s_i$ with GAP (-) symbols inserted so as to maximize the similar regions across the sequences. In general, when a gap is inserted into a sequence, a penalty is applied to the alignment. The weight of the penalty depends on the location of the insertion. For example, let the sequences to be aligned be $s_1$: GLISVT and $s_2$: GIVT; then a possible alignment is

    A(s_1, s_2) =  s'_1: G L I S V T
                   s'_2: G - I - V T

To align two sequences, dynamic programming [2], [3] gives the optimal alignment in $O(n^2)$ time, where $n$ is the length of the sequences. The run time and space of dynamic programming grow exponentially with the number of sequences to be aligned. Commonly, the MSA problem is solved by heuristic methods [4]-[11] or by probabilistic methods [12], [13]. The general approaches often fall into two categories: global alignment and local alignment. In global alignment, overall score optimization is the central force that makes an alignment span the entire length of the sequences. Local alignment focuses on aligning small regions of the sequences that are believed to be conserved and to have a significant biological meaning, and this technique seldom yields an optimal alignment. Since each method is somewhat different, each has its own scoring scheme to maximize its matching and alignment scores. These scoring schemes are biased, inconsistent, and often do not reflect the biological meaning of each conservation region of the sequence.

In all MSA algorithms, the basic step of assembling the result is comparing the goodness of matches between two or more residues in a column. The placement of a residue into a column relies on this measurement, and the correctness of the MSA result depends heavily on how reliable the quantifying function is. Therefore, developing a scoring metric for the MSA problem is as significant as aligning multiple sequences. This is the objective of our work.

This paper is organized as follows: we analyze various existing scoring methods; we propose a quantitative measurement method, called the Hierarchical Expected matching Probability (HEP) scoring metric; and we present a performance analysis of this new method, followed by the conclusion and possible enhancements of the method.

II. EXISTING SCORING METHODS

In 2001, W. S. J. Valdar [14] comprehensively analyzed 18 different MSA scoring methods and categorized them into 6 groups based on their scoring characteristics and the information they utilize while scoring. These groups are symbol frequency,
and deletions,
5) Sequence weighting: accounting for sequences with different weights, and
6) Simplicity: easy to implement and use.

The summary of these 18 scoring methods and their performance on the test data is shown in Table II, where the check-mark (✓) represents the capability to rank the column correctly, and the numbers in the last column represent the mentioned requirements that are satisfied.

Among these 18 analyzed scoring methods, the sum-of-pair (SP) method is the most commonly used. The sum-of-pair scoring method for aligning $n$ sequences of maximum length $n$ is defined as:

$$S(x) = \sum_{i}^{n} \sum_{j > i}^{n} S(s_i(x), s_j(x)), \quad (1)$$

where $s_i(x)$ is the amino acid at column $x$ in the $i$th sequence, and $S(a, b)$ is the score for aligning two amino acids $a$ and $b$. However, defining an adequate gap penalty for this method is difficult. For instance, aligning a segment of sequence $s_i = \ldots MI \ldots$ with the pre-aligned sequences $A = \{s_j = \ldots LLM \ldots,\ s_k = \ldots IMI \ldots\}$ yields more than one alignment. If the BLOSUM62 substitution matrix [15] and a gap penalty of -1 are used to score the alignments, one alignment $A(s_i, s_j, s_k)$ yields a score of -5, which is higher than a second alignment of the same sequences, which yields a score of -8. If, instead, the penalty score is chosen to be -10, both of these alignments yield the same score. In addition, the SP scoring method calculates all-pair scores of residues in a column at every step of the alignment, which increases the run time of an MSA algorithm by $O(n^2)$ for aligning $n$ sequences of length $n$. Because it is simple, and despite its inconsistency and slow-down factor, SP combined with biological and probabilistic weight factors is the most commonly used scoring method in MSA tools.

Valdar and Thornton [14] attempted to resolve the limitations of existing scoring methods by proposing a conservation

where $d(s_i, s_j)$ is the distance between sequences $s_i$ and $s_j$ and is calculated as

$$d(s_i, s_j) = 1 - \frac{1}{n \times align_{ij}} \sum_{x \in align_{ij}} M(s_i(x), s_j(x)), \quad (5)$$

where $align_{ij}$ is the set of all positions that manifest an amino acid in one or both of $s_i$ and $s_j$, and $n \times align_{ij}$ is the size of this set.

Matrix $M$ is a linear transformation of a substitution matrix $m$ such that all values in $M$ are in the range [0, 1]. $M$ is defined as:

$$M(a, b) = \begin{cases} \dfrac{m(a, b) - \min(m)}{\max(m) - \min(m)}, & \text{if } a \neq \text{gap and } b \neq \text{gap} \\ 0, & \text{otherwise} \end{cases} \quad (6)$$

Valdar and Thornton's method correctly ranks columns (j) and (k) in the test sequences shown in Table III; however, this method inherits the inconsistency found in the sum-of-pair score.

An attempt to generalize a scoring method that quantifies both residue conservation and divergence, called Trident [14], is as follows: for a position $x$, the conservation score is calculated as:

$$S(x) = (1 - t(x))^{\alpha} (1 - r(x))^{\beta} (1 - g(x))^{\gamma}, \quad (7)$$

where $t$ is a function of normalized symbol diversity relating to Shannon's entropy, $r$ is a function of residue diversity, $g$ is a function of gap cost, and $\alpha$, $\beta$, and $\gamma$ are variables. The Trident method is far more complex than other methods because of the problem of selecting appropriate values for the variables and the complexity of the three embedded functions.

III. HIERARCHICAL EXPECTED MATCHING PROBABILITY SCORING METRIC (HEP)

To align $k$ sequences of length $n$ optimally, we explore $k$ dimensions of solution space, obtaining the optimal solution in $O(n^k)$ run time. Fundamentally, the probability of aligning a residue $a$ from sequence $s_i$ to a residue $b$ from sequence $s_j$ is $p(a, b) < 1$. The probability of a residue from each sequence being aligned into one column $k$ is $P(k) = \prod p(a, b)$, with $P(k) \to 0$. This probability decreases exponentially as the number of aligning sequences increases. Therefore, the scoring function
of aligning sequences, maximizing the matching score, but is also about biologically aligning the meaningful "true" conservation regions (motifs) together. This aspect of MSA should be decided by biologists, and a purely mathematical metric, such as the scoring methods in the symbol frequency or symbol entropy groups, is generally insufficient. Furthermore, MSA algorithms almost always assemble their alignment by aligning a few sequences (or a few pairs of sequences) at a time. A scoring metric that gives a score only after the alignment has been assembled is not sufficient. A good scoring metric should be flexible enough that MSA algorithms are able to use it to progressively align and assemble their alignments. We propose a scoring metric, namely Hierarchical Expected matching Probability (HEP), that quantifies residue matching divergence via the amino acid class covering hierarchy (AACH) [11], [16] and residue matching conservation via an inverse random matching probability function. The cardinalities for the AACH can be generated from any given biologically meaningful scoring matrix, substitution matrix, or any set of quantitative values.

Fig. 1. HEP scoring tree generated from BLOSUM62 substitution matrix. The matching cardinality of each amino acid class is shown at the bottom of the corresponding class symbol.

    Motif II
    NVVO:  t i L D L g d a y
    Motif III
    SFV1:  t t L D L t n g f L P Q L

Fig. 2. Matching a leucine from NVVO with a similar amino acid in sequence SFV1. The dotted arrow shows a mismatch, the dashed arrow shows a better match, and the solid arrow shows the best match. Sequences adapted from [19].

A. Quantifying Residue Divergence via AACH Scoring Tree

Scoring matrices and substitution matrices such as PAMs (point accepted mutation [17], [18]) or BLOSUMs (blocks substitution matrix [15]) provide all pairwise matching scores for the amino acids. To quantify the divergence of two or more residue types, we use the average sum of all pairs scores. The details of the algorithm are as follows:

1) Scale the scoring/substitution matrix: the minimum score in a scoring matrix can be negative, zero, or positive. We want the minimum score to start at zero so that when we calculate the probability, we obtain a meaningful value (a probability must lie between 0 and 1). One way to do this is to subtract the minimum score from every score in the matrix. For example, BLOSUM62 has a minimum score of -4 for matching cysteine and glutamate. We subtract -4 from every value in the matrix; thus, the score for matching cysteine and glutamate becomes zero, and all other matching scores are scaled up by 4.
2) Score the AACH leaves: the score of matching a residue to itself (a leaf of the AACH tree) is extracted from the scoring matrix.
3) Score the AACH internal nodes: moving from the leaf nodes (skipping the leaves themselves) to the root of the tree, calculate the sum of all pair-wise scores from the scoring matrix for each internal node and divide the summed score by $\frac{n \times (n-1)}{2}$, since there are $\frac{n \times (n-1)}{2}$ possible pair-wise scores for a node with $n$ distinct residues.
4) Calculate the divergence: a more diverged amino acid class should have a smaller matching score than a less diverged one. As we traverse toward the root of the AACH tree, the divergence increases. Thus, we choose to divide the AACH node scores by their levels to reflect their divergence.
5) Adjust the calculated scores by the root score: matching at the root, where all amino acids match, is biologically irrelevant. Thus, we want the root score to be zero to simplify future calculation. If the root score is not zero, all calculated scores in the AACH tree are adjusted by the root score.
6) Round up the score (optional): round up to the nearest integer.

For example, the score of node c in the AACH tree using the BLOSUM62 matrix is calculated as follows:

$$c = \left\lceil \frac{S(I,V) + S(I,L) + S(I,M) + S(V,L) + S(V,M) + S(L,M)}{3 \times \frac{4(4-1)}{2}} \right\rceil = \left\lceil \frac{7 + 6 + 5 + 5 + 5 + 6}{18} \right\rceil = 2,$$

where $S(a, b)$ is the matching score between two residues $a$ and $b$ from the BLOSUM62 matrix. The pairwise matching scores shown in this example have been scaled up by 4 (subtracting the minimum score of -4 from every score in the matrix).

B. The scoring metric

The score of an alignment is the summation of all column scores in the alignment:

$$Score(A_k) = \sum_{i=1}^{|A_k|} cScore(i) \quad (8)$$
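As an illustration of the node-scoring steps in Section III-A, the following minimal Python sketch recomputes the score of node c for the residue class {I, V, L, M}. The function name `node_score` is hypothetical, the BLOSUM62 entries are hard-coded for these four residues only, and the level of node c is assumed to be 3 (read off the denominator of the worked example). The root-score adjustment of step 5 is omitted because the root score is not given in the text.

```python
import math
from itertools import combinations

# Unscaled BLOSUM62 entries for the four residues of the worked example.
BLOSUM62_SUBSET = {
    frozenset("IV"): 3, frozenset("IL"): 2, frozenset("IM"): 1,
    frozenset("VL"): 1, frozenset("VM"): 1, frozenset("LM"): 2,
}
BLOSUM62_MIN = -4  # minimum BLOSUM62 score, as stated in step 1


def node_score(residues, level, pair_scores=BLOSUM62_SUBSET,
               minimum=BLOSUM62_MIN):
    """Score an internal AACH node (steps 1, 3, 4 and 6; step 5 omitted)."""
    pairs = list(combinations(residues, 2))       # n(n-1)/2 residue pairs
    # Step 1: shift every pairwise score so that the matrix minimum is 0.
    total = sum(pair_scores[frozenset(p)] - minimum for p in pairs)
    # Step 3: average over the pairs; step 4: divide by the node level.
    averaged = total / (level * len(pairs))
    # Step 6: round up to the nearest integer.
    return math.ceil(averaged)


# Node c covers {I, V, L, M}; level 3 is an assumption (see lead-in).
score_c = node_score("IVLM", level=3)
print(score_c)
```

With these inputs the scaled pair scores sum to 34 over a denominator of 18, which rounds up to 2, the value of the worked example.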
    8   D D D E E E L W C
    9   D D D E E F V S R
    10  D E F E F F V S H

Each label column represents a residue position in a multiple sequence alignment. Amino acids are identified by their one letter code. Column (k) comes from an alignment of 10 sequences, whereas column (j) comes from an alignment of only 4 sequences. The column score order is (a) > (b) > (c) > (d) > (e) > (f), (g) > (h) > (i), and (j) > (k).

            I         II          III        IV         V        VI
    NTC0    vlrKp--   amLDGrnay   gVRQGmvl   aYLDDVtv   alGIE-   rVLGagv
    ICD0    eipKp--   vdIDIk-gf   gTPQGgil   rYADDFki   rlDLDi   dFLGfkl
    IAGO    fkkKt--   ieGDIks-f   gVPQGgii   rYADDWlv   elKITl   -FLGvnl
    ICS0    wipKp--   ldADIsk-c   gTPQGgvi   rYADDFvi   emGLEl   nFLGfnv
    IPLO    yipKs--   leADIr-gf   gVPQGgpi   rYADDFvv   srGLVl   dFVGfnf

Fig. 4. The RT OSM sequences. The six motifs of the RT OSM are indicated by roman numerals (I-VI). The bold and capitalized letters represent the core amino acids of each motif. Adapted from [19].
the aligning sequences to guide the alignment toward known sequences. Normalizing and factoring out the redundancy in this case is not useful and may lead to unwanted results. Therefore, the weight factor should come from a biologist or from a function that measures the significance of a residue to its motif rather than the redundancy of similar sequences. If the weight function is known, we can combine this information into our scoring metric. The column score in Equation 9 then becomes $cScore(i) = 0$ if column $i$ only matches at the tree root; otherwise it is the Equation 9 expression multiplied by $w(i)$, where $w(i)$ is the motif weight of the residues in column $i$ and is calculated as

$$w(i) = \begin{cases} 1, & \text{iff residues are from a similar motif and are biologically and locational-order equivalent} \\ \alpha, & \text{iff residues are from a similar motif and are biologically equivalent} \\ \beta, & \text{otherwise} \end{cases} \quad (12)$$

where $0 < \beta < \alpha < 1$.

In the next section, the performance of the HEP scoring metric is evaluated.

IV. EVALUATION OF THE SCORING METRIC

To test our scoring function, we use two sequence sets that have been widely used to validate scoring methods. The theoretical data set is from [14] and the manually selected data set is from [19].

A. Theoretical sequences

Table III includes the theoretical sequences proposed by Valdar [14] for testing the goodness of MSA scoring functions. In the original sequences, column (d) in Table I was considered to have a higher conservation score than column (c) without supporting justification. Even though the chance that aspartic acid (D) will be mutated to glutamic acid (E) is greater than the chance that aspartic acid (D) will change to a phenylalanine (F), the probability for all five residues (either D or E) in column (d) mutating to other amino acids is $\prod^{5} P_{DE} = (P_{DE})^5$ (where $P_{DE}$ is the probability that D is mutated into E), which is much smaller than the probability of residue D in column (c) mutating to residue F. This result follows the decomposition of the BLOSUM62 substitution matrix and its log-odds scoring function from [20]:

$$S(a, b) = \frac{1}{\lambda} \log \frac{P_{ab}}{f_a f_b} \quad (13)$$

where $S(a, b)$ is the score of aligning residues $a$ and $b$, $\lambda = 0.347$ is a constant, and $f_i$ is the background frequency of residue $i$. All amino acid background frequencies are derived from the existing sequences [20]. With the background frequencies $f_D = 0.0539$, $f_E = 0.0539$, and $f_F = 0.0469$ obtained from [20], we found that $P_{DE} = 5.8 \times 10^{-3}$ and $P_{DF} = 9.0 \times 10^{-4}$. Thus, $(P_{DE})^5 = 6.56357 \times 10^{-12} < P_{DF} = 9.0 \times 10^{-4}$. Column (c) is therefore more informative, since it indicates that D is conserved and possibly a core-motif residue across nine out of the ten sequences, while column (d) fails to indicate which residue is the most conserved among these sequences. Therefore, we rearrange these columns in the order of mutation probability.

B. RT-OSM sequences

In 1999, Hudak and McClure [19] selected a set of twenty reverse transcriptase sequences, shown in Figure 4, to test the reliability of seven commonly used MSA tools. These sequences are called order-specific-motifs (OSM) because they contain a set of motifs occurring in a specific order among the sequences.

C. Results

For the theoretical sequence set, we rank the sequence columns using HEP and compare the results with the predefined ranks. For the RT-OSM sequence set, we utilize available MSA tools to align the sequences and then rank the resulting alignments with HEP. Finally, we compare the HEP results with our manual ranking.
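The arithmetic behind the comparison of $(P_{DE})^5$ and $P_{DF}$ in Section IV-A can be retraced by inverting Equation 13 for $P_{ab}$. The sketch below is only an illustration: the helper name `pair_probability` is hypothetical, the logarithm is assumed to be natural, the BLOSUM62 scores S(D,E) = 2 and S(D,F) = -3 are supplied by hand, and the background frequency of F is taken as 0.0469, an assumption chosen because it reproduces the quoted $P_{DF}$.

```python
import math

LAMBDA = 0.347  # scale constant quoted with Equation 13


def pair_probability(score, f_a, f_b, lam=LAMBDA):
    """Invert S = (1/lam) * ln(P_ab / (f_a * f_b)) to recover P_ab."""
    return f_a * f_b * math.exp(lam * score)


# Background frequencies; f_f is an assumption (see lead-in).
f_d, f_e, f_f = 0.0539, 0.0539, 0.0469

p_de = pair_probability(2, f_d, f_e)    # BLOSUM62 S(D, E) = 2
p_df = pair_probability(-3, f_d, f_f)   # BLOSUM62 S(D, F) = -3

print(f"P_DE   = {p_de:.3e}")
print(f"P_DF   = {p_df:.3e}")
print(f"P_DE^5 = {p_de ** 5:.3e}")
print("P_DE^5 < P_DF:", p_de ** 5 < p_df)
```

Under these assumptions $P_{DE}$ comes out near 5.8E-3 and $P_{DF}$ near 8.9E-4, so five simultaneous D-to-E substitutions are far less likely than a single D-to-F substitution, which is the inequality the argument relies on.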
T-COFFEE > PIMA > DCA > DALIGN2. To further confirm these rankings, we utilize the BAliBASE SP [22] (BAliBASE-SP) score and the total column score (BAliBASE-TC), provided to score BAliBASE 3.0 benchmarks, to score the MSA results against the original RT-OSM alignment. The BAliBASE SP is the SP score discussed in previous sections. The BAliBASE TC is the percentage of similarity between the result MSA columns and the benchmark columns. Both of these reference scores give a similar ranking order.

Next, we score these alignment results using the following scoring methods: Sum-of-pair (SP), Entropy21 (see footnote 1), Valdar, Trident [2], [14], [23], HEP-PIMA, HEP-P250, and HEP-BL62 (as in the previous section). These scoring methods are tested using their default parameters. A reliable scoring method should be biased toward the motif columns and give a ranking similar to the manual ranking. These ranking scores are presented in Table V. The sum-of-pair and Valdar methods incorrectly rank T-COFFEE, PIMA, and DCA. The Trident method is not able to rank T-COFFEE, PIMA, DCA, and DALIGN2 correctly. The Entropy21 method only ranks the benchmark and MAFFT2 correctly. Observing the

Footnote 1: Entropy21 is the normalized Shannon's entropy where the residues are classified into 21 types (20 standard amino acid types and a gap type).

Foundation (NSF) under Grants CCF-0514750 and CCF-0646102.

REFERENCES

[1] L. Wang and T. Jiang, "On the complexity of multiple sequence alignment," J Comput Biol, vol. 1, no. 4, pp. 337-48, 1994.
[2] D. Lipman, S. Altschul, and J. Kececioglu, "A Tool for Multiple Sequence Alignment," Proceedings of the National Academy of Sciences, vol. 86, no. 12, pp. 4412-4415, 1989.
[3] S. Needleman and C. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J Mol Biol, vol. 48, no. 3, pp. 443-53, 1970.
[4] D. Feng and R. Doolittle, "Progressive sequence alignment as a prerequisite to correct phylogenetic trees," J Mol Evol, vol. 25, no. 4, pp. 351-60, 1987.
[5] C. Notredame and D. Higgins, "SAGA: sequence alignment by genetic algorithm," Nucleic Acids Research, vol. 24, no. 8, pp. 1515-1524, 1996.
[6] C. Notredame, D. Higgins, and J. Heringa, "T-Coffee: A novel method for fast and accurate multiple sequence alignment," J. Mol. Biol, vol. 302, no. 1, pp. 205-217, 2000.
[7] J. Thompson, D. Higgins, T. Gibson et al., "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Res, vol. 22, no. 22, pp. 4673-4680, 1994.
[8] R. Edgar, "MUSCLE: multiple sequence alignment with high accuracy and high throughput," Nucleic Acids Research, vol. 32, no. 5, pp. 1792-1797, 2004.
[9] J. Stoye, "Multiple sequence alignment with the Divide-and-Conquer method," Gene, vol. 211, no. 2, p. 56, 1998.
TABLE III
THEORETICAL SEQUENCE SET AND CONSERVATION SCORES

                            Columns
    Seq.  (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k)
    1      D   D   D   D   D   D   I   P   D   L   L
    2      D   D   D   D   D   D   I   P   V   L   L
    3      D   D   D   D   D   D   I   P   Y   L   L
    4      D   D   D   D   D   D   I   P   A   L   L
    5      D   D   D   D   D   D   L   W   T       -
    6      D   D   D   E   E   E   L   W   K       -
    7      D   D   D   E   E   E   L   W   P       -
    8      D   D   D   E   E   E   L   W   C       -
    9      D   D   D   E   E   F   V   S   R       -
    10     D   E   F   E   F   F   V   S   H       -

                            Column Scores
    Method    (a)  (b)        (c)        (d)        (e)        (f)        (g)        (h)        (i)         (j)  (k)
    HEP-PIMA  1.0  5.6110E-1  5.5493E-1  4.0856E-2  2.5792E-2  2.0805E-2  9.5824E-3  8.2756E-3  5.3628E-5   1.0  4.1152E-3
    HEP-BL62  1.0  3.0795E-1  3.0794E-1  6.0156E-4  3.0645E-4  3.0092E-4  3.1249E-4  7.0552E-8  1.2000E-12  1.0  1.5242E-4
    HEP-P250  1.0  2.4332E-1  2.4330E-1  1.1891E-4  6.1343E-5  5.9447E-5  2.7911E-7  1.2000E-12 0.0000E+0   1.0  2.0908E-7

Each label column represents a residue position in a multiple sequence alignment. Amino acids are identified by their one letter code and gaps by a dash ("-"). The correct column score order is (a) > (b) > (c) > (d) > (e) > (f), (g) > (h) > (i), and (j) > (k). Note: column (j) comes from an alignment of only 4 sequences (no gaps), and the table cannot show all significant digits.
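The ordering stated in the caption of Table III can be checked mechanically against the tabulated HEP column scores. The following Python sketch is only an illustration: the names `HEP_SCORES`, `EXPECTED_CHAINS`, and `respects_order` are hypothetical, and the numbers are copied from the table as printed, so they carry its rounding.

```python
COLUMNS = "abcdefghijk"

# Column scores copied from Table III, in column order (a) through (k).
HEP_SCORES = {
    "HEP-PIMA": [1.0, 5.6110e-1, 5.5493e-1, 4.0856e-2, 2.5792e-2, 2.0805e-2,
                 9.5824e-3, 8.2756e-3, 5.3628e-5, 1.0, 4.1152e-3],
    "HEP-BL62": [1.0, 3.0795e-1, 3.0794e-1, 6.0156e-4, 3.0645e-4, 3.0092e-4,
                 3.1249e-4, 7.0552e-8, 1.2000e-12, 1.0, 1.5242e-4],
    "HEP-P250": [1.0, 2.4332e-1, 2.4330e-1, 1.1891e-4, 6.1343e-5, 5.9447e-5,
                 2.7911e-7, 1.2000e-12, 0.0, 1.0, 2.0908e-7],
}

# The three chains of the expected order: (a)>...>(f), (g)>(h)>(i), (j)>(k).
EXPECTED_CHAINS = ["abcdef", "ghi", "jk"]


def respects_order(scores, chains=EXPECTED_CHAINS):
    """Return True if the scores strictly decrease along every chain."""
    by_column = dict(zip(COLUMNS, scores))
    return all(
        by_column[left] > by_column[right]
        for chain in chains
        for left, right in zip(chain, chain[1:])
    )


for method, scores in HEP_SCORES.items():
    print(method, respects_order(scores))
```

Each chain is compared only within itself, since the caption does not order columns across chains (for example, (f) against (g)).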
TABLE IV
RANKING OF THE ALIGNMENTS OF THE RT OSM SEQUENCE SET

    Scoring Method  Benchmark    MAFFT2     ClustalW    T-COFFEE    DCA      PIMA     DALIGN2   Reliable
    Manual          1.0       >  0.983   >  0.817       0.799    >  0.741    0.717    0.633
    BAliBASE-SP     1.0       >  .954    >  .864     >  .854     >  .847     .688     .620
    BAliBASE-TC     1.0       >  .720    >  .660     >  .630     >  .550     .380     .270
    HEP-PIMA        7.802     >  7.789   >  6.821    >  6.995    >  6.252    5.740    3.738     ✓
    HEP-BL62        6.328     >  6.321   >  5.335       5.330    >  4.883    4.759    2.484     ✓
    HEP-P250        5.848     >  5.848   >  4.850       4.856    >  4.416    4.365    2.207     ✓
    SP              23.02     >  22.743  >  22.208   >  21.708      21.900   20.406   18.316    x
    Entropy21       25.738    >  25.566  >  25.124   >  24.549      25.351   23.838   25.383    x
    Valdar          22.968    >  22.69   >  22.149   >  21.612      21.838   20.318   18.275    x
    Trident         19.935    >  19.628  >  19.333   >  19.004      19.349   18.075   20.09     x

The Manual scoring row provides the most reliable ranking via visual inspection of the alignment results. The check-mark (✓) at the end of a row indicates a reliable and consistent ranking, and the cross (x) indicates an unreliable ranking.