
A Reliable Metric for Quantifying Multiple Sequence Alignment
Ken D. Nguyen and Yi Pan
Department of Computer Science
Georgia State University
PO Box 3994
Atlanta, Georgia 30303
Email: knguyen@cs.gsu.edu and pan@cs.gsu.edu

Abstract-Aligning multiple homologous protein sequences (MSA) helps biologists identify the relationship between species and possibly predict the structure and functionality of the protein. However, optimally aligning multiple sequences has been proven to be intractable by Wang and Jiang in [1]. For the last two decades, researchers have often taken different heuristic approaches to solve this problem without a consistent and reliable scoring method. In this paper, we have developed a scoring metric, Hierarchical Expected matching Probability (HEP), that measures the probability of residue mutations and the biological correctness of MSA results. Both theoretical and manually selected test sequences have shown that our quantitative metric is more reliable, consistent, and biologically meaningful than many commonly used scoring metrics.

I. INTRODUCTION

In a protein sequence, conservation regions (i.e. sequence motifs) are groups of biologically significant amino acids that are conserved throughout evolution. Correctly detecting the sequence motifs helps identify the evolutionary relationships between species or predict the structure and functionality of a protein. This task is critical since proteins play various important roles in the body, such as antibodies, enzymes, etc. A protein may be associated with a particular disease; therefore, understanding the structure and functionality of the protein helps in finding a cure for such a disease. The most common way to detect sequence motifs is arranging homologous sequences, or multiple sequence alignment (MSA), so that similar conserved regions between these sequences are aligned into the same columns. However, optimally aligning conserved regions across all the aligning sequences was proven to be intractable by Wang and Jiang in [1].

When aligning the sequences, a gap (-) can be inserted into the original sequences to maximize the number of columns of similar type residues. The MSA can be generalized as follows: Let $W = \{a_1, \ldots, a_i, \ldots, a_n\}$, $i \ge 1$, be a set of amino acid symbols. A sequence $s_i = a_i \ldots a_j \ldots a_k$ is a string of $n$ amino acid symbols. An MSA is a transformation of $k$ sequences $\{s_1, s_2, \ldots, s_k\}$ into $\{s'_1, s'_2, \ldots, s'_k\}$, where $s'_i$ is $s_i$ with GAP (-) symbols inserted so as to maximize the similar regions across the sequences. In general, when a gap is inserted into a sequence a penalty is applied to the alignment. The weight of the penalty depends on the location of the insertion. For example, let the sequences to be aligned be $s_1$: GLISVT and $s_2$: GIVT, then a possible alignment is

A(s_1, s_2) =
  s'_1: G L I V S T
  s'_2: G ...

To align two sequences, dynamic programming [2], [3] gives the optimal alignment of the sequences in $O(n^2)$ time, where $n$ is the length of the sequences. Dynamic programming run-time and space grow exponentially with the number of sequences to be aligned. Commonly, the MSA problem is solved by heuristic methods [4]-[11], or by probabilistic methods [12], [13]. The general approaches often fall into two categories: global alignment and local alignment. In global alignment, overall score optimization is the central force that makes an alignment span the entire length of the sequences. Local alignment focuses on aligning small regions of the sequences that are believed to be conserved and have a significant biological meaning, and this technique seldom yields an optimal alignment. Since each method is somewhat different, each has its own scoring scheme to maximize its matching and alignment scores. These scoring schemes are biased, inconsistent, and often do not reflect the biological meaning of each conservation region of the sequence.

In all MSA algorithms, the basic step of assembling the result is qualitatively comparing the goodness of matches between two or more residues in a column. The placement of a residue into a column relies on this measurement, and the correctness of the MSA result depends heavily on how reliable the quantifying function is. Therefore, developing a scoring metric for the MSA problem is as significant as aligning multiple sequences. This is the objective of our work.

This paper is organized as follows: we analyze various existing scoring methods; a quantitative measurement method, called the Hierarchical Expected matching Probability (HEP) scoring metric, is proposed; a performance analysis of this new method is presented, followed by the conclusion and possible enhancements of the method.

II. EXISTING SCORING METHODS

In 2001, W. S. J. Valdar [14] comprehensively analyzed 18 different MSA scoring methods and categorized them into 6 groups based on their scoring characteristics and the information they utilize while scoring. These groups are symbol frequency,

1-4244-1509-8/07/$25.00 ©2007 IEEE 788


symbol entropy, stereochemical property, mutation data, and weighted score. A theoretical set of sequences, shown in Table III, for testing the validity of a scoring metric was proposed along with a list of criteria that a quantified measurement of conservation should satisfy. The criteria are as follows:
1) Mathematical properties: mapping aligned columns into a continuous and bounded output space
2) Stereochemical properties: recognizing and utilizing conservative replacements based on chemical and physical properties
3) Amino acid frequency: utilizing amino acid frequencies in a column
4) Gaps: providing appropriate gap penalties for insertions and deletions
5) Sequence weighting: accounting for sequences with different weights, and
6) Simplicity: easy to implement and use.

The summary of these 18 scoring methods and their performances on the test data are shown in Table II, where the check-mark (✓) represents the capability to rank the column correctly, and the numbers in the last column represent the mentioned requirements that are satisfied.

Among these 18 analyzed scoring methods, the Sum-of-pair method is the most commonly used. The Sum-of-pair scoring method for aligning $n$ sequences of max length $n$ is defined as:

$$S(x) = \sum_{i}^{n} \sum_{j \neq i}^{n} S(s_i(x), s_j(x)), \tag{1}$$

where $s_i(x)$ is the amino acid at column $x$ in the $i$th sequence, and $S(a, b)$ is the score for aligning two amino acids $a$ and $b$. However, defining an adequate gap penalty for this method is difficult. For instance, aligning a segment of sequence $s_i$ = ...MI... with pre-aligned sequences $A$ = {$s_j$ = ...LLM... and $s_k$ = ...IMI...} yields more than one alignment. If the BLOSUM62 [15] substitution matrix and a gap penalty of -1 are used to score the alignments, the alignment

A(s_i, s_j, s_k) =
  s_j: L L L
  s_k: I M I
  s_i: M I

yields a score of -5, which is higher than the alignment

A(s_i, s_j, s_k) =
  s_j: L L L
  s_k: I M I
  s_i: M I

yielding a score of -8. If, instead, the penalty score is chosen to be -10, both of these alignments yield the same score. In addition, the SP scoring method calculates all-pair scores of residues in a column at every step of alignment, which increases the runtime of an MSA algorithm by $O(n^2)$ for aligning $n$ sequences of length $n$. Because it is simple, despite its inconsistency and slow-down factor, SP combined with biological and probabilistic weight factors is the most commonly used scoring method in MSA tools.

Valdar and Thornton [14] attempted to resolve the limitations of existing scoring methods by proposing a conservation score as follows:

$$S(x) = \lambda \sum_{i}^{n} \sum_{j > i}^{n} w_i w_j M(s_i(x), s_j(x)), \tag{2}$$

where $n$ is the number of sequences and $\lambda$ scales $S(x)$ to the range [0,1], that is,

$$\lambda = \frac{1}{\sum_{i}^{n} \sum_{j > i}^{n} w_i w_j}, \tag{3}$$

where $w_i$ is the weight of sequence $s_i$ such that

$$w_i = \frac{1}{n - 1} \sum_{j \neq i}^{n} d(s_i, s_j), \tag{4}$$

where $d(s_i, s_j)$ is the distance between sequences $s_i$ and $s_j$ and is calculated as

$$d(s_i, s_j) = 1 - \frac{1}{n \times align_{ij}} \sum_{x \in align_{ij}} M(s_i(x), s_j(x)), \tag{5}$$

where $align_{ij}$ is the set of all positions that manifest an amino acid in one or both of $s_i$ and $s_j$, and $n \times align_{ij}$ is the size of this set.

Matrix $M$ is a linear transformation of a substitution matrix $m$ such that all values in $M$ are in the range [0,1]. $M$ is defined as:

$$M(a, b) = \begin{cases} \dfrac{m(a, b) - \min(m)}{\max(m) - \min(m)}, & \text{if } a \neq \text{gap and } b \neq \text{gap} \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

Valdar and Thornton's method correctly ranks columns (j) and (k) in the test sequences shown in Table III; however, this method inherits the inconsistency found in the sum-of-pair score.

An attempt to generalize a scoring method that quantifies both residue conservation and divergence, called Trident [14], is as follows: for a position $x$, the conservation score is calculated as:

$$S(x) = (1 - t(x))^{\alpha} (1 - r(x))^{\beta} (1 - g(x))^{\gamma}, \tag{7}$$

where $t$ is a function of normalized symbol diversity relating to Shannon's entropy, $r$ is a function of residue diversity, $g$ is a function of gap cost, and $\alpha$, $\beta$, and $\gamma$ are variables. The Trident method is far more complex than other methods due to the problems of selecting appropriate values for the variables and the complexity of the three embedded functions.

III. HIERARCHICAL EXPECTED MATCHING PROBABILITY SCORING METRIC (HEP)

To align $k$ sequences of length $n$ optimally, we explore $k$ dimensions of solution space, getting the optimal solution in $O(n^k)$ run time. Fundamentally, the probability of aligning a residue $a$ from sequence $s_i$ to a residue $b$ from sequence $s_j$ is $p(a, b) < 1$. The probability of a residue from each sequence aligned into one column $k$ is $P(k) = \prod_{i}^{k} p(a, b)$, $P(k) \to 0$. This probability decreases exponentially as the number of aligning sequences increases. Therefore, the scoring function



should reward aligned column scores inversely to the matching probability function, because a smaller random matching probability in a column indicates a possible conservation between the sequences.

In addition, the overall alignment length should be as small as possible; this means that gaps in the sequences are minimized. A good MSA algorithm maximizes the homogeneous residue columns and minimizes the length of the overall alignment. MSA is not simply about aligning similar residues of aligning sequences, maximizing the matching score, but is also about biologically aligning the meaningful "true" conservation regions (motifs) together. This aspect of MSA should be decided by biologists, and a pure mathematical metric such as the scoring methods in the symbol frequency or symbol entropy groups is generally insufficient. Furthermore, MSA algorithms almost always assemble their alignment by aligning a few (or a few pairs of) sequences at a time. A scoring metric that gives a score only after the alignment has been assembled is not sufficient. A good scoring metric should be flexible enough so that MSA algorithms are able to use it to progressively align and assemble their alignments. We propose a scoring metric, namely Hierarchical Expected matching Probability (HEP), that quantifies residue matching divergence via the amino acid class covering hierarchy (AACH) [11], [16] and residue matching conservation via an inverse random matching probability function. The cardinalities for the AACH are generated from any given biologically meaningful scoring matrix, substitution matrix, or any set of quantitative values.

[Fig. 1 tree diagram (BLOSUM62 scoring tree, levels 1-3) not reproduced. Leaf classes: C I V L M F W Y H N D E Q K R S T A G P, with matching cardinalities 13 8 8 8 9 10 15 11 12 10 10 9 9 9 9 8 9 8 10 11.]
Fig. 1. HEP scoring tree generated from BLOSUM62 substitution matrix. The matching cardinality of each amino acid class is shown at the bottom of the corresponding class symbol.

Motif II
NVVO: t i L D L g d a y
Motif III
SFV1: t t L D L t n g f L P Q L
Fig. 2. Matching a Leucine from NVVO with a similar amino acid in sequence SFV1. The dotted arrow shows a mismatch, the dashed arrow shows a better match, and the solid arrow shows the best match. Sequences adapted from [19].

A. Quantifying Residue Divergence via AACH Scoring Tree

Scoring matrices and substitution matrices such as PAMs (point of accepted mutation - [17], [18]) or BLOSUMs (blocks substitution matrix - [15]) provide all pairwise matching scores for the amino acids. To quantify the divergence of two or more residue types, we use the average sum of all pairs scores. The details of the algorithm are as follows:
1) Scale scoring/substitution matrix: the minimum score in a scoring matrix can be negative, zero, or positive. We want the minimum score to start at zero so that when we calculate the probability, we have a meaningful value (probability must be between 0 and 1). One way to do this is to subtract the minimum score from every score in the matrix. For example, BLOSUM62 has a minimum score of -4 for matching cysteine and glutamate. We subtract -4 from every value in the matrix; thus, the score for matching cysteine and glutamate becomes zero, and all other matching scores are scaled up by 4.
2) Scoring AACH leaves: the score of matching a residue to itself, the leaves of the AACH tree, is extracted from the scoring matrix.
3) Scoring AACH internal nodes: starting from the leaf nodes, skipping the leaves, to the root of the tree, calculate the sum of all pair-wise scores from the scoring matrix for each internal node and divide the summed score by $\frac{n \times (n-1)}{2}$, since there are $\frac{n \times (n-1)}{2}$ possible pair-wise scores for a node with $n$ distinct residues.
4) Calculate the divergence: a more diverged amino acid class should have a smaller matching score than the less diverged ones. As we traverse to the root of the AACH tree, the divergence increases. Thus, we choose to divide the AACH node scores by their levels to reflect their divergence.
5) Adjusting the calculated scores by the root score: matching at the root, where all amino acids match, is biologically irrelevant. Thus, we want the root score to be zero to simplify future calculation. If the root score is not zero, all calculated scores in the AACH tree are adjusted by the root score.
6) Round up the score (optional): round up to the nearest integer.

For example, the score of node c in the AACH tree using the BLOSUM62 matrix is calculated as follows:

$$c = \left\lceil \frac{S(I,V) + S(I,L) + S(I,M) + S(V,L) + S(V,M) + S(L,M)}{3 \times \frac{4(4-1)}{2}} \right\rceil = \left\lceil \frac{7 + 6 + 5 + 5 + 5 + 2}{18} \right\rceil = 2,$$

where $S(a, b)$ is the matching score between two residues $a$ and $b$ from the BLOSUM62 matrix. The pairwise matching scores shown in this example have been scaled up by 4 (subtracting the minimum score of -4 from every score in the matrix).

B. The scoring metric

The score of an alignment is the summation of all column scores in the alignment.

$$Score(A_k) = \sum_{i=1}^{|A_k|} cScore(i) \tag{8}$$



where:
$k$: number of sequences, $k \ge 2$
$A_k$: an alignment of $k$ sequences
$|A_k|$: length of the alignment

The column score is defined as the sum of all divergence scores between residue conservation groups, and the residue conservation group score is the inverse of the conserved residue matching probability function. The equation for calculating the column score is as follows:

$$cScore(i) = \begin{cases} 0, & \text{if column } i \text{ matches only at tree root} \\ \dfrac{1}{(k-1)^{\Gamma_i}} \displaystyle\sum_{l=1}^{\log|T_i|} \sum_{j=1}^{|T_i^l|} \frac{(count(node_j) - 1)^{c_j}}{l}, & \text{otherwise} \end{cases} \tag{9}$$

The score (weight) of aligning column $i$ of alignment $p$, $p[i]$, and column $j$ of alignment $q$, $q[j]$, is the column score of the column generated by merging all the residues in $p[i]$ and $q[j]$. The alignment score equation then becomes:

$$Score(A_k) = \frac{1}{(k-1)^{\Gamma}} \times \sum_{i=1}^{|A_k|} \sum_{l=1}^{\log|T_i|} \sum_{j=1}^{|T_i^l|} \frac{(count(node_j) - 1)^{c_j}}{l}, \quad \forall\, \Gamma_i > 0 \tag{10}$$

where:
$T_i$: the size of the score tree for column $i$, i.e. the active scoring tree.
$T_i^l$: the size of the score tree at level $l$.
$c_j$: the cardinality of node $j$ in the active scoring tree.
$\Gamma_i$: the maximum cardinality of the active scoring tree built for column $i$.
$count(node_j)$: the number of residues matched at $node_j$ in the active scoring tree.

C. Proof of HEP scoring metric consistency

In this section, we prove that any aligned sequence column with a lower degree of divergence will have a much higher alignment score than any column with a higher degree of divergence.

Given a column of $k$ aligned sequences, $k > 1$, containing $n$ residues, $n \le k$. Let $c_l > 0$ be the cardinality of two residues matching at level $l$, $1 \le l \le \log n$, $c_l \ge 1$, and $c_l - c_{l+1} > 0$. The score of the column is 1 if all the residues are homogeneous; otherwise, the score of the column will be (n x)FX [(n- 1-al)cl +t (n 1 a2)cl + + (n 1 )c*c*+ +* (ai ±aj -1 - .++ (n5-1) 0 Vct > landa > 1,i j,1>i,j < n and 1 < t < logn.

Let $c = \min(c_l)$ and $\alpha = c_l - c_{l+1}$. If there are only two types of residues (divergence of degree 2), the column score will be (k- x -1)c+ (x 1)c+ n 2 1< x < k and a > 0. To prove that the homogeneous column has a higher score, we need to prove (n )c >(n - x -)c+ (x I)c +(-) I > 2. Since $(n-1)^c > 2 \times [(n-x-1)^c + (x-1)^c]$, $\forall c > 1$. To prove (n -1)c > (n-2 x_ )c + (x 1)c + (n - is to prove ( Z x (n l)c > (n -1) . This is true $\forall l > 1$.

Similarly, the proof can be extended to any degree of divergence $d$, $d \le n$. If there are $d$ types of amino acids at the leaf level of the scoring tree, the column score will be smaller than that of a similar tree with divergence degree $d-1$, because splitting one of the leaf groups into two groups in a tree of divergence degree $d-1$ makes it a tree of divergence degree $d$.

D. Examples

Given the following sequences: $s_1$ = NNN, $s_2$ = NND, and $s_3$ = NDE. A possible alignment is

A =
  s_1: NNN
  s_2: NND
  s_3: NDE

Using the matching tree built from the BLOSUM62 matrix, shown in Figure 1, the alignment score derived from Equation 8 is:

$$score(A) = \frac{1}{(3-1)^{10}} \left( \left[(3-1)^{10}\right] + \left[(2-1)^{10} + \frac{(3-1)^{3}}{3}\right] + \left[\frac{(2-1)^{2}}{2} + \frac{(2-1)^{2}}{2} + \frac{(3-1)^{3}}{3}\right] \right) = 1.00716$$

(Figure 3 shows the active scoring tree for the third column in this example.)

[Fig. 3 tree diagram not reproduced. Level 1: N:1, D:1, E:1 with cardinalities 10, 10, 9; level 2: l:2, k:2; level 3: one node with count 3.]
Fig. 3. HEP-BLOSUM62 active scoring tree for the 3rd column. The number next to the node symbol is the node count and the number below the symbol is the node cardinality.

E. Sequence weighting factor

Sequence weighting is a factor that should be included in a scoring method. Aligning a residue $a$ from motif $x$ in sequence $s_i$ to a similar residue $b$ in the same core motif in sequence $s_j$ should have more weight than aligning $a$ to $b$ in a different motif or to any other residue from other locations, as in Figure 2.

This depends on how much biological information about the sequences we have. In some instances, there are groups of similar sequences among the aligning sequences. Normalizing this redundancy will reduce bias in MSA results. However, biologists may include a subset of similar sequences into



TABLE I
THEORETICAL CONSERVATION TEST SEQUENCES [14]

          Columns
Seq.  (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k)
1      D   D   D   D   D   D   I   P   D   L   L
2      D   D   D   D   D   D   I   P   V   L   L
3      D   D   D   D   D   D   I   P   Y   L   L
4      D   D   D   D   D   D   I   P   A   L   L
5      D   D   D   D   D   D   L   W   T
6      D   D   D   E   E   E   L   W   K
7      D   D   D   E   E   E   L   W   P
8      D   D   D   E   E   E   L   W   C
9      D   D   D   E   E   F   V   S   R
10     D   E   F   E   F   F   V   S   H

Each labelled column represents a residue position in a multiple sequence alignment. Amino acids are identified by their one-letter code. Column (k) comes from an alignment of 10 sequences, whereas column (j) comes from an alignment of only 4 sequences. (a) > (b) > (c) > (d) > (e) > (f), (g) > (h) > (i), and (j) > (k).

Seq.    I        II         III       IV        V       VI
HT13    pvkKa--  t-IDLkdaf  -LPQG-fk  qYMDDIl   shGL--  kFLGqii
NVVO    ikkK---  tiLDIgday  -LPQG-wk  -YMDDIyi  qyGFM-  kWLGfel
SFV1    pvpKp--  ttLDLtngf  -LPQG-f   aYVDDIyi  naGYVv  eFLGfni
HERVC   pvpKp--  tcLDLkdaf  -LPQR-fk  qYVDDLll  tvGIRc  cYLGfti
GMG1    mvrKa--  tkVDVraaf  -CPFG-la  aYLDDIli  --GLN-  kYLGfiv
GM17    v-pKkqd  ttIDLakgf  -MPFG-lk  vYLDDIiv  --NLK-  tFLG-hv
MDG1    lvpKksl  scLDLmsgf  -LPFG-lk  lYMDDLvv  --NLK-  tYLG-hk
MORG    vvrKk--  ttMDLqngf  -APFG-fk  lYMDDIiv  --GLK-  hFLG-hi
CAT1    lvdKpkd  eqMDVktaf  kSLYG-lk  lYVDDMli  --EMK-  rILGidi
CMC1    titKrpe  hqMDVktaf  kAIYG-lk  lYVDDVvi  ---KR-  hFIGiri
CST4    ftkKrng  t-LDInhaf  kALYG-lk  vYVDDCvi  inKLK-  dILGmdl
C1095   fnrKrdg  tqLDIssay  kSLYG-lk  lFVDDMil  itTLKk  dILGlei
NDMO    mihKt--  afLDIqqaf  gVPQGsvl  tYADDTav  tsGL--  kYLGitl
NL13    lipKp--  s-IDAekaf  gTRQGcpl  lFADDMiv  vsGYK-  kYLGiql
NLOA    fipKa--  afLDIegaf  gCPQGgvl  gYADDIvi  evGLN-  kYLGvi-
NTC0    vlrKp--  amLDGrnay  gVRQGmvl  aYLDDVtv  alGIE-  rVLGagv
ICD0    eipKp--  vdIDIk-gf  gTPQGgil  rYADDFki  rlDLDi  dFLGfkl
IAGO    fkkKt--  ieGDIks-f  gVPQGgii  rYADDWlv  elKITl  -FLGvnl
ICS0    wipKp--  ldADIsk-c  gTPQGgvi  rYADDFvi  emGLEl  nFLGfnv
IPLO    yipKs--  leADIr-gf  gVPQGgpi  rYADDFvv  srGLVl  dFVGfnf

Fig. 4. The RT OSM sequences. The six motifs of the RT OSM are indicated by Roman numerals (I-VI). The bold and capitalized letters represent the core amino acids of each motif. Adapted from [19].

the aligning sequences to guide the alignment toward known sequences. Normalizing and factorizing out the redundancy in this case is not useful and may lead to unwanted results. Therefore, the weight factor should come from a biologist or from a function that measures the significance of a residue in a motif to a motif, rather than the redundancy of similar sequences. If the weight function is known, we can combine this information into our scoring metric. The column score in Equation 9 will be

$$cScore(i) = w(i) \times \begin{cases} 0, & \text{if column } i \text{ only matches at tree root} \\ \dfrac{1}{(k-1)^{\Gamma_i}} \displaystyle\sum_{l=1}^{\log|T_i|} \sum_{j=1}^{|T_i^l|} \frac{(count(node_j) - 1)^{c_j}}{l}, & \text{otherwise} \end{cases} \tag{11}$$

where $w(i)$ is the motif weight of residues in column $i$ and is calculated as

$$w(i) = \begin{cases} 1, & \text{iff residues are from a similar motif and are biologically and locational-order equivalent} \\ \alpha, & \text{iff residues are from a similar motif and are biologically equivalent} \\ \beta, & \text{otherwise} \end{cases} \tag{12}$$

where $0 < \beta < \alpha < 1$.

In the next section, the performance of the HEP scoring metric is evaluated.

IV. EVALUATION OF THE SCORING METRIC

We use two sequence sets that have been used widely to validate scoring methods to test our scoring function. The theoretical data set is from [14] and the manually selected data set is from [19].

A. Theoretical sequences

Table III includes the theoretical sequences proposed by Valdar [14] for testing the goodness of MSA scoring functions. In the original sequences, column (d) in Table I was considered to have a higher conservation score than column (c) without supporting justification. Even though the chance that aspartic acid (D) will be mutated to glutamic acid (E) is greater than the chance that aspartic acid (D) will change to a phenylalanine (F), the probability for all five residues (either D or E) in column (d) mutating to other amino acids is $\prod^{5} P_{DE} = (P_{DE})^5$ [where $P_{DE}$ is the probability that D is mutated into E], which is much smaller than the probability of residue D in column (c) mutating to residue F. This result follows the decomposition of the BLOSUM62 substitution matrix and its log-odds scoring function from [20]

$$S(a, b) = \frac{1}{\lambda} \log \frac{P_{ab}}{f_a f_b} \tag{13}$$

where $S(a, b)$ is the score of aligning residues $a$ and $b$, $\lambda = 0.347$ is a constant, and $f_i$ is the background frequency of residue $i$. All amino acid background frequencies are derived from the existing sequences [20]. With the background frequencies $f_D = 0.0539$, $f_E = 0.0539$, and $f_F = 0.0469$ obtained from [20], we found that $P_{DE} = 5.8\mathrm{E}{-3}$ and $P_{DF} = 9.0\mathrm{E}{-4}$. Thus, $\prod^{5} P_{DE} = 6.56357\mathrm{E}{-12} < P_{DF} = 9.0\mathrm{E}{-4}$. Thus, column (c) is more informative since it indicates that D is conserved and possibly a core-motif residue across nine out of the ten sequences, while column (d) fails to indicate which residue is the most conserved residue among these sequences. Therefore, we rearrange these columns in the order of mutation probability.

B. RT-OSM sequences

In 1999, Hudak and McClure [19] selected a set of twenty reverse transcriptase sequences, shown in Figure 4, to test the reliability of seven commonly used MSA tools. These sequences are called order-specific-motifs (OSM) because they contain a set of motifs occurring in a specific order among the sequences.

C. Results

For the theoretical sequence set, we rank the sequence columns using HEP and compare its results with the predefined ranks. For the RT-OSM sequence set, we utilize available MSA tools to align the sequences and then rank the resulting alignments with HEP. Finally, we compare the HEP results with our manual ranking.
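The mutation-probability comparison in Section IV-A can be checked numerically with a short Python sketch. It inverts the log-odds relation of Equation 13 to recover the pair probability from a substitution score. The sketch assumes a natural logarithm, the BLOSUM62 entries S(D,E) = 2 and S(D,F) = -3, and a background frequency of 0.0469 for F (the value consistent with the quoted P_DF); the function name `pair_probability` is illustrative only.

```python
import math

LAMBDA = 0.347  # scale constant quoted in the text for BLOSUM62


def pair_probability(score, f_a, f_b, lam=LAMBDA):
    """Invert S(a,b) = (1/lam) * ln(p_ab / (f_a * f_b)) to obtain p_ab.

    Assumption: the logarithm in Equation 13 is the natural logarithm.
    """
    return f_a * f_b * math.exp(lam * score)


# Background frequencies as used in the text (f_F = 0.0469 is an assumption).
f_D, f_E, f_F = 0.0539, 0.0539, 0.0469

# BLOSUM62 substitution scores (assumed inputs): S(D,E) = 2, S(D,F) = -3.
p_DE = pair_probability(2, f_D, f_E)
p_DF = pair_probability(-3, f_D, f_F)

# Probability that all five residues in column (d) are D/E substitutions.
five_DE = p_DE ** 5

print(f"P_DE     = {p_DE:.2e}")
print(f"P_DF     = {p_DF:.2e}")
print(f"(P_DE)^5 = {five_DE:.2e}")
print("(P_DE)^5 < P_DF:", five_DE < p_DF)
```

Under these assumptions the sketch gives values close to the quoted P_DE of 5.8E-3 and P_DF of 9.0E-4, and the five-fold product is many orders of magnitude smaller than P_DF, which is the inequality the argument relies on.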



1) Ranking the theoretical sequence set: Table III summarizes the result of our measurements. We run three tests, one for each of the generated scoring trees in Section III-A, where HEP-PIMA is the HEP score using PIMA cardinality, HEP-P250 is the HEP score using the generated PAM250 scoring tree, and HEP-BL62 is the HEP score using the generated BLOSUM62 scoring tree. The correct ranking of the column scores is: (a) > (b) > (c) > (d) > (e) > (f), (g) > (h) > (i), and (j) > (k). All three tests show that our scoring method correctly ranks the columns in these orders. The column (b) score is greater than the column (c) score in all three tests even though both of them have the same degree of convergence.

2) Ranking the RT OSM sequences: A multiple sequence alignment tool is expected to detect the motifs in the aligning sequences and align them together. We use the percentage of correctly aligned motifs and their core symbols to rank multiple sequence alignment tool results. The pre-aligned and manually adjusted RT-OSM set (Benchmark) yields the top rank (100%). We remove the gaps in the RT-OSM sequence set and utilize the following multiple sequence alignment tools, PIMA, DCA, CLUSTALW, DIALIGN2, MAFFT2, and T-COFFEE [6], [7], [9], [10], [13], [21], to align the set.

An alignment is biologically meaningful only if the core motifs are correctly aligned. Therefore, the alignment results are visually inspected for this quality. The percentages of correctly aligned core motifs of the results are used to rank them. This ranking information, depicted in Table V as the manual row, is used to validate the correctness of other scoring methods. In this manual ranking, all non-motif columns of the alignment results are ignored due to their non-significant information contribution. The correct ranking order of the results is Benchmark > MAFFT2 > CLUSTALW - T-COFFEE > PIMA > DCA > DIALIGN2. To further confirm these rankings, we utilize the BAliBASE SP [22] (BAliBASE-SP) score and total column score (BAliBASE-TC), provided to score BAliBASE 3.0 benchmarks, to score the MSA results against the original RT-OSM alignment. The BAliBASE SP is the SP score discussed in previous sections. The BAliBASE TC is the percentage of similarity between the result MSA columns and the benchmark columns. Both of these reference scores give a similar ranking order.

Next, we score these alignment results using the following scoring methods: Sum-of-pair (SP), Entropy21¹, Valdar, Trident [2], [14], [23], HEP-PIMA, HEP-P250, and HEP-BL62 (as in the previous section). These scoring methods are tested using their default parameters. A reliable scoring method should be biased toward the motif columns and give a ranking similar to the manual ranking. These ranking scores are presented in Table V. The sum-of-pair and Valdar methods incorrectly rank T-COFFEE, PIMA, and DCA. The Trident method is not able to rank T-COFFEE, PIMA, DCA, and DIALIGN2 correctly. The Entropy21 method only ranks the benchmark and MAFFT2 correctly. Observing the ranking of our method, HEP-PIMA ranks CLUSTALW better than T-COFFEE. This behavior could be the result of the even distribution of the cardinalities in the PIMA scoring tree, where amino acid classes at the same level get the same cardinality. This feature simplifies the math but may yield an undesired result. This is why the PIMA scoring scheme has been criticized for being ad hoc and not biologically meaningful. The remaining HEP-P250 and HEP-BL62 rank the results consistently with the actual ranking order. Overall, the rankings of HEP-PIMA, HEP-P250, and HEP-BL62 are more consistent and reliable than those of any other tested method.

¹Entropy21 is the normalized Shannon's entropy where the residues are classified into 21 types (20 standard amino acid types and a gap type).

V. CONCLUSION

The biological reliability and consistency of an MSA result largely depends on the scoring metric used. The lack of a simple and reliable scoring method that combines the biological, stereochemical, and physicochemical properties, amino acid symbol occurrence frequency and probability, gap penalty, and sequence weight factors of aligning protein (or DNA/RNA) sequences makes aligning multiple sequences (MSA) a very complicated task. We have carefully analyzed this problem and developed a scoring method, Hierarchical Expected matching Probability (HEP), shown to be simple, flexible, and biologically reliable through both theoretical and manually selected tests. Besides guiding the MSA algorithm to a more biologically meaningful alignment, the HEP method can be reliably used by biologists for selecting the best MSA tool for their tasks. A possible future research direction is to implement an MSA tool utilizing this scoring function.

ACKNOWLEDGMENT

This research was supported in part by the National Science Foundation (NSF) under Grants CCF-0514750 and CCF-0646102.

REFERENCES
[1] L. Wang and T. Jiang, "On the complexity of multiple sequence alignment," J Comput Biol, vol. 1, no. 4, pp. 337-48, 1994.
[2] D. Lipman, S. Altschul, and J. Kececioglu, "A Tool for Multiple Sequence Alignment," Proceedings of the National Academy of Sciences, vol. 86, no. 12, pp. 4412-4415, 1989.
[3] S. Needleman and C. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J Mol Biol, vol. 48, no. 3, pp. 443-53, 1970.
[4] D. Feng and R. Doolittle, "Progressive sequence alignment as a prerequisite to correct phylogenetic trees," J Mol Evol, vol. 25, no. 4, pp. 351-60, 1987.
[5] C. Notredame and D. Higgins, "SAGA: sequence alignment by genetic algorithm," Nucleic Acids Research, vol. 24, no. 8, pp. 1515-1524, 1996.
[6] C. Notredame, D. Higgins, and J. Heringa, "T-Coffee: A novel method for fast and accurate multiple sequence alignment," J. Mol. Biol, vol. 302, no. 1, pp. 205-217, 2000.
[7] J. Thompson, D. Higgins, T. Gibson et al., "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Res, vol. 22, no. 22, pp. 4673-4680, 1994.
[8] R. Edgar, "MUSCLE: multiple sequence alignment with high accuracy and high throughput," Nucleic Acids Research, vol. 32, no. 5, pp. 1792-1797, 2004.
[9] J. Stoye, "Multiple sequence alignment with the Divide-and-Conquer method," Gene, vol. 211, no. 2, p. 56, 1998.



[10] R. Smith and T. Smith, "Pattern-induced multi-sequence alignment
(PIMA) algorithm employing secondary structure-dependent gap penal-
ties for use in comparative protein modelling," Protein Engineering
Design and Selection, vol. 5, pp. 35-41, 1992.
[11] R. F. Smith and T. F. Smith, "Automatic Generation of Primary Sequence
Patterns from Sets of Related Protein Sequences," Proceedings of the
National Academy of Sciences, vol. 87, no. 1, pp. 118-122, 1990.
[12] C. Do, M. Mahabhashyam, M. Brudno, and S. Batzoglou, "ProbCons:
Probabilistic consistency-based multiple sequence alignment," Genome
Res., vol. 15, pp. 330-340, 2005.
[13] K. Katoh, K. Misawa, K. Kuma, and T. Miyata, "MAFFT: a novel
method for rapid multiple sequence alignment based on fast Fourier
transform," Nucleic Acids Research, vol. 30, no. 14, pp. 3059-3066,
2002.
[14] W. Valdar, "Residue conservation in the prediction of protein-protein
interfaces," Ph.D. dissertation, University College London, 2001.
[15] S. Altschul, "Amino Acid Substitution Matrices from an Information
Theoretic Perspective," J. Mol. Bd, vol. 219, pp. 555-565, 1991.
[16] W. Taylor, "The classification of amino acid conservation." J Theor Biol,
vol. 119, no. 2, pp. 205-18, 1986.
[17] M. Dayhoff, R. Schwartz, and B. Orcutt, "A model of evolutionary
change in proteins. Matrices for detecting distant relationships," Atlas
of Protein Sequence and Structure, vol. 5, no. Suppl 3, pp. 345-358,
1978.
[18] S. Henikoff and J. Henikoff, "Amino Acid Substitution Matrices from
Protein Blocks," Proceedings of the National Academy of Sciences,
vol. 89, no. 22, pp. 10915-10919, 1992.
[19] J. Hudak and M. McClure, "A comparative analysis of computational
motif-detection methods," Pacific Symposium on Biocomputing, vol. 4,
pp. 138-149, 1999.
[20] S. Eddy, "Where did the BLOSUM 62 alignment score matrix come
from?" Nature Biotechnology, vol. 22, no. 8, pp. 1035-1036, 2004.
[21] B. Morgenstern, "DIALIGN 2: improvement of the segment-to-segment
approach to multiple sequence alignment," Bioinformatics, vol. 15, pp.
211-218, 1999.
[22] J. D. Thompson, P. Koehl, R. Ripp, and O. Poch, "BAliBASE 3.0: latest
developments of the multiple sequence alignment benchmark," Proteins,
vol. 61, pp. 127-136, 2005.
[23] C. Shannon and W. Weaver, "The Mathematical Theory of Communication,"
Urbana: University of Illinois Press, vol. 97, 1949.



TABLE II
SUMMARIZED RESULTS OF ANALYZED SCORING METHODS

Scoring Method        Column Ranking (a)-(k)    Req. # satisfied
Kabat                 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓         3, 6
Jores                 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓         1, 3
Lockless              ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓     3
Schneider             ✓ ✓ ✓ ✓ ✓ ✓               1, 3, 6
Shenkin               ✓ ✓ ✓ ✓ ✓ ✓               1, 3, 6
Gerstein              ✓ ✓ ✓ ✓ ✓ ✓               1, 3, 6
Taylor                ✓ ✓ ✓ ✓ ✓ ✓               1, 2, 4, 6
Zvelebil              ✓ ✓ ✓ ✓ ✓ ✓ ✓             1, 2, 6
Karlin (SP)           ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓         1, 2, 3, 4, 6
Armon                 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓         1, 2, 4
Thompson (Clustal)    ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓         1, 4, 5
Lancet                ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓         1, 2, 3, 4
Mirny & Williamson    ✓ ✓ ✓ ✓ ✓                 1, 2, 3, 6
Goldstein             ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓       1, 2, 4, 6
Smith (PIMA)          ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓       1, 2, 4, 6
Valdar & Trident      ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓     1, 2, 3, 4, 5
HEP                   ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓       1, 2, 3, 4, 5, 6

This table summarizes the results for the scoring methods analyzed by Valdar in [14], together with the HEP scoring metric.
The last column lists the requirements from Section II that each method satisfies.

TABLE III
THEORETICAL SEQUENCE SET AND CONSERVATION SCORES

                                                      Columns
Seq.      (a)         (b)         (c)         (d)         (e)         (f)         (g)         (h)         (i)         (j)         (k)
1         D           D           D           D           D           D           I           P           D           L           L
2         D           D           D           D           D           D           I           P           V           L           L
3         D           D           D           D           D           D           I           P           Y           L           L
4         D           D           D           D           D           D           I           P           A           L           L
5         D           D           D           D           D           D           L           W           T                       -
6         D           D           D           E           E           E           L           W           K                       -
7         D           D           D           E           E           E           L           W           P                       -
8         D           D           D           E           E           E           L           W           C                       -
9         D           D           D           E           E           F           V           S           R                       -
10        D           E           F           E           F           F           V           S           H                       -

Methods                                               Column Scores
HEP-PIMA  1.0         5.6110E-1   5.5493E-1   4.0856E-2   2.5792E-2   2.0805E-2   9.5824E-3   8.2756E-3   5.3628E-5   1.0         4.1152E-3
HEP-BL62  1.0         3.0795E-1   3.0794E-1   6.0156E-4   3.0645E-4   3.0092E-4   3.1249E-4   7.0552E-8   1.2000E-12  1.0         1.5242E-4
HEP-P250  1.0         2.4332E-1   2.4330E-1   1.1891E-4   6.1343E-5   5.9447E-5   2.7911E-7   1.2000E-12  0.0000E+0   1.0         2.0908E-7

Each labeled column represents a residue position in a multiple sequence alignment. Amino acids are identified by their one-letter codes and gaps by a dash
("-"). The correct order of the column scores is (a) > (b) > (c) > (d) > (e) > (f), (g) > (h) > (i), and (j) > (k). Note: column (j) comes from an alignment
of only 4 sequences (no gaps), and the table cannot show all significant digits.
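The ordering requirement stated in the caption can be checked mechanically against the printed scores. The following is a minimal sketch, not the authors' code: the score values are copied from Table III, while the names `SCORES`, `CHAINS`, and `satisfies_chains` are hypothetical and introduced only for this illustration.

```python
# HEP column scores for columns (a)-(k), copied from Table III.
SCORES = {
    "HEP-PIMA": [1.0, 5.6110e-1, 5.5493e-1, 4.0856e-2, 2.5792e-2, 2.0805e-2,
                 9.5824e-3, 8.2756e-3, 5.3628e-5, 1.0, 4.1152e-3],
    "HEP-BL62": [1.0, 3.0795e-1, 3.0794e-1, 6.0156e-4, 3.0645e-4, 3.0092e-4,
                 3.1249e-4, 7.0552e-8, 1.2000e-12, 1.0, 1.5242e-4],
    "HEP-P250": [1.0, 2.4332e-1, 2.4330e-1, 1.1891e-4, 6.1343e-5, 5.9447e-5,
                 2.7911e-7, 1.2000e-12, 0.0, 1.0, 2.0908e-7],
}

LABELS = "abcdefghijk"

# The three independent ordering chains given in the caption:
# (a) > (b) > ... > (f), (g) > (h) > (i), and (j) > (k).
CHAINS = ["abcdef", "ghi", "jk"]


def satisfies_chains(scores):
    """Return True if the scores strictly decrease along every chain."""
    by_label = dict(zip(LABELS, scores))
    for chain in CHAINS:
        values = [by_label[label] for label in chain]
        # Any pair that fails to decrease breaks the required order.
        if any(left <= right for left, right in zip(values, values[1:])):
            return False
    return True


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(name, satisfies_chains(scores))
```

Because the chains are checked independently, a score for column (g) may exceed the score for column (f), as it does in the HEP-BL62 row, without violating the stated order.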

TABLE IV
RANKING OF THE ALIGNMENTS OF THE RT OSM SEQUENCE SET

Scoring Method  Benchmark  MAFFT2     ClustalW   T-COFFEE   DCA        PIMA       DIALIGN2
Manual          1.0      > 0.983    > 0.817      0.799    > 0.741      0.717      0.633
BAliBASE-SP     1.0      > 0.954    > 0.864    > 0.854    > 0.847      0.688      0.620
BAliBASE-TC     1.0      > 0.720    > 0.660    > 0.630    > 0.550      0.380      0.270
HEP-PIMA        7.802    > 7.789    > 6.821    > 6.995    > 6.252      5.740      3.738      ✓
HEP-BL62        6.328    > 6.321    > 5.335      5.330    > 4.883      4.759      2.484      ✓
HEP-P250        5.848    > 5.848    > 4.850      4.856    > 4.416      4.365      2.207      ✓
SP              23.02    > 22.743   > 22.208   > 21.708     21.900     20.406     18.316     ✗
Entropy21       25.738   > 25.566   > 25.124   > 24.549     25.351     23.838     25.383     ✗
Valdar          22.968   > 22.69    > 22.149   > 21.612     21.838     20.318     18.275     ✗
Trident         19.935   > 19.628   > 19.333   > 19.004     19.349     18.075     20.09      ✗

The Manual scoring row provides the most reliable ranking, obtained by visual inspection of the alignment results. A check mark (✓) at the end of
a row indicates a reliable and consistent ranking, and a cross (✗) indicates an unreliable ranking.
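One way to read the check marks and crosses is as a tier-consistency test against the Manual row. The sketch below is an illustration under stated assumptions, not the authors' procedure: it assumes that alignments not separated by ">" in the Manual row form a single tier, and that a metric is consistent when no alignment in a lower tier outscores one in a higher tier. The names `TIERS` and `respects_tiers` are hypothetical; the scores are copied from Table IV.

```python
# Alignment order: Benchmark, MAFFT2, ClustalW, T-COFFEE, DCA, PIMA, DIALIGN2.
# ASSUMPTION: tiers are inferred from where ">" appears in the Manual row of
# Table IV; alignments not separated by ">" are treated as one tier.
TIERS = [[0], [1], [2, 3], [4, 5, 6]]

# Scores copied from Table IV (only the rows that carry a mark).
SCORES = {
    "HEP-PIMA": [7.802, 7.789, 6.821, 6.995, 6.252, 5.740, 3.738],
    "HEP-BL62": [6.328, 6.321, 5.335, 5.330, 4.883, 4.759, 2.484],
    "HEP-P250": [5.848, 5.848, 4.850, 4.856, 4.416, 4.365, 2.207],
    "SP": [23.02, 22.743, 22.208, 21.708, 21.900, 20.406, 18.316],
    "Entropy21": [25.738, 25.566, 25.124, 24.549, 25.351, 23.838, 25.383],
    "Valdar": [22.968, 22.69, 22.149, 21.612, 21.838, 20.318, 18.275],
    "Trident": [19.935, 19.628, 19.333, 19.004, 19.349, 18.075, 20.09],
}


def respects_tiers(scores, tiers=TIERS):
    """True if no alignment in a lower tier outscores one in a higher tier.

    Ties are tolerated because the table prints rounded values.
    Comparing adjacent tiers is enough: the condition is transitive.
    """
    for upper, lower in zip(tiers, tiers[1:]):
        if min(scores[i] for i in upper) < max(scores[i] for i in lower):
            return False
    return True


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(f"{name:10s} consistent with manual tiers: {respects_tiers(scores)}")
```

Under this reading, the three HEP rows pass and the SP, Entropy21, Valdar, and Trident rows fail, which agrees with the marks printed in the table. A stricter test that required the exact manual order within each tier would give a different outcome, so the tier assumption matters.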

