©Aayudh Das

A multiple sequence alignment is a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned.

Sum of Pairs (SP) method


Methods for applying multiple sequence alignment
Three important methods are: 1. Profiles. 2. PSI-BLAST. 3. Hidden Markov Models (HMMs).

Profiles express the patterns inherent in a multiple sequence alignment of a set of homologous sequences. They have several applications, listed below.

Advantages
1. They permit greater accuracy in alignments of distantly related sequences.
2. Sets of residues that are highly conserved are likely to be part of the active site, and give clues to function.
3. The conservation patterns facilitate identification of other homologous sequences.
4. Patterns from the sequences are useful in classifying subfamilies within a set of homologues.
5. Sets of residues that show little conservation, and are subject to insertion and deletion, are likely to be in surface loops. This information has been applied to vaccine design, because such regions are likely to elicit antibodies that will cross-react well with the native structure.

Working procedure
The basic idea in using profile patterns to identify homologues is to match each query sequence from the database against the sequences in the alignment table, giving higher weight to positions that are conserved than to those that are variable. The matching must not be too strict, however, or interesting distant relatives may be missed.


A quantitative measure of conservation
For each position in the table of aligned sequences, take inventory of the distribution of amino acids.


In the alignment table, positions 26, 27 and 29 are highly conserved: agreement at these positions should contribute a very high score, and disagreement a very low one. For moderately conserved positions, such as position 28, we want a modest positive contribution to the score if the query sequence has an S or a W at this position, and a smaller contribution if it has a T or a Y. So the general idea is to score each residue of the query sequence according to the amino acid distribution at that position in the multiple sequence alignment table.

A simple approach would be to use the inventories directly as scores. The sequence VDFSAE would score 13+16+16+7+16+4=72. For each query sequence we then have to test all possible alignments with the multiple alignment table and take the largest total score.

If the table contained a large and unbiased sample of sequences, the inventory would give a correct picture of the potential distribution of residues at each position. By the same argument, if the sample were small, the derived pattern would be unlikely to reflect the complete repertoire.
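The inventory-as-score idea can be sketched in a few lines of Python. This is only an illustration: the alignment below is a hypothetical toy example (it is not the table discussed in the text), the function names are made up for this sketch, and only ungapped placements of the profile along the query are tried.

```python
from collections import Counter

# Hypothetical toy alignment (NOT the table from the text): 4 sequences, 3 columns.
ALIGNMENT = ["VDF", "VDF", "VEF", "IDY"]

def column_inventories(alignment):
    """Count how often each residue occurs in each alignment column."""
    return [Counter(column) for column in zip(*alignment)]

def window_score(inventories, window):
    """Sum the inventory counts of the window's residues, column by column."""
    return sum(inv[res] for inv, res in zip(inventories, window))

def best_ungapped_score(inventories, query):
    """Slide the profile along the query; return (best score, offset)."""
    width = len(inventories)
    return max(
        (window_score(inventories, query[i:i + width]), i)
        for i in range(len(query) - width + 1)
    )

inventories = column_inventories(ALIGNMENT)
print(window_score(inventories, "VDF"))             # 3 + 3 + 3 = 9
print(best_ungapped_score(inventories, "AAVDFAA"))  # (9, 2)
```

A residue that never occurs in a column scores zero here, which is exactly the weakness of a small sample noted above.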


How to make the inventory general?
Let a_1, a_2, a_3, ..., a_20 be the amino acid distribution at a given residue position, stored in a 20-membered array. A better scoring scheme would evaluate any amino acid according to its chance of being substituted for one of the observed amino acids. If D(i,j) is the amino acid substitution matrix (PAM250 or BLOSUM62), then amino acid i scores

a_1 D(i,1) + a_2 D(i,2) + ... + a_20 D(i,20)

This scheme distributes the score among the observed amino acids, weighted according to the substitution probability.
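A minimal sketch of this weighted score follows. To keep it short it assumes a hypothetical three-letter alphabet and an illustrative substitution table; a real profile would use all 20 amino acids with PAM250 or BLOSUM62. The names `subst` and `weighted_score` are invented for this example.

```python
# Hypothetical 3-letter alphabet and illustrative substitution values.
# A real profile would use all 20 amino acids with PAM250 or BLOSUM62.
ALPHABET = "STW"
D = {
    ("S", "S"): 4, ("S", "T"): 1, ("S", "W"): -3,
    ("T", "T"): 5, ("T", "W"): -2,
    ("W", "W"): 11,
}

def subst(i, j):
    """Look up D(i, j), treating the matrix as symmetric."""
    return D[(i, j)] if (i, j) in D else D[(j, i)]

def weighted_score(inventory, residue):
    """Return the sum over j of a_j * D(residue, j) for one alignment column."""
    return sum(count * subst(residue, j) for j, count in inventory.items())

# Hypothetical column inventory: S seen 7 times, W seen 3 times, T never seen.
inventory = {"S": 7, "W": 3}
for residue in ALPHABET:
    print(residue, weighted_score(inventory, residue))
# S 19
# T 1
# W 12
```

Note that T, which never appears in this column, still receives a small positive score because it substitutes readily for the common residue S.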

An amino acid in the query sequence scores highly either if it appears frequently in the inventory at this position or if it has a high probability of arising by mutation from residue types that are common at this position. A good approach is to use, as the amino acid distribution, a combination of the observed inventory and a general background level of amino acid composition. The result is a set of probability scores for each amino acid (or gap) at each position of the alignment, called a position-specific scoring matrix (PSSM).

The PROSITE database is another example of the use of profiles. It groups proteins, on the basis of similarities in their sequences, into a limited number of families. This works because proteins belonging to a particular family generally share functional attributes and are derived from a common ancestor.

Database searches with scoring matrix like PSSM
The PSSM is a scoring matrix. It represents an alignment of sequence patterns of the same length, without gaps. It is constructed by a simple logarithmic transformation of a matrix giving the frequency of each amino acid at each position of the motif. If the set of sequences containing the motif is large and reasonably diverse, those sequences represent a good statistical sample of all sequences that are likely to contain the same motif.


Concept of pseudocounts in PSSM
If the data set is small then, unless the motif has almost identical amino acids in each column, the column frequencies may not be representative of all other occurrences of the motif. In such cases, adding extra amino acid counts (pseudocounts) broadens the evolutionary reach of the profile. Pseudocounts are added based on previous variations seen in the aligned sequences.

Expression for the frequencies in PSSM
The probability p_ca that amino acid a is in column c in all occurrences of the blocks is

p_ca = (n_ca + b_ca) / (N_c + B_c)

where n_ca = real counts of amino acid a in column c, b_ca = pseudocounts of amino acid a in column c, N_c = total number of real counts in column c, and B_c = total number of pseudocounts in column c.

The matrix
The log odds ratio is calculated as before. Each column corresponds to a position in the motif and each row to an amino acid. As a sequence is searched with the PSSM, the value of the first amino acid in the sequence is looked up in the first column of the PSSM, then the value of the second amino acid in the second column, and so on until the length scanned equals the motif width represented by the matrix.

A different kind of matrix
Substitutions of the same amino acid may be scored differently within the matrix, depending on position. Amino acids in highly conserved positions score higher than those in weakly conserved positions. The matrix is used to score the next BLAST search, after which it is refined again.
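The pseudocount formula and the column-by-column lookup can be put together in a short sketch. Several simplifying assumptions are made here and are not taken from the text: the background distribution is uniform, the pseudocounts are split in proportion to that background (rather than derived from observed substitution patterns), the total pseudocount per column is 1, and the motif instances are a hypothetical toy set.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background frequencies, for illustration only.
BACKGROUND = {a: 1 / len(ALPHABET) for a in ALPHABET}

def build_pssm(motifs, total_pseudocount=1.0):
    """Return one {residue: log-odds score} dict per motif column."""
    pssm = []
    for column in zip(*motifs):
        counts = Counter(column)        # n_ca for every residue a
        n_c = len(column)               # N_c: total real counts in the column
        b_c = total_pseudocount         # B_c: total pseudocounts in the column
        scores = {}
        for a in ALPHABET:
            b_ca = b_c * BACKGROUND[a]  # assumed background-proportional share
            p_ca = (counts[a] + b_ca) / (n_c + b_c)
            scores[a] = math.log2(p_ca / BACKGROUND[a])
        pssm.append(scores)
    return pssm

def scan(pssm, sequence):
    """Score every ungapped window of the sequence; return (best score, offset)."""
    width = len(pssm)
    return max(
        (sum(col[res] for col, res in zip(pssm, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    )

motifs = ["VDF", "VDF", "VEF", "IDY"]  # hypothetical ungapped motif instances
pssm = build_pssm(motifs)
score, offset = scan(pssm, "AAVDFAA")
print(offset, round(score, 2))
```

Because of the pseudocounts, residues never seen in a column get a finite negative score instead of minus infinity, so a single mismatch does not rule out a candidate match.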


Many homologous proteins share only limited sequence identity. Such proteins may adopt the same three-dimensional structure, yet show no apparent similarity in pairwise alignments. For those cases there is position-specific iterated BLAST (PSI-BLAST/c-BLAST), a specialized kind of BLAST that looks deeper into the database to find distantly related proteins that match your protein of interest.



