Q3.
(a) In about 200 words, summarize the BLOSUM family of matrices. What is the
purpose of these matrices? How are they computed?
Ans. BLOSUM matrices were first introduced in a paper by Steven Henikoff and
Jorja Henikoff. They scanned the BLOCKS database for very conserved regions
of protein families (that do not have gaps in the sequence alignment) and then
counted the relative frequencies of amino acids and their substitution
probabilities.
BLOSUM (Blocks Substitution Matrix) matrices are used to score alignments
between evolutionarily divergent protein sequences. They are based on local
alignments. BLOSUM matrices are based on an implicit model of evolution.
To calculate a BLOSUM matrix, the following log-odds equation is used:
S_{ij} = \frac{1}{\lambda} \log\left( \frac{p_{ij}}{q_i q_j} \right)
Here, p_{ij} is the probability of two amino acids i and j replacing each other in a
homologous sequence, and q_i and q_j are the background probabilities of
finding the amino acids i and j in any protein sequence. The factor \lambda is a
scaling factor, set such that the matrix contains easily computable integer
values.
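As a minimal sketch of this calculation, assuming the usual log-odds form S_ij = (1/lambda) * log2(p_ij / (q_i * q_j)) with rounding to the nearest integer: the function name `log_odds_score`, the base-2 logarithm, the default lambda = 0.5 and the frequencies in the usage lines are all illustrative assumptions, not values taken from the BLOCKS database.

```python
import math


def log_odds_score(p_ij, q_i, q_j, lam=0.5):
    """Rounded log-odds score (1/lam) * log2(p_ij / (q_i * q_j)).

    p_ij     : observed probability of the pair (i, j) in aligned blocks
    q_i, q_j : background probabilities of amino acids i and j
    lam      : scaling factor (0.5 is an assumed value for illustration)
    """
    ratio = p_ij / (q_i * q_j)  # observed vs. expected-by-chance frequency
    return round(math.log2(ratio) / lam)


# Toy numbers only: a pair seen 4x more often than chance scores positive,
# a pair seen half as often as chance scores negative.
print(log_odds_score(0.25, 0.25, 0.25))     # 4
print(log_odds_score(0.03125, 0.25, 0.25))  # -2
```

A positive score means the pair occurs more often in conserved blocks than chance would predict; a negative score means it occurs less often.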
(b) In about 200 words, summarize the following: the ideal way to score a
multiple alignment, the Sum-of-Pairs (SP) method for scoring multiple
alignments and the minimum entropy method for scoring multiple alignments.
Give one or two shortcomings of each of the latter approaches.
Ans. The scoring of a multiple sequence alignment (MSA) is based on the sum of the
scores of all possible pairs of sequences in the multiple alignment according to some
scoring matrix:
Score(MSA) = \sum_{A < B} score(A, B)
where score(A, B) is the pairwise alignment score of sequences A and B.
This is the sum-of-pairs (SP) score. It is defined on columns: the score of a
column m_i is the sum of all pairwise scores of the symbols in that column,
SP(m_i) = \sum_{k < l} p(m_i^k, m_i^l)
where m_i^k is the symbol of sequence k in column i and p(a, b) is the
pairwise score for symbols a and b. In this way we draw conclusions about the multiple
alignment by looking at the pairwise alignments.
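The column-wise SP score can be sketched as follows. This is only an illustration: the helper names and the toy scoring function (+1 match, -1 mismatch, -2 symbol against gap, 0 gap against gap) are assumptions, and in practice p(a, b) would come from a substitution matrix.

```python
from itertools import combinations


def toy_pair_score(a, b):
    """Hypothetical pairwise score p(a, b), used only for illustration."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sp_column_score(column, pair_score=toy_pair_score):
    """Sum of p(a, b) over all unordered pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows, pair_score=toy_pair_score):
    """SP score of an alignment given as equal-length rows."""
    return sum(sp_column_score(col, pair_score) for col in zip(*rows))


alignment = ["AC-", "ACG", "A-G"]
print(sp_alignment_score(alignment))  # 3 + (-3) + (-3) = -3
```

Because every pair of rows contributes to every column, the number of terms grows quadratically with the number of sequences.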
Scoring an Alignment using Minimum Entropy:
• basic idea: try to minimize the entropy of each column
• another way of thinking about it: columns that can be
communicated using few bits are good
• information theory tells us that an optimal code
uses -\log_2 p bits to encode a message of probability p
• the messages in this case are the characters in a given column
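• the entropy of a column m_i is given by: H(m_i) = -\sum_a c_{ia} \log_2 p_{ia},
where c_{ia} is the count of character a in column i and p_{ia} is the probability of character a in column i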
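A minimal sketch of the column entropy, assuming the count-weighted form -sum_a c_a * log2(p_a) and estimating p_a as the observed frequency c_a / N in the column (a simplifying assumption; real methods may use pseudocounts). The function name `column_entropy` is illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Entropy of one alignment column: -sum_a c_a * log2(c_a / N)."""
    counts = Counter(column)  # c_a for each character a in the column
    total = len(column)       # N, the number of sequences
    return -sum(c * math.log2(c / total) for c in counts.values())


# A fully conserved column costs 0 bits; a fully mixed one costs the most.
print(column_entropy("AAAA"))  # 0 bits (printed as -0.0)
print(column_entropy("AACC"))  # 4.0
print(column_entropy("ACGT"))  # 8.0
```

Lower values indicate better-conserved columns, so an alignment with a smaller total over its columns is preferred under this score.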
Subproblems:
As we can see, there are n iterations over u and m iterations over v in total,
so the complexity is O(nm).
F[0][0]=0
for i = 1..n: F[i][0]=0
for j = 1..m: F[0][j]=-j*d, P[0][j]=L
for i = 1..n, j = 1..m:
F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i-1][j]-d, F[i][j-1]-d }
P[i][j] = D, T or L according to which of the three expressions above is the maximum
Once we have computed F and P, we find the largest value in the rightmost column of
the matrix F. Let F[i0][m] be that largest value. We start traceback at F[i0][m] and
continue traceback until we hit the first column of the matrix. The alignment constructed
in this way is the solution.
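The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names, the tie-breaking order (D, then T, then L) and the +1/-1 scoring function are hypothetical choices; s and d are the substitution score and the gap penalty from the pseudocode.

```python
def match_mismatch(a, b):
    """Hypothetical substitution score s(a, b): +1 match, -1 mismatch."""
    return 1 if a == b else -1


def align_y_within_x(X, Y, s=match_mismatch, d=2):
    """Fill F and P as in the pseudocode, then trace back from the best
    cell in the rightmost column until the first column is reached."""
    n, m = len(X), len(Y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # F[i][0] = 0 for all i
    P = [[None] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        F[0][j] = -j * d
        P[0][j] = "L"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + s(X[i - 1], Y[j - 1]), "D"),  # diagonal
                (F[i - 1][j] - d, "T"),                          # from top
                (F[i][j - 1] - d, "L"),                          # from left
            ]
            # On ties the first candidate wins (an arbitrary choice).
            F[i][j], P[i][j] = max(candidates, key=lambda c: c[0])

    # Largest value in the rightmost column: F[i0][m].
    i = max(range(n + 1), key=lambda r: F[r][m])
    j = m
    score = F[i][j]
    aligned_x, aligned_y = [], []
    while j > 0:  # stop once the first column is reached
        move = P[i][j]
        if move == "D":
            aligned_x.append(X[i - 1])
            aligned_y.append(Y[j - 1])
            i, j = i - 1, j - 1
        elif move == "T":
            aligned_x.append(X[i - 1])
            aligned_y.append("-")
            i -= 1
        else:  # "L"
            aligned_x.append("-")
            aligned_y.append(Y[j - 1])
            j -= 1
    return score, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


print(align_y_within_x("TTACGTT", "ACG"))  # (3, 'ACG', 'ACG')
```

Because F[i][0] = 0 and the traceback starts anywhere in the last column, characters of X before and after the aligned region cost nothing, so all of Y is aligned against the best-matching part of X.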