
Assignment

Q3.
(a) In about 200 words, summarize the BLOSUM family of matrices. What is the
purpose of these matrices? How are they computed?
Ans. BLOSUM matrices were first introduced in a paper by Steven Henikoff and
Jorja Henikoff. They scanned the BLOCKS database for very conserved regions
of protein families (that do not have gaps in the sequence alignment) and then
counted the relative frequencies of amino acids and their substitution
probabilities.
BLOSUM (Blocks Substitution Matrix) matrices are used to score alignments
between evolutionarily divergent protein sequences. They are based on local
alignments. BLOSUM matrices are based on an implicit model of evolution.
To calculate a BLOSUM matrix, the following equation is used:

S_ij = (1/lambda) * log( p_ij / (q_i * q_j) )

Here, S_ij is the score for aligning amino acids i and j, p_ij is the
probability of the two amino acids i and j replacing each other in a
homologous sequence, and q_i and q_j are the background probabilities of
finding the amino acids i and j in any protein sequence. The factor lambda is a
scaling factor, set such that the matrix contains easily computable integer
values after rounding.
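The log-odds computation can be illustrated with a minimal Python sketch. It assumes base-2 logarithms and a scale of 1/lambda = 2 (half-bit units); the function name and the two-letter toy probabilities are hypothetical and are not taken from the BLOCKS database.

```python
import math


def log_odds_scores(p, scale=2.0):
    """Return integer scores round(scale * log2(p_ij / (q_i * q_j))).

    p maps ordered pairs (i, j) to joint probabilities; it is assumed to be
    symmetric and to sum to 1. scale plays the role of 1/lambda (assumed: 2).
    """
    # The background probability q_i is the marginal of the joint distribution.
    q = {}
    for (i, _j), prob in p.items():
        q[i] = q.get(i, 0.0) + prob
    return {
        (i, j): round(scale * math.log2(prob / (q[i] * q[j])))
        for (i, j), prob in p.items()
    }


# Toy joint distribution over a two-letter alphabet (made-up numbers).
toy_p = {
    ("A", "A"): 0.4, ("A", "B"): 0.1,
    ("B", "A"): 0.1, ("B", "B"): 0.4,
}
scores = log_odds_scores(toy_p)
print(scores[("A", "A")], scores[("A", "B")])  # 1 -3
```

Pairs that occur more often than expected by chance get a positive score, and pairs that occur less often get a negative one.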

(b) In about 200 words, summarize the following: the ideal way to score a
multiple alignment, the Sum-of-Pairs (SP) method for scoring multiple
alignments and the minimum entropy method for scoring multiple alignments.
Give one or two shortcomings of each of the latter approaches.
Ans. The Sum-of-Pairs (SP) method scores a multiple sequence alignment (MSA)
as the sum of the scores of all possible pairs of sequences in the multiple
alignment, according to some scoring matrix:

SP(m) = sum over all pairs k < l of score(A_k, A_l)

where score(A, B) is the pairwise alignment score of sequences A and B.
Equivalently, the SP score is defined on columns: the score of a column M_i is
the sum of all pairwise scores of the symbols in that column,

S(M_i) = sum over all pairs k < l of p(M_i^k, M_i^l)

where p(a, b) is the pairwise score for symbols a and b, and M_i^k is the
symbol of sequence k in column i. The idea behind this is that we often draw
conclusions about a multiple alignment by looking at its pairwise alignments.
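As a rough illustration of the column-wise SP score, here is a short Python sketch. The pairwise score function (+1 match, -1 mismatch or residue against gap, 0 for gap against gap) and the three-sequence alignment are assumptions made up for the example.

```python
from itertools import combinations


def sp_column_score(column, p):
    """Sum p(a, b) over all unordered pairs of symbols in one column."""
    return sum(p(a, b) for a, b in combinations(column, 2))


def sp_score(alignment, p):
    """Sum the column scores over all columns of an equal-length alignment."""
    return sum(sp_column_score(col, p) for col in zip(*alignment))


def toy_pair_score(a, b):
    # Hypothetical scoring: gap-gap is 0, match is +1, everything else is -1.
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


alignment = ["ACG", "A-G", "ATG"]
total = sp_score(alignment, toy_pair_score)
print(total)  # 3 + (-3) + 3 = 3
```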
Scoring an alignment using minimum entropy:
• basic idea: try to minimize the entropy of each column
• another way of thinking about it: columns that can be
communicated using few bits are good
• information theory tells us that an optimal code
uses -log2(p) bits to encode a message of probability p
• the messages in this case are the characters in a given column
• the entropy of a column is given by:

S(M_i) = - sum over characters a of c_ia * log2(p_ia)

where
• M_i = the i-th column of an alignment m
• c_ia = count of character a in column i
• p_ia = probability of character a in column i
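The column entropy can be sketched in a few lines of Python. This sketch assumes that the probability of character a in column i is estimated by its relative frequency in that column (count divided by column height); other estimates, such as ones using pseudocounts, are possible. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Return -sum_a c_a * log2(p_a), with p_a estimated as c_a / len(column)."""
    counts = Counter(column)
    total = len(column)
    return -sum(c * math.log2(c / total) for c in counts.values())


def alignment_entropy(alignment):
    """Sum the column entropies over all columns; lower is better."""
    return sum(column_entropy(col) for col in zip(*alignment))


# A perfectly conserved column costs 0 bits; a 50/50 column of height 4 costs 4.
print(column_entropy("AACC"))  # 4.0
```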
Q4. In both the questions below we are given a sequence X of length n and
another sequence Y of length m. As a scoring model, we use a substitution
matrix s and linear gap penalties with parameter d, i.e. gamma(g)=-d*g.
(a) Global alignment with proscribed match.
We want to compute the score of the highest scoring global alignment
between X and Y among all alignments that satisfy the following extra
constraint: We are given two symbols u and v, and the alignment should never
align u with v. Give an O(nm) time algorithm to solve this problem. If you use a
dynamic programming approach, it suffices to give the equations that are
needed to compute the dynamic programming matrices.
Sol. We use a dynamic programming approach. We are given the symbols u and v,
and the global alignment between X and Y must never align u with v, so we
first need to know the special positions: the cells (i, j) where x_i = u and
y_j = v. In those cells the diagonal (match) move is not allowed.
We apply dynamic programming when:
• There is only a polynomial number of subproblems
– Align x1…xi to y1…yj
• The original problem is one of the subproblems
– Align x1…xn to y1…ym
• Each subproblem is easily solved from smaller subproblems
The matrix has one row for each of the n positions of X and one column for
each of the m positions of Y (plus the initial row and column), and each cell
is computed in constant time, so the complexity is O(nm).
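The constraint can be handled inside the usual global alignment recurrence. The Python sketch below is one way to write it, assuming that "never align u with v" means the diagonal move is forbidden whenever the symbol from X is u and the symbol from Y is v. The function name and the toy substitution function are hypothetical.

```python
NEG_INF = float("-inf")


def global_score_proscribed(x, y, s, d, u, v):
    """Best global alignment score of x and y that never aligns u (in x) with v (in y).

    s(a, b) is the substitution score and d the linear gap penalty.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = x[i - 1], y[j - 1]
            # Forbid the diagonal move for the proscribed pair (assumption above).
            diag = NEG_INF if (a == u and b == v) else F[i - 1][j - 1] + s(a, b)
            F[i][j] = max(diag, F[i - 1][j] - d, F[i][j - 1] - d)
    return F[n][m]


def toy_s(a, b):
    # Hypothetical substitution scores: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


print(global_score_proscribed("A", "C", toy_s, 2, "A", "C"))  # -4 (two gaps)
print(global_score_proscribed("A", "C", toy_s, 2, "G", "T"))  # -1 (mismatch allowed)
```

Gap moves are always available, so a valid alignment always exists and the result is finite. If the constraint is meant to be symmetric, the test on (a, b) would also need to check the pair in the other order.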

(c) Aligning Y to a substring of X


We want to compute the score of a highest scoring alignment of Y to any
substring of X. In other words, the output is the score of an alignment of a
substring X' of X with the string Y, such that the score of the alignment is the
largest possible (among all choices of X'). Give an efficient dynamic
programming algorithm that solves this problem optimally in polynomial time.
It suffices to give the equations that are needed to compute the dynamic
programming matrix. What is the running-time of your algorithm?
Sol.
Let X_i be the prefix of X of length i, and let Y_j denote the prefix of Y of length j. We
compute a matrix F such that F[i][j] is the best score of an alignment of any suffix of X_i
and the string Y_j. We also compute a traceback matrix P. The computation of F and P
can be done in O(nm) time using the following equations:

F[0][0] = 0
for i = 1..n: F[i][0] = 0
for j = 1..m: F[0][j] = -j*d, P[0][j] = L
for i = 1..n, j = 1..m:
    F[i][j] = max{ F[i-1][j-1]+s(X[i-1],Y[j-1]), F[i-1][j]-d, F[i][j-1]-d }
    P[i][j] = D, T or L according to which of the three expressions above is the maximum

Once we have computed F and P, we find the largest value in the rightmost column of
the matrix F. Let F[i0][m] be that largest value; it is the score we are asked for. We
start traceback at F[i0][m] and continue traceback until we hit the first column of the
matrix. The alignment constructed in this way is the solution.
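The equations above can be sketched in Python as follows. Only the score is computed (the traceback matrix P is omitted), and the function name and the toy substitution function are hypothetical.

```python
def best_substring_alignment_score(x, y, s, d):
    """Score of the best alignment of all of y against some substring of x.

    F[i][j] is the best score of aligning any suffix of x[:i] with y[:j].
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # F[i][0] = 0: free start in x
    for j in range(1, m + 1):
        F[0][j] = -j * d  # y[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # Free end in x: take the largest value in the rightmost column.
    return max(F[i][m] for i in range(n + 1))


def toy_s(a, b):
    # Hypothetical substitution scores: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1


print(best_substring_alignment_score("GGACGTT", "ACG", toy_s, 2))  # 3
```

Both loops together visit each of the (n+1)(m+1) cells once, which matches the O(nm) running time stated above.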

(b) Global anchored alignment.


Assume X and Y are each at least 4 symbols long. We want to compute the
score of the highest-scoring global alignment between X and Y among all
alignments that satisfy the following extra constraint: The first two symbols of
X are always aligned with the first two symbols of Y, and the last two symbols
of X are always aligned with the last two symbols of Y. If you use a dynamic
programming approach, it suffices to give the equations that are needed to
compute the dynamic programming matrices.
The objective is to find substrings I and J that maximize the score s(I, J)
among all substrings I and J with |J| ≤ T, where T is a given upper limit on
the length of J. The objective is similar to that of the normalized local
alignment in that it aims to circumvent the undesirable mosaic and shadow
effects. Indirectly, an optimal alignment is forced to have a high normalized
score. The length of subsequence J in an optimal alignment is controlled by
the bound T. Detecting a number of important local alignments of different
horizontal lengths may require solving a series of length-restricted local
alignment (LRLA) problems with different values of T.
Formally, given a limit T , the LRLA problem
Algorithm APX-LRLA(δ, µ)
1. Run a modified Smith-Waterman algorithm.
   If the maximum score is achieved within horizontal length ≤ T, then
   return this score and exit.
2. Initialization:
   set LRLA* = 0
   set S[0,j,k] = 0 for all j, k, 0 ≤ j ≤ m, and 0 ≤ k ≤ T/∆ − 1
3. Main computations:
   for i = 1 to n do {
       set S[i,0,k] = 0 for all k, 0 ≤ k ≤ T/∆ − 1
       for j = 1 to m do {
           if (j mod ∆ = 1) then {
               set S[i,j,0] = max{0, s(x_i, y_j), S[i−1,j,0] − µ}
               set LRLA* = max{LRLA*, S[i,j,0]}
               for k = 1 to T/∆ − 1 do {
                   set S[i,j,k] = max{0, S[i−1,j,k] − µ,
                                      S[i−1,j−1,k−1] ⊕ s(x_i, y_j),
                                      S[i,j−1,k−1] − µ}
                   set LRLA* = max{LRLA*, S[i,j,k]}
               }
           }
           else {
               for k = 0 to T/∆ − 1 do {
                   set S[i,j,k] = max{0, S[i−1,j,k] − µ,
                                      S[i−1,j−1,k] ⊕ s(x_i, y_j),
                                      S[i,j−1,k] − µ}
                   set LRLA* = max{LRLA*, S[i,j,k]}
               }
           }
       }
   }
4. Return LRLA*
