Lecture 3: Sequence Alignments: Ly Le, PHD

Lecture 3: Sequence Alignments
Ly Le, PhD
School of Biotechnology
Email: ly.le@hcmiu.edu.vn
Office: Rm 705, HCM International University
Outline
• Definition
• Biological motivation
• Properties
• Algorithms
Definition of sequence alignment
• Sequence alignment is the procedure of comparing two (pairwise alignment) or
more (multiple alignment) sequences by searching for a series of individual
characters or patterns that are in the same order in the sequences.
• There are two types of alignment: LOCAL and GLOBAL.
- Global alignment attempts to align the sequences over their entire length.
- Local alignment concentrates on finding stretches of the sequences with a
high level of matches.
Global alignment (Needleman-Wunsch algorithm)
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A
Local alignment (Smith-Waterman algorithm)
- - - - - - - T G K G - - - - - - - -
Biological Motivation
• Inference of Homology
– Two genes are homologous if they share a
common evolutionary history.
– Evolutionary history can tell us a lot about
properties of a given gene
– Homology can be inferred from similarity between
the genes
• Searching for proteins with the same or similar
functions
• Predicting mutations
Why do we need local alignments?
Activities:
• To compare a short sequence to a large one.
• To compare a single sequence to an entire
database
• To compare a partial sequence to the whole.
Purposes:
• Identify newly determined sequences
• Compare new genes to known ones
• Guess functions for entire genomes full of ORFs
of unknown function
WHAT IS IMPORTANT IN THE PROTEIN SIMILARITY SEARCH?
1) Contribution (%) of identical positions
Example: PKILMECKKD vs PKILMKCKHD, 80% identical (similar); PKILMECKKD vs SDCLLDCVCL, 20% identical (not similar)
2) Length of the compared strings (sequences)
3) Distribution of the identical positions along the analyzed sequence
4) Residues at the conservative positions
5) Structural/genetic similarity of the amino acids at non-conservative positions
[Figure: example sequence pairs for criteria 2 to 5, each labelled "similar", "probably similar", "casual" or "not similar"; panel labels: "Identity only", "Structural", "Genetic"]
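Criterion 1, the percentage of identical positions, is simple to compute for two ungapped sequences of equal length. The sketch below is only an illustration: the function name `percent_identity` is an assumption, and the sequences are the ten-residue examples from the slide.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions that hold the same residue in two
    ungapped sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the columns in which both sequences carry the same character.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


print(percent_identity("PKILMECKKD", "PKILMKCKHD"))  # 80.0
print(percent_identity("PKILMECKKD", "SDCLLDCVCL"))  # 20.0
```

Percent identity alone ignores criteria 2 to 5: a 20% score over 10 residues and a 20% score over 25 residues are not equally informative.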
Genetic Code.
Identical vs. similar
Alignment algorithms
• Dot matrix analysis
• Dynamic programming algorithms
• FASTA
• BLAST
• Gapped BLAST
• PSI-BLAST
Dot matrix analysis
• Dot matrix analysis is a method for comparing two sequences to
look for possible alignments (Gibbs and McIntyre 1970).
• One sequence (A) is listed across the top of the matrix and the
other (B) is listed down the left side.
• Starting from the first character in B, one moves across the page,
keeping in the first row and placing a dot in every column where the
character in A is the same.
• The process is repeated for each following character in B until all
possible comparisons between A and B have been made.
• Any region of similarity is revealed by a diagonal row of dots.
• Isolated dots not on a diagonal represent random matches.
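The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dot plot program: the function names `dot_matrix` and `render` and the two short DNA strings are assumptions made for the example.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[list[bool]]:
    """Build the dot matrix: one row per character of seq_b (left side),
    one column per character of seq_a (top). A cell is True where the
    two characters are the same."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(matrix: list[list[bool]], seq_a: str, seq_b: str) -> str:
    """Draw the matrix as text, with '*' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, matrix):
        lines.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Hypothetical example sequences differing at one position.
a, b = "GATTACA", "GATCACA"
print(render(dot_matrix(a, b), a, b))
```

In the printed matrix the shared regions show up as runs of `*` along the main diagonal, while the other `*` characters are the isolated random matches mentioned above.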
Dot matrix analysis (cont'd)
• Detection of matching regions can be improved by filtering out
random matches, which can be achieved by using a sliding window.
• Instead of comparing single sequence positions, a window of several
adjacent positions is compared at a time, and a dot is printed only
if a certain minimum number of matches occurs within the window.
• Dot matrix analysis can also be used to find direct and inverted
repeats within sequences.
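A sliding-window filter can be sketched as follows. The parameter names `window` and `min_matches` and their default values are assumptions chosen for illustration; real dot plot programs expose similar settings under their own names.

```python
def windowed_dot_matrix(seq_a: str, seq_b: str,
                        window: int = 3, min_matches: int = 2) -> list[list[bool]]:
    """Dot matrix with a sliding-window filter.

    Cell (i, j) is True when at least `min_matches` of the `window`
    character pairs seq_b[i + k] / seq_a[j + k] are identical.
    Rows follow seq_b, columns follow seq_a."""
    rows = max(len(seq_b) - window + 1, 0)
    cols = max(len(seq_a) - window + 1, 0)
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count identical pairs along the diagonal segment starting at (i, j).
            matches = sum(1 for k in range(window) if seq_a[j + k] == seq_b[i + k])
            row.append(matches >= min_matches)
        matrix.append(row)
    return matrix
```

With `window=1` and `min_matches=1` this reduces to the plain dot matrix. Increasing the window size removes more isolated dots, at the cost of also hiding short genuine matches.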
Dot matrix analysis: two identical sequences
• Nucleic Acids Dot Plots - http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
Dot matrix analysis: two very different sequences
• Nucleic Acids Dot Plots of genes Adh1 and G6pd in the mouse
• http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
Dot matrix analysis: two similar sequences
• Nucleic Acids Dot Plots of genes Adh1 from the mouse and rat (25 MY)
Dot matrix analysis: two similar sequences; size of the sliding window increased
• Nucleic Acids Dot Plots of genes Adh1 from the mouse and rat (25 MY)
Dynamic programming algorithm for sequence alignment
• The method compares every pair of characters in the two sequences and
generates an alignment that is optimal under the chosen scoring system.
• This is a computationally demanding method. However, the latest
algorithmic improvements and ever-increasing computer capacity make it
possible to align a query sequence against a large database in a few minutes.
• Each alignment has its own score, and it is essential to recognise that several
different alignments may have nearly identical scores, which indicates that
dynamic programming methods may produce more than one optimal
alignment. Careful choice of the scoring parameters is therefore important
and may discriminate between alignments with similar scores.
• Global alignment programs are based on the Needleman-Wunsch algorithm and
local alignment programs on the Smith-Waterman algorithm. Both algorithms are
derived from the basic dynamic programming algorithm.
Description of the dynamic programming algorithm
• The alignment procedure depends upon a scoring system, which can be based
on the probability that 1) a particular amino acid pair is found in alignments of
related proteins (p_xy); 2) the same amino acid pair is aligned by chance (p_x p_y);
3) introduction of a gap would be a better choice as it increases the score.
• The ratio of the first two probabilities is usually provided in an amino acid
substitution matrix. There are many such matrices; two of them, PAM and
BLOSUM, are considered later.
• The penalties for introducing and extending a gap are chosen to go with
the matrices and represent prior knowledge and some assumptions. One of
them is quite simple: if the gap penalty is too high, a reasonable
alignment between slightly different sequences will never be achieved, but if it
is too low, an optimal alignment is hardly possible. Other assumptions are
based on sophisticated statistical procedures.
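The ratio of the two probabilities is normally stored as a log-odds score. The sketch below shows the arithmetic only. The function name, the base-2 logarithm, the scale factor of 2 and all probability values are assumptions for illustration, not entries of any published matrix.

```python
import math


def log_odds_score(p_xy: float, p_x: float, p_y: float, scale: float = 2.0) -> int:
    """Rounded, scaled log2 of the odds ratio p_xy / (p_x * p_y).

    p_xy: probability of seeing the pair x/y in alignments of related proteins
    p_x, p_y: background probabilities of x and y (pair aligned by chance)
    scale: multiplier applied before rounding (an assumption; matrices differ)
    """
    return round(scale * math.log2(p_xy / (p_x * p_y)))


# Hypothetical probabilities, chosen only to show the sign of the score.
print(log_odds_score(0.04, 0.1, 0.1))    # pair seen more often than chance: positive
print(log_odds_score(0.01, 0.1, 0.1))    # pair seen as often as chance: zero
print(log_odds_score(0.0025, 0.1, 0.1))  # pair seen less often than chance: negative
```

Because the scores are logarithms, the scores of the individual aligned pairs can simply be added to obtain the score of the whole alignment.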
Description of the dynamic programming algorithm
• Consider building this alignment in steps, starting from the initial match
(V/V) and then sequentially adding a new pair until the alignment is complete,
at each stage choosing a pair from all the possible matches that provides the
highest score for the alignment up to that point.
• If the full alignment has the highest possible (or optimal) score, then the old
alignment from which it was derived (A) by addition of the aligned Y/Y pair
must also have been optimal up to that point in the alignment.
• In this manner, the alignment can be traced back to the first aligned pair that
was also an optimal alignment.
• The example, which we have considered, illustrates 3 choices: 1. Match the
next character(s) in the following position(s); 2. Match the next character(s) to
a gap in the upper sequence; 3. Add a gap in the lower sequence.
Formal description of dynamic programming algorithm
[Diagram: the moves that can lead into cell (i, j) of the scoring matrix]
- diagonal move from (i-1, j-1): $S_{i-1,j-1} + s(a_i, b_j)$
- vertical move from (i-x, j), a gap of length x: $S_{i-x,j} - w_x$
- horizontal move from (i, j-y), a gap of length y: $S_{i,j-y} - w_y$
• This diagram indicates the moves that are possible to reach a certain position (i, j): starting from
the previous row and column at position (i-1, j-1), or from any position in the same row or column.
• A diagonal move carries no gap penalty; a move from any other position in column j or row i carries
a gap penalty that depends on the size of the gap.
Formal description of dynamic programming algorithm
For two sequences $a = a_1, a_2, \ldots, a_i$ and $b = b_1, b_2, \ldots, b_j$, where $S_{ij} = S(a_1 \ldots a_i, b_1 \ldots b_j)$, then

$$
S_{ij} = \max \left\{
\begin{array}{l}
S_{i-1,\,j-1} + s(a_i, b_j), \\
\max_{x \ge 1} \left( S_{i-x,\,j} - w_x \right), \\
\max_{y \ge 1} \left( S_{i,\,j-y} - w_y \right)
\end{array}
\right.
$$

where $S_{ij}$ is the score at position i in sequence a and position j in sequence b, $s(a_i, b_j)$ is the score
for aligning the characters at positions i and j, $w_x$ is the penalty for a gap of length x in
sequence a, and $w_y$ is the penalty for a gap of length y in sequence b.
Note that $S_{ij}$ is a type of running best score as the algorithm moves through every
position in the matrix.
Alignment A: a1 a2 a3 a4
             b1 b2 b3 b4
Alignment B: a1 a2 a3 a4 -
             b1 -  b2 b3 b4
The highest-scoring matrix position is located (in this case S44) and then
traced back as far as possible, generating the path shown.
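The general recurrence, with a gap penalty that may depend on the gap length in any way, can be written down almost literally. The sketch below is an illustration under stated assumptions: the function name `general_gap_score`, the global-alignment boundary values S(i, 0) = -w(i) and S(0, j) = -w(j), and the example scoring functions are not taken from the slides. It returns only the score and performs no traceback.

```python
from typing import Callable


def general_gap_score(a: str, b: str,
                      s: Callable[[str, str], int],
                      w: Callable[[int], int]) -> int:
    """Score of the best global alignment of a and b.

    s(x, y) is the score for aligning characters x and y;
    w(k) is the (positive) penalty for a gap of length k."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary (assumption): a prefix aligned against nothing is one long gap.
    for i in range(1, n + 1):
        S[i][0] = -w(i)
    for j in range(1, m + 1):
        S[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = S[i - 1][j - 1] + s(a[i - 1], b[j - 1])
            # Best move within column j, skipping x characters of a.
            gap_x = max(S[i - x][j] - w(x) for x in range(1, i + 1))
            # Best move within row i, skipping y characters of b.
            gap_y = max(S[i][j - y] - w(y) for y in range(1, j + 1))
            S[i][j] = max(diagonal, gap_x, gap_y)
    return S[n][m]


def match_score(x: str, y: str) -> int:
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


print(general_gap_score("ABC", "BBC", match_score, lambda k: 2 * k))
```

Scanning all gap lengths in every cell makes this version cubic in the sequence length. When the penalty is a fixed amount per gap position, only the three neighbouring cells need to be examined.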
Dynamic Algorithms – Global Alignment
Example: match = 1, mismatch = -1, gap = -2
• ABC
• - BB
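Scoring matrix (columns: A B C; rows: B B C; the first row and column hold the cumulative gap penalties):
          A    B    C
     0   -2   -4   -6
B   -2   -1   -1   -3
B   -4   -3    0   -2
C   -6   -5   -2    1
The optimal global alignment score, 1, is in the bottom-right cell, where the traceback starts.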
Rules
• V(x, y) is the value of the optimal global alignment
that ends at positions x, y
• Base condition:
– V(0, 0) = 0
– V(x, 0) = x × gap penalty and V(0, y) = y × gap penalty
• Recursive condition:
– V(x, y) = max of
• V(x-1, y-1) + s(Seq1(x), Seq2(y))
…match or mismatch
• V(x-1, y) + gap penalty
• V(x, y-1) + gap penalty
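These rules translate directly into code. The following is a minimal sketch of Needleman-Wunsch global alignment with a linear gap penalty, using the example values match = 1, mismatch = -1, gap = -2 as defaults. The function name and the tie-breaking order in the traceback (diagonal first) are assumptions; other optimal alignments with the same score may exist.

```python
def needleman_wunsch(seq1: str, seq2: str,
                     match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_seq1, aligned_seq2) for a global alignment."""
    n, m = len(seq1), len(seq2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned against nothing costs one gap per character.
    for x in range(1, n + 1):
        V[x][0] = x * gap
    for y in range(1, m + 1):
        V[0][y] = y * gap
    # Recursive condition.
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            s = match if seq1[x - 1] == seq2[y - 1] else mismatch
            V[x][y] = max(V[x - 1][y - 1] + s,  # match or mismatch
                          V[x - 1][y] + gap,    # gap in seq2
                          V[x][y - 1] + gap)    # gap in seq1
    # Traceback from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    x, y = n, m
    while x > 0 or y > 0:
        s = match if x > 0 and y > 0 and seq1[x - 1] == seq2[y - 1] else mismatch
        if x > 0 and y > 0 and V[x][y] == V[x - 1][y - 1] + s:
            top.append(seq1[x - 1]); bottom.append(seq2[y - 1])
            x, y = x - 1, y - 1
        elif x > 0 and V[x][y] == V[x - 1][y] + gap:
            top.append(seq1[x - 1]); bottom.append("-")
            x -= 1
        else:
            top.append("-"); bottom.append(seq2[y - 1])
            y -= 1
    return V[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ABC", "BBC"))  # (1, 'ABC', 'BBC')
```

For ABC and BBC the table filled by this function contains the same values as the matrix in the example (transposed, because here seq1 indexes the rows).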
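Dynamic Algorithms – Local Alignment
Example:
• match = 1, mismatch = -1, gap = -2
Scoring matrix (columns: A B C; rows: B B C):
          A    B    C
     0    0    0    0
B    0    0    1    0
B    0    0    1    0
C    0    0    0    2
The highest score, 2, ends the local alignment BC / BC.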
Local Alignment Dynamic Algorithm
• V(x, y) is the value of the optimal local alignment
that ends at positions x, y
• Base condition:
– V(x, 0) = V(0, y) = 0
• Recursive condition:
– V(x, y) = max of
• 0
• V(x-1, y-1) + s(Seq1(x), Seq2(y))
…match or mismatch
• V(x-1, y) + gap penalty
• V(x, y-1) + gap penalty
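The local version differs from the global one in only three places: the borders are zero, every cell is floored at 0, and the traceback starts at the highest-scoring cell and stops at the first 0. A minimal sketch follows, with the same example scores as defaults. The function name and the tie-breaking choices (first best cell found, diagonal move preferred) are assumptions.

```python
def smith_waterman(seq1: str, seq2: str,
                   match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_part_of_seq1, aligned_part_of_seq2)
    for one best local alignment."""
    n, m = len(seq1), len(seq2)
    # Base condition: V(x, 0) = V(0, y) = 0.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            s = match if seq1[x - 1] == seq2[y - 1] else mismatch
            V[x][y] = max(0,                    # start a new local alignment
                          V[x - 1][y - 1] + s,  # match or mismatch
                          V[x - 1][y] + gap,    # gap in seq2
                          V[x][y - 1] + gap)    # gap in seq1
            if V[x][y] > best:
                best, best_pos = V[x][y], (x, y)
    # Traceback from the best cell until a cell with value 0 is reached.
    top, bottom = [], []
    x, y = best_pos
    while x > 0 and y > 0 and V[x][y] > 0:
        s = match if seq1[x - 1] == seq2[y - 1] else mismatch
        if V[x][y] == V[x - 1][y - 1] + s:
            top.append(seq1[x - 1]); bottom.append(seq2[y - 1])
            x, y = x - 1, y - 1
        elif V[x][y] == V[x - 1][y] + gap:
            top.append(seq1[x - 1]); bottom.append("-")
            x -= 1
        else:
            top.append("-"); bottom.append(seq2[y - 1])
            y -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("ABC", "BBC"))  # (2, 'BC', 'BC')
```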
Scoring matrices
• It is critical to have reasonable scoring schemes, accepted by the scientific
community, for DNA and proteins and for different types of alignments.
• The wealth of information accumulated in the gene and protein databases was
used together with the dynamic programming procedure to create such matrices for
scoring matches and, separately, penalties for gap introduction and extension.
• Matrices for DNA are all rather similar, as there are only two distinctions to
score: purine vs. pyrimidine and match vs. mismatch.
• Proteins are much more complex and the number of options is significant.
• PAM and other matrices are expressed as log-odds scores: the logarithm of the
ratio of the chance that an amino acid substitution occurs for an essential
biological reason to the chance that it occurs at random.
• There are many different PAM matrices, representing different
evolutionary scenarios. PAM250 represents a level of 250% change
(250 accepted mutations per 100 residues), expected over about 2500 MY.
• PAM is more suitable for studying quite distant proteins; BLOSUM is for more
conserved proteins or domains.
Scoring matrices: PAM (Percent Accepted Mutation)
Amino acids are grouped according to the chemistry of the side group: (C) sulfhydryl; (STPAG) small
hydrophilic; (NDEQ) acid, acid amide and hydrophilic; (HRK) basic; (MILV) small hydrophobic; and (FYW)
aromatic. Log-odds values: a positive value such as +10 means that the pair is more likely to derive from a
common ancestor than from chance, 0 means that the two probabilities are equal, and a negative value such
as -4 means that the pair is more likely a chance occurrence. Thus the score of the alignment YY/YY is
10 + 10 = 20, whereas that of YY/TP is -3 - 5 = -8, a rare and unexpected substitution between homologous sequences.
Scoring matrices: BLOSUM62
(BLOcks amino acid SUbstitution Matrices)
The idea behind BLOSUM is similar, but the matrices are calculated from a very different and much larger set of
proteins, which are much more similar to one another and form blocks of proteins with a similar pattern.
Scoring a sequence alignment with a gap penalty
Sequence 1   V   D   S   -   C   Y
Sequence 2   V   E   S   L   C   Y
Score        4   2   4  -11  9   7
Score = sum of amino acid pair scores (26) minus single gap penalty (11) = 15
As two sequences may differ, non-identical amino acids are likely to be
placed in corresponding positions. To optimise the alignment, gap(s) may
be introduced, reflecting deletions or insertions that occurred in the
sequences in the past. Introducing gaps incurs penalties. Scores gained
by each match are not always the same; for instance, a match between two
rare amino acids scores more than a match between two common ones.
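The arithmetic of this example can be checked with a few lines of code. The sketch below uses only the five pair scores and the gap penalty of 11 shown on the slide. The names `PAIR_SCORES`, `GAP_PENALTY` and `score_alignment` are assumptions, and each gap column is charged the full penalty, which matches the slide only because its gap is one residue long.

```python
# Pair scores and gap penalty as shown in the worked example.
PAIR_SCORES = {("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
               ("C", "C"): 9, ("Y", "Y"): 7}
GAP_PENALTY = 11


def score_alignment(row1: str, row2: str,
                    pair_scores: dict = PAIR_SCORES,
                    gap_penalty: int = GAP_PENALTY) -> int:
    """Sum the pair scores column by column, subtracting the gap penalty
    for every column that contains a gap character '-'."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total -= gap_penalty
        elif (a, b) in pair_scores:
            total += pair_scores[(a, b)]
        else:
            total += pair_scores[(b, a)]  # substitution scores are symmetric
    return total


print(score_alignment("VDS-CY", "VESLCY"))  # 15
```

Scoring the shorter prefixes `VDS-C` / `VESLC` and `VDS-` / `VESL` gives 8 and -1.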
Derivation of the dynamic programming algorithm
1. Score of new alignment = Score of previous alignment (A) + Score of new aligned pair
   V D S - C Y       V D S - C       Y
   V E S L C Y       V E S L C       Y
       15        =       8       +   7
2. Score of alignment (A) = Score of previous alignment (B) + Score of new aligned pair
   V D S - C         V D S -         C
   V E S L C         V E S L         C
        8        =      -1       +   9
3. Repeat removing aligned pairs until the end of the alignments is reached
Old Homework#2
PROTEIN
1) What are the name and function of the protein with
UniProt accession number P28845?
2) Run a BLAST search and collect all of its homologous
sequences.
Gene and Genome
Obtain the human ACBP mRNA sequence with NCBI Accession #
M15887 and do a MegaBLAST search against the human genome
(all assemblies) database (use the "genome" database). Ignore the
hits to “alternate assemblies”.
1. Which chromosomes gave BLAST hits in your search? Which hits are
significant (i.e. similarities that you think are important and are
evolutionarily related to the ACBP sequence)? Describe your
criteria for significance for each chromosomal hit.
2. Describe the similarity detected between the human ACBP mRNA and
the BLAST hit on chromosome 6. Provide the Expect value, number of
mismatches, number of gaps, and length of alignment. Provide an
explanation for the relationship between the chromosome 6 sequence
and ACBP.
3. Describe the similarity detected between the human
ACBP mRNA and the BLAST hit on chromosome 2.
Provide the Expect value, number of mismatches, number of
gaps, and length of alignment. Provide an explanation for
the relationship between the chromosome 2 sequence and
ACBP.
4. Which sequence is the gene from which your ACBP
mRNA sequence was derived? Give the accession number.
Discuss the differences between the BLAST hits on
chromosome 6 and chromosome 2.
Do the same BLAST search as above with the human ACBP
mRNA but use the BLASTN search strategy instead of
MegaBLAST.
5. Which chromosomes gave BLAST hits in your search? Which
hits are significant (i.e. similarities that you think are important and
are evolutionarily related to the ACBP sequence)? Describe
your criteria for significance for each chromosomal hit.
6. Describe the differences in the results from the BLASTN and
MegaBLAST searches. Which search would you use to identify
distantly related sequences?
7. What are the five different BLAST programs? Briefly describe
them in your own words, indicating the type of query sequence and
the type of database searched.