Sequence Motif
Basic Statistical Modeling
Nucleotide   1    2    3    4    5    6    7    8
A           4/8  1/8  1/8  1/8  2/8  1/8  4/8  1/8
C           2/8  4/8  4/8  2/8  3/8  2/8  1/8  5/8
G           1/8  2/8  1/8  4/8  2/8  1/8  1/8  1/8
T           1/8  1/8  2/8  1/8  1/8  4/8  2/8  1/8
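As a rough illustration of how such a table can be used, the sketch below assumes it is a position-specific probability matrix over eight motif positions, with positions treated as independent. The names `PWM`, `motif_probability` and `consensus` are illustrative and do not come from the notes.

```python
from fractions import Fraction

# Numerators copied from the table above; every entry is over 8.
# Assumption: columns are motif positions 1..8.
_COUNTS = {
    "A": [4, 1, 1, 1, 2, 1, 4, 1],
    "C": [2, 4, 4, 2, 3, 2, 1, 5],
    "G": [1, 2, 1, 4, 2, 1, 1, 1],
    "T": [1, 1, 2, 1, 1, 4, 2, 1],
}
PWM = {nt: [Fraction(c, 8) for c in row] for nt, row in _COUNTS.items()}
MOTIF_LENGTH = 8


def motif_probability(seq):
    """Probability of seq under the matrix, multiplying one entry per position."""
    if len(seq) != MOTIF_LENGTH:
        raise ValueError("sequence length must equal the motif length")
    prob = Fraction(1)
    for pos, nt in enumerate(seq):
        prob *= PWM[nt][pos]
    return prob


def consensus():
    """Most probable nucleotide at each position."""
    return "".join(
        max("ACGT", key=lambda nt: PWM[nt][pos]) for pos in range(MOTIF_LENGTH)
    )


if __name__ == "__main__":
    best = consensus()
    print(best, motif_probability(best))
```

Exact fractions are used so that each column can be checked to sum to 1 without rounding error.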
r = (1, 1, 1, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0)
s = (0, 1, 1, 1, 2, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0)
Normalize to [0, 1] (optional)
$\|r\| = \sqrt{1+1+1+4+4+1} = \sqrt{12}$
$\|s\| = \sqrt{1+1+1+4+1+4+1} = \sqrt{13}$
$\frac{r \cdot s}{\|r\| \, \|s\|} = \frac{10}{\sqrt{12} \times \sqrt{13}}$
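The normalization can be checked numerically. The sketch below assumes the normalized similarity is the cosine of the two count vectors r and s given above; the function name `cosine_similarity` is illustrative.

```python
import math

# Count vectors r and s, copied from the notes (16 entries each).
r = [1, 1, 1, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0]
s = [0, 1, 1, 1, 2, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Inner product divided by the product of the Euclidean norms."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot(u, v) / (norm_u * norm_v)


if __name__ == "__main__":
    print(dot(r, s), dot(r, r), dot(s, s))
    print(cosine_similarity(r, s))
```

Because all counts are non-negative, the result always lies in [0, 1].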
CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Fall 2018 11
Question 2(b)
• Is the similarity computed from k-mers always
higher than the similarity computed from (k+1)-mers,
for any k ≥ 1?
• Solution:
• If the similarity based on k-mers is zero:
• The similarities based on k-mers and on (k+1)-mers
are both zero, since every (k+1)-mer match contains
a k-mer match, so the two sequences cannot have
any (k+1)-mer match.
[Figure: 4-mer examples using the sequences CCAGACT, GGAGACA, CCAGAC and GGAGACT. Annotations: "GACT does not match GACA"; "(b) Cannot generate one more 4-mer in the first sequence".]
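The k-mer versus (k+1)-mer comparison can be tried out directly. The sketch below assumes the similarity is the cosine of the k-mer count vectors. It uses CCAGACT and GGAGACA from the example, plus a hypothetical pair AAAA / CCCC that shares no k-mer at all. The function names are illustrative.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k):
    """Cosine similarity between the k-mer count vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for a missing key, so unshared k-mers add nothing.
    dot = sum(count * cb[kmer] for kmer, count in ca.items())
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sequence shorter than k has no k-mers
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    for k in (3, 4):
        print(k, kmer_similarity("CCAGACT", "GGAGACA", k))
    # Hypothetical pair with zero similarity at k = 1, hence also at k = 2.
    for k in (1, 2):
        print(k, kmer_similarity("AAAA", "CCCC", k))
```

For the pair from the example, the 3-mers AGA and GAC are shared but only the 4-mer AGAC is, so the similarity drops from k = 3 to k = 4. For the hypothetical pair it is zero at both values of k, so it is not strictly higher at k.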
• [6..9] = GCAT
• In order for two sub-sequences to support the same 1-gapped 3-
• Background sequences:
• GTCAAGCTAG
• TACGGACTGC
• GCGATTGACG
• AATGCTCGAC
• You are also told that promoters occupy 0.1% of the whole genome.
Question 5(a)
• Assuming all nucleotides are independent, construct
a Naïve Bayes model for classifying whether a
nucleotide is within a promoter or not, by listing all
its parameters and the corresponding values
estimated from the examples. Define all symbols
used clearly. The parameters listed should all be
independent.
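Part of the parameter estimation can be sketched in code. The block below covers only the background class, using the four background sequences and the 0.1% promoter prior stated above, and it assumes plain maximum-likelihood frequency estimates without pseudocounts. The names `BACKGROUND`, `PRIOR_PROMOTER` and `estimate_emissions` are illustrative.

```python
from collections import Counter

# Background sequences and class prior as given in the notes.
BACKGROUND = ["GTCAAGCTAG", "TACGGACTGC", "GCGATTGACG", "AATGCTCGAC"]
PRIOR_PROMOTER = 0.001            # promoters occupy 0.1% of the genome
PRIOR_BACKGROUND = 1 - PRIOR_PROMOTER


def nucleotide_counts(seqs):
    """Pool all sequences of one class and count each nucleotide."""
    return Counter("".join(seqs))


def estimate_emissions(seqs):
    """Estimate P(nucleotide | class) as a relative frequency (no pseudocounts)."""
    counts = nucleotide_counts(seqs)
    total = sum(counts.values())
    return {nt: counts[nt] / total for nt in "ACGT"}


if __name__ == "__main__":
    print(nucleotide_counts(BACKGROUND))
    print(estimate_emissions(BACKGROUND))
    print(PRIOR_PROMOTER, PRIOR_BACKGROUND)
```

Since the four emission probabilities of a class sum to 1, only three of them are independent parameters; the same holds for the two class priors, of which only one is independent.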
$= (0.01/8.002)(0.011/11.999)(0.012/10.002)(0.012/10.002)(0.007/9.997)$
$= 1.15 \times 10^{-15}$
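The arithmetic of this product can be checked with a few lines of code. The five ratios are copied from the expression above; the variable names are illustrative.

```python
import math

# (numerator, denominator) pairs copied from the product in the notes.
RATIOS = [
    (0.01, 8.002),
    (0.011, 11.999),
    (0.012, 10.002),
    (0.012, 10.002),
    (0.007, 9.997),
]

# Multiply the five per-nucleotide ratios together.
product = math.prod(num / den for num, den in RATIOS)

if __name__ == "__main__":
    print(f"{product:.3e}")
```

With values this small, longer sequences are usually handled by summing logarithms instead of multiplying, to avoid floating-point underflow.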