Tutorial Note 8 Sequence Motif Basic Statistical Modeling

Tutorial Note 8
Sequence Motif
Basic Statistical Modeling
The Chinese University of Hong Kong

CSCI3220 Algorithms for Bioinformatics
TA: Chenyang HONG

06/11/2018
Agenda
• Sequence Motif and its representations
• Introduction to Statistical Modeling
• Naive Bayes Classifiers
CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Fall 2018 2

Sequence Motif and its representations
• Definition
– Small recurrent patterns
– In biological sequences
– With particular functions
• Models to represent the motifs
– Position Weight Matrix
– K-mer representation

Position Weight Matrix (PWM)
• Initially: the empirical probability of each letter
occurring at each position
• Pseudo-counts: add a small number to every count so
that letters unseen in a small sample do not get
probability zero
• Sequence logo: visualization of the matrix
• Matching score of a specific sequence: its likelihood
under the PWM vs. under the background model
1 2 3 4 5 6 7 8
A 0.9 0.0 0.0 0.1 0.0 0.8 0.0 0.0
C 0.0 0.1 0.1 0.1 0.7 0.0 0.3 0.0
G 0.0 0.2 0.7 0.8 0.1 0.2 0.0 0.8
T 0.1 0.7 0.2 0.0 0.2 0.0 0.7 0.2

K-mer representation and counting
• Extract all k-mers that could occur in the
sequence motifs as “features”
• Rich k-mers = Frequent occurrence
• g-gapped k-mers: Error tolerance
• Similarity between two sequences: measured by the
k-mers that occur in both
• Contribution of k-mers to sequence similarity:
depends on the number of mismatches (through the
g-gapped k-mers that co-occur in the two sequences)

Advantages of gapped k-mer
• Better representation of flexible motifs.
- Wildcard positions bring flexibility compared to regular k-mers.
- Flexible occurrence location and count of each k-mer,
compared to fairly rigid PWMs.
• Longer regular k-mers would generate extremely
sparse feature vectors, which can easily cause
overfitting during training.
• Can be used to estimate k-mer frequencies in real
biological sequences more robustly.

Question 1(a)
• Suppose you are given the following DNA
sequences:
• s1 = ACCGCTAC
• s2 = ACCGCTTC
• s3 = CGCGATAC
• s4 = ACTCGCAC
• Construct a position weight matrix for these four
sequences with a pseudo-count of 1 for each
nucleotide at each position.

Solution
s1= ACCGCTAC
s2= ACCGCTTC
s3= CGCGATAC
s4= ACTCGCAC
Position    1    2    3    4    5    6    7    8
A          4/8  1/8  1/8  1/8  2/8  1/8  4/8  1/8
C          2/8  4/8  4/8  2/8  3/8  2/8  1/8  5/8
G          1/8  2/8  1/8  4/8  2/8  1/8  1/8  1/8
T          1/8  1/8  2/8  1/8  1/8  4/8  2/8  1/8
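The construction above can be sketched in a few lines of Python. This is only an illustration: the function name `build_pwm` and its parameters are made up for this note, and each entry is assumed to be (count + pseudo-count) / (number of sequences + 4 × pseudo-count).

```python
from fractions import Fraction

def build_pwm(seqs, pseudo=1, alphabet="ACGT"):
    """Column-wise letter frequencies with a pseudo-count added to every cell."""
    length = len(seqs[0])
    total = len(seqs) + pseudo * len(alphabet)  # denominator of every column
    pwm = {b: [] for b in alphabet}
    for pos in range(length):
        column = [s[pos] for s in seqs]
        for b in alphabet:
            pwm[b].append(Fraction(column.count(b) + pseudo, total))
    return pwm

seqs = ["ACCGCTAC", "ACCGCTTC", "CGCGATAC", "ACTCGCAC"]
pwm = build_pwm(seqs)
for b in "ACGT":
    # Fractions are reduced internally, so rescale to eighths for display
    print(b, " ".join(f"{p.numerator * 8 // p.denominator}/8" for p in pwm[b]))
```

Exact fractions are used instead of floats so that the printed rows can be compared directly with the table.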

Question 1(b)
• If all four nucleotides are equally likely in the
background, compute the odds of the sequence
ACGGCCAC for the motif in Part a against the
background (i.e., the data likelihood based on the
motif divided by the data likelihood based on the
background)
• The odds are (4/8 × 4/8 × 1/8 × 4/8 × 3/8 × 2/8 × 4/8 × 5/8) / (1/4)^8 = 30.
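As a quick check of the odds, the following sketch multiplies the per-position ratios of motif probability to background probability. The helper names `motif_prob` and `odds` are illustrative only, and the pseudo-count of 1 and uniform background of 1/4 are taken from the question.

```python
from fractions import Fraction

seqs = ["ACCGCTAC", "ACCGCTTC", "CGCGATAC", "ACTCGCAC"]

def motif_prob(base, pos, pseudo=1):
    """PWM entry for `base` at 0-based position `pos`, with pseudo-counts."""
    count = sum(1 for s in seqs if s[pos] == base)
    return Fraction(count + pseudo, len(seqs) + 4 * pseudo)

def odds(query, background=Fraction(1, 4)):
    """Likelihood under the motif divided by likelihood under the background."""
    ratio = Fraction(1)
    for pos, base in enumerate(query):
        ratio *= motif_prob(base, pos) / background
    return ratio

print(odds("ACGGCCAC"))  # 30
```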

Question 2(a)
• This question is about sequence similarity. You are given
the following DNA sequences r=TGCAAGCAC and
s=GCCATAGCAC.
• Each sequence is represented by the occurrence counts of
all possible k-mers, and the similarity between two
sequences is the inner product of their k-mer count
vectors. Complete the following vectors for k=2, and
compute the similarity between r and s.
k-mer  AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
r
s

Solution
r=TGCAAGCAC
s=GCCATAGCAC
k-mer  AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
r       1  1  1  0  2  0  0  0  0  2  0  0  0  0  1  0
s       0  1  1  1  2  1  0  0  0  2  0  0  1  0  0  0
The similarity of r and s is 1+1+4+4=10
Normalize to [0,1] (optional):
‖r‖ = √(1+1+1+4+4+1) = √12
‖s‖ = √(1+1+1+4+1+4+1) = √13
Normalized similarity = 10 / (√12 × √13) ≈ 0.80
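A minimal sketch of the same computation, assuming k-mers are counted with a sliding window of step 1. The function names are illustrative and not part of the course material.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k):
    """Occurrence count of every k-mer found in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def similarity(a, b, k):
    """Inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[m] * cb[m] for m in ca)  # missing k-mers count as 0

def cosine(a, b, k):
    """Similarity normalized to [0, 1] by the two vector norms."""
    return similarity(a, b, k) / sqrt(similarity(a, a, k) * similarity(b, b, k))

r, s = "TGCAAGCAC", "GCCATAGCAC"
print(similarity(r, s, 2))        # 10
print(round(cosine(r, s, 2), 3))  # 0.801
```

Only k-mers that actually occur are stored, so the 16-entry vector never has to be built explicitly.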
Question 2(b)
• Is the similarity computed based on k-mers always
higher than the similarity based on (k+1)-mers, for
any k ≥ 1?
• Solution:
• If the similarity based on k-mers is zero:
• Both the similarity based on k-mers and that based on
(k+1)-mers are zero, since the two sequences cannot
have any (k+1)-mer match.

Solution
• If the similarity based on k-mer is non-zero:
• The similarity based on k-mers is always higher
than the similarity based on (k+1)-mers.
• Proof:
• (1) Any (k+1)-mer match between the two
sequences contains a k-mer match.
• (2) At least one (k+1)-mer match incurs more than
one k-mer match.
• The similarity based on k-mers is therefore always
higher than the similarity based on (k+1)-mers.
Solution
• (1) Any (k+1)-mer match between the two sequences contains a k-mer match.
• (2) At least one (k+1)-mer match incurs more than one k-mer match.
[Figure: two cases for a matching 4-mer. (a) CCAGACT vs. GGAGACA: the 4-mer AGAC matches AGAC, but the next 4-mers GACT and GACA do not match. (b) CCAGAC vs. GGAGACT: the first sequence ends, so it cannot generate one more 4-mer.]

Solution in detail
• If the similarity based on k-mers is non-zero:
• For any (k+1)-mer match between the two sequences, say between r[i..i+k]
and s[j..j+k], their heads must form a k-mer match, namely between
r[i..i+k-1] and s[j..j+k-1].
• Their tails, r[i+1..i+k] and s[j+1..j+k], must also form a k-mer match, but to
avoid double counting, it should be counted only if it is not the head of
another (k+1)-mer match, namely between r[i+1..i+k+1] and s[j+1..j+k+1].
There must be at least one (k+1)-mer match whose tail is not such a head,
either because r[i+1..i+k+1] and s[j+1..j+k+1] are not identical, or i+k is the
last position of r, or j+k is the last position of s. In that case, the tail of this
(k+1)-mer match is a k-mer match not counted towards any other
(k+1)-mer match, and thus this (k+1)-mer match incurs more than one
k-mer match.
• The similarity based on k-mers is therefore always higher than the similarity
based on (k+1)-mers.
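The claim can be checked empirically on the two sequences from Question 2(a). The sketch below is a sanity check, not a proof, and the helper name `similarity` is illustrative.

```python
from collections import Counter

def similarity(a, b, k):
    """Inner product of the k-mer count vectors of a and b."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum(ca[m] * cb[m] for m in ca)

r, s = "TGCAAGCAC", "GCCATAGCAC"
sims = [similarity(r, s, k) for k in range(1, 7)]
print(sims)  # [26, 10, 4, 2, 1, 0]

# Either both similarities are zero, or the k-mer one is strictly larger
for lower_k, higher_k in zip(sims, sims[1:]):
    assert (lower_k == 0 and higher_k == 0) or lower_k > higher_k
```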

Question 3
• This question is about g-gapped k-mers.
• (a) For DNA sequences, how many possible 2-gapped
5-mers are there?
• There are 7C2 × 4^5 = 21 × 1024 = 21504 possible 2-gapped 5-mers.
• (b) For DNA sequence s1=GCAACGCATC, what is the
number of 2-gapped 5-mer occurrences? If a 2-gapped
5-mer appears n times in the sequence, it is counted as n
occurrences.
• Since s1 has 10 nucleotides, it has 4 length-7 sub-sequences. Each of
them supports 7C2 = 21 2-gapped 5-mers. Therefore, the total
number of occurrences is 4(21) = 84.
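Both counts can be reproduced by brute force. The sketch assumes, as the 7C2 factor implies, that a g-gapped k-mer is a length-(k+g) pattern in which any g positions (including the ends) are wildcards, written here as `*`. The generator name `gapped_kmers` is illustrative.

```python
from itertools import combinations
from math import comb

def gapped_kmers(window, g):
    """All patterns obtained by replacing g positions of the window with '*'."""
    for gaps in combinations(range(len(window)), g):
        yield "".join("*" if i in gaps else c for i, c in enumerate(window))

k, g = 5, 2
possible = comb(k + g, g) * 4 ** k
print(possible)  # 21504

s1 = "GCAACGCATC"
windows = [s1[i:i + k + g] for i in range(len(s1) - (k + g) + 1)]
occurrences = sum(1 for w in windows for _ in gapped_kmers(w, g))
print(len(windows), occurrences)  # 4 84
```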

Question 3(c)
• (c) What is the similarity between s1=GCAACGCATC and
s2=TGCAACGACA, defined as the inner product of their
2-gapped 5-mer counts?
• The numbers of mismatches between the length-7 sub-sequences of
the two sequences are as follows:
s1 \ s2   TGCAACG  GCAACGA  CAACGAC  AACGACA
GCAACGC      6        1        5        7
CAACGCA      6        5        2        4
AACGCAT      6        6        5        3
ACGCATC      6        6        5        5
• 1 mismatch: 6C1 = 6 commonly supported 2-gapped 5-mers
• 2 mismatches: 5C0 = 1 commonly supported 2-gapped 5-mer
• 3 or more mismatches: 0 commonly supported 2-gapped 5-mers
• Similarity: 6+1 = 7
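The mismatch-table shortcut can be cross-checked by enumerating every gapped k-mer explicitly and taking the inner product of the two count vectors. This is a brute-force sketch with an illustrative function name; wildcards are written as `*`.

```python
from collections import Counter
from itertools import combinations

def gapped_counts(seq, k, g):
    """Count every g-gapped k-mer supported by each length-(k+g) window of seq."""
    counts = Counter()
    width = k + g
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        for gaps in combinations(range(width), g):
            pattern = "".join("*" if j in gaps else c for j, c in enumerate(window))
            counts[pattern] += 1
    return counts

c1 = gapped_counts("GCAACGCATC", 5, 2)
c2 = gapped_counts("TGCAACGACA", 5, 2)
sim = sum(c1[p] * c2[p] for p in c1)
print(sim)  # 7
```

Enumeration grows quickly with k and g, which is why counting mismatches between windows is the more practical route by hand.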

Question 3(d)
• (d) For DNA sequence s1=GCAACGCATC, how many unique
1-gapped 3-mers does it support? If a 1-gapped 3-mer
appears n times in the sequence, it is counted as one
unique 1-gapped 3-mer.
• The number of matching nucleotides for each pair of sub-sequences
of s1 is as follows:
               [2..5]  [3..6]  [4..7]  [5..8]  [6..9]  [7..10]
               CAAC    AACG    ACGC    CGCA    GCAT    CATC
[1..4]=GCAA      1       0       1       1       3       0
[2..5]=CAAC              1       1       1       1       3
[3..6]=AACG                      1       1       0       1
[4..7]=ACGC                              0       1       1
[5..8]=CGCA                                      0       1
[6..9]=GCAT                                              0

Question 3(d)
• In order for two sub-sequences to support the same 1-gapped
3-mer, they need to have at least 3 matching nucleotides. Therefore,
only two pairs have commonly supported 1-gapped 3-mers.
• In both cases, since only 3 of the 4 positions match, the two
sub-sequences only commonly support one 1-gapped 3-mer.
• GCAA, GCAT: GCA*
• CAAC, CATC: CA*C
• Therefore, the total number of unique 1-gapped 3-mers supported
is 7(4C1) – 2 = 26.
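The count of 26 can be confirmed by collecting the patterns in a set, so that repeated patterns are stored once. The sketch assumes the wildcard may sit at any of the 4 positions, as in the 4C1 factor above; the function name is illustrative.

```python
from itertools import combinations

def unique_gapped_kmers(seq, k, g):
    """Set of distinct g-gapped k-mers supported by seq ('*' marks a gap)."""
    width = k + g
    patterns = set()
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        for gaps in combinations(range(width), g):
            patterns.add("".join("*" if j in gaps else c for j, c in enumerate(window)))
    return patterns

patterns = unique_gapped_kmers("GCAACGCATC", 3, 1)
print(len(patterns))  # 26
print("GCA*" in patterns, "CA*C" in patterns)  # True True
```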

Introduction to Statistical Modeling
• Reasons for statistical modeling
– Provide a description of concepts based on some
observations
• “A description of concepts” refers to
– Some rules
– Some information
• Related problems in bioinformatics
– Genes
– Transcription Factor Binding Motif
– Protein Domain
– Protein Families
– … etc

Example of an Intuitive Model
• Background: we have two types of observations,
Fever X(1) and Cough X(2), and one concept, Disease Y
• Data: we have a list of observations and the corresponding concept
Index Fever X(1) Cough X(2) Disease Y
1 Yes Yes Yes
2 Yes Yes Yes
3 Yes No Yes
4 Yes No Yes
5 Yes No No
6 No Yes No
7 No Yes No
8 No No No
9 No No No
10 No No Yes

Example of an Intuitive Model
• Problem: we are interested in the relations between the
observations and the corresponding concept
– How likely is a person to have a fever given that they have disease Y?
– How likely is a person to have disease Y given the observations
of fever (X(1)) and cough (X(2))?
• Goal: we try to evaluate these likelihoods (probabilities)
based on the observations

Technical Problems related to Modeling
1. Given a model, what is the likelihood of the
observation?
2. Given an observation, what is the probability that a
concept is true?
3. Given some observations, what is the likelihood of a
parameter value?
4. Maximum likelihood estimation: Given a model with
unknown parameter values, what parameter values
can maximize the data likelihood?
5. Prediction of concept: Given a model and an
observation, what is the concept most likely to be true?

Question 4
• State the three probability rules required for the
basic statistical models we studied, using
mathematical equations involving random
variables X and Y.
• Rule 1 (conditional probability):
– Pr(X, Y) = Pr(X | Y) Pr(Y)
• Rule 2 (total probability):
– Pr(Y) = Σ_X Pr(X, Y), where the summation sums over all possible values of X
• Rule 3 (Bayes' rule):
– Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y), when Pr(Y) ≠ 0
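The three rules can be checked numerically on the fever/disease data from the intuitive-model example. This is a sketch under assumptions: X is taken to be fever, Y disease, Yes is coded as 1, and the rules are written in their standard textbook forms, which may differ in notation from the lecture slides.

```python
from fractions import Fraction

# (fever, disease) for the 10 observations in the example table
data = [(1, 1), (1, 1), (1, 1), (1, 1), (1, 0),
        (0, 0), (0, 0), (0, 0), (0, 0), (0, 1)]
n = len(data)

def pr(x=None, y=None):
    """Empirical Pr(X=x, Y=y); None means the variable is summed out."""
    hits = [d for d in data if (x is None or d[0] == x) and (y is None or d[1] == y)]
    return Fraction(len(hits), n)

def pr_x_given_y(x, y):
    return pr(x, y) / pr(y=y)

def pr_y_given_x(y, x):
    return pr(x, y) / pr(x=x)

# Rule 1: Pr(X, Y) = Pr(X | Y) Pr(Y)
assert pr(1, 1) == pr_x_given_y(1, 1) * pr(y=1)
# Rule 2: Pr(Y) = sum over x of Pr(X = x, Y)
assert pr(y=1) == pr(0, 1) + pr(1, 1)
# Rule 3: Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y)
assert pr_x_given_y(1, 1) == pr_y_given_x(1, 1) * pr(x=1) / pr(y=1)
print(pr_x_given_y(1, 1))  # 4/5
```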

Naive Bayes Classifier
• A naive Bayes classifier is a simple probabilistic
classifier based on applying Bayes' theorem with
strong (naive) independence assumptions.
• Bayes’ theorem
Pr(Y | X(1), …, X(n))
= Pr(Y) Pr(X(1), …, X(n) | Y) / Pr(X(1), …, X(n))
= Pr(Y) Pr(X(1) | Y, X(2), …, X(n)) … Pr(X(n) | Y) / Pr(X(1), …, X(n))
• Naive conditional independence assumption
Pr(X(1) | Y, X(2)) = Pr(X(1) | Y)
Pr(X(1) | Y, X(2), X(3)) = Pr(X(1) | Y)
Pr(X(1) | Y, X(2), X(3), X(4)) = Pr(X(1) | Y)

Example: Naive Bayes Classifier
• The doctors believe that the features fever X(1) and cough X(2)
are consequences of a new disease Y.
• First, estimate
– Pr(X(1) = 1 | Y = 1) and
– Pr(X(2) = 0 | Y = 1)
based on the given data
• Using a naive Bayes classifier, estimate
– Pr(Y = 1 | X(1) = 1, X(2) = 0) and
– Pr(Y = 1 | X(1) = 1, X(2) = 1)
• In addition, write down all parameters and a set of
independent parameters sufficient for deducing all
parameters
Index Fever X(1) Cough X(2) Disease Y
1 Yes Yes Yes
2 Yes Yes Yes
3 Yes No Yes
4 Yes No Yes
5 Yes No No
6 No Yes No
7 No Yes No
8 No No No
9 No No No
10 No No Yes

Answer: Naive Bayes Classifier
• All parameters:
– Pr(Y = 0), Pr(Y = 1), Pr(X(1) = 0 | Y = 0), Pr(X(1) = 1 | Y = 0),
Pr(X(1) = 0 | Y = 1), Pr(X(1) = 1 | Y = 1), Pr(X(2) = 0 | Y = 0),
Pr(X(2) = 1 | Y = 0), Pr(X(2) = 0 | Y = 1), Pr(X(2) = 1 | Y = 1)
• A set of independent parameters sufficient for
inferring all parameters:
– E.g. {Pr(Y = 0), Pr(X(1) = 0 | Y = 0), Pr(X(1) = 0 | Y = 1),
Pr(X(2) = 0 | Y = 0), Pr(X(2) = 0 | Y = 1)}
– Pr(Y = 1) = 1 - Pr(Y = 0)
– Pr(X(1) = 1 | Y = 0) = 1 - Pr(X(1) = 0 | Y = 0)
– Pr(X(1) = 1 | Y = 1) = 1 - Pr(X(1) = 0 | Y = 1)
– Pr(X(2) = 1 | Y = 0) = 1 - Pr(X(2) = 0 | Y = 0)
– Pr(X(2) = 1 | Y = 1) = 1 - Pr(X(2) = 0 | Y = 1)
• Think about: How to calculate Pr(X(1) = 0),
Pr(X(1) = 0, X(2) = 0 | Y = 0), and
Pr(Y = 0 | X(1) = 0, X(2) = 0)?
• Pr(X(1) = 1 | Y = 1) = 4/5
• Pr(X(2) = 0 | Y = 1) = 3/5
Pr(Y = 1 | X(1) = 1, X(2) = 0)
= Pr(X(1) = 1 | Y = 1) Pr(X(2) = 0 | Y = 1) Pr(Y = 1) / Σ_{i=0}^{1} Pr(X(1) = 1 | Y = i) Pr(X(2) = 0 | Y = i) Pr(Y = i)
= (4/5 × 3/5 × 5/10) / (1/5 × 3/5 × 5/10 + 4/5 × 3/5 × 5/10)
= 4/5

• Similarly,
Pr(Y = 1 | X(1) = 1, X(2) = 1)
= Pr(X(1) = 1 | Y = 1) Pr(X(2) = 1 | Y = 1) Pr(Y = 1) / Σ_{i=0}^{1} Pr(X(1) = 1 | Y = i) Pr(X(2) = 1 | Y = i) Pr(Y = i)
= (4/5 × 2/5 × 5/10) / (1/5 × 2/5 × 5/10 + 4/5 × 2/5 × 5/10)
= 4/5
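The two posteriors can be reproduced from the data table with a small naive Bayes sketch. The encoding (Yes = 1, No = 0) and the function names `prior`, `likelihood` and `posterior` are assumptions made for this illustration; parameters are estimated by simple counting, without pseudo-counts.

```python
from fractions import Fraction

# (fever, cough, disease) for rows 1..10 of the table, Yes = 1 and No = 0
data = [(1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0),
        (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1)]

def prior(y):
    """Pr(Y = y), estimated by counting."""
    return Fraction(sum(1 for d in data if d[2] == y), len(data))

def likelihood(feature, value, y):
    """Pr(X(feature+1) = value | Y = y), estimated among rows with Y = y."""
    rows = [d for d in data if d[2] == y]
    return Fraction(sum(1 for d in rows if d[feature] == value), len(rows))

def posterior(y, x1, x2):
    """Pr(Y = y | X(1) = x1, X(2) = x2) under the naive independence assumption."""
    def joint(label):
        return likelihood(0, x1, label) * likelihood(1, x2, label) * prior(label)
    return joint(y) / (joint(0) + joint(1))

print(likelihood(0, 1, 1), likelihood(1, 0, 1))  # 4/5 3/5
print(posterior(1, 1, 0), posterior(1, 1, 1))    # 4/5 4/5
```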

Think More
• Can we estimate the term Pr(Y = 1 | X(1) = 1, X(2) = 0)
directly from the data, in the same way as we estimated
Pr(X(1) = 1 | Y = 1)?
• Example: Pr(Y = 1 | X(1) = 1, X(2) = 0) = 2/3
(of the 3 examples with fever and no cough, rows 3 to 5,
2 have the disease)
• Note: The value (2/3) is different from the previous
one (4/5)!
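The direct estimate only looks at the rows that match the observation. A minimal sketch, using the same Yes = 1, No = 0 encoding of the table (an assumption of this illustration):

```python
from fractions import Fraction

# (fever, cough, disease) for rows 1..10 of the table, Yes = 1 and No = 0
data = [(1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0),
        (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1)]

# Keep only the examples with fever and without cough
matching = [d for d in data if d[0] == 1 and d[1] == 0]
direct = Fraction(sum(d[2] for d in matching), len(matching))
print(len(matching), direct)  # 3 2/3
```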

Think More
• Why are these two estimated values different?
• It is the difference between modeling Pr(X|Y) and
modeling Pr(Y|X).
• When a generative model is used to infer Pr(Y|X),
it uses information from other examples to learn the
related parameters.
• A discriminative model uses only the relevant
examples to estimate Pr(Y|X) directly.

Question 5
• Suppose you want to model transcriptional promoters. You are given the
following examples of promoters and background sequences:
• Promoters:
• ACCGCGTATA
• ATCGCTCCGT
• CGCTACGGTG
• TGGCGCATTA
• Background sequences:
• GTCAAGCTAG
• TACGGACTGC
• GCGATTGACG
• AATGCTCGAC
• You are also told that promoters occupy 0.1% of the whole genome.

Question 5(a)
• Assuming all nucleotides are independent, construct
a Naïve Bayes model for classifying whether a
nucleotide is within a promoter or not, by listing all
its parameters and the corresponding values
estimated from the examples. Define all symbols
used clearly. The parameters listed should all be
independent.

Answer
• Let Y be a binary variable indicating whether a nucleotide is
within a promoter (Y=1) or not (Y=0). Let X be a discrete
variable indicating the type of the nucleotide (X=A, C, G or
T). One possible set of independent parameters is as follows:
• Pr(Y=1) = 0.001
• Pr(X=A | Y=1) = 7/40 = 0.175
• Pr(X=C | Y=1) = 12/40 = 0.3
• Pr(X=G | Y=1) = 11/40 = 0.275
• Pr(X=A | Y=0) = 10/40 = 0.25
• Pr(X=C | Y=0) = 10/40 = 0.25
• Pr(X=G | Y=0) = 12/40 = 0.3
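The emission parameters are plain nucleotide frequencies over the 40 positions of each class. The sketch below recomputes them; the function name `emission` and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

promoters = ["ACCGCGTATA", "ATCGCTCCGT", "CGCTACGGTG", "TGGCGCATTA"]
background = ["GTCAAGCTAG", "TACGGACTGC", "GCGATTGACG", "AATGCTCGAC"]

def emission(seqs):
    """Pr(X = base | class), estimated by pooling all positions of the class."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {b: Fraction(counts[b], total) for b in "ACGT"}

pr_x_given_promoter = emission(promoters)
pr_x_given_background = emission(background)
for b in "ACG":  # Pr(X=T | Y) follows as 1 minus the other three
    print(b, float(pr_x_given_promoter[b]), float(pr_x_given_background[b]))
```

Pr(X=T | Y=1) = 10/40 and Pr(X=T | Y=0) = 8/40 are not independent parameters, since each set of four probabilities sums to 1.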

Question 5(b)
• Use the Naïve Bayes model you constructed in Part
a to compute the probability that the whole
sequence TGCCA is within a promoter, again
assuming each nucleotide is independent of each
other.

Answer
• In order for the whole sequence to be within a
promoter, every nucleotide in the sequence should
be within a promoter. The probability is
Pr(Y=1 | X=T) × Pr(Y=1 | X=G) × Pr(Y=1 | X=C) × Pr(Y=1 | X=C) × Pr(Y=1 | X=A)
=
Pr(X=T | Y=1)Pr(Y=1) / [Pr(X=T | Y=1)Pr(Y=1) + Pr(X=T | Y=0)Pr(Y=0)]
× Pr(X=G | Y=1)Pr(Y=1) / [Pr(X=G | Y=1)Pr(Y=1) + Pr(X=G | Y=0)Pr(Y=0)]
× Pr(X=C | Y=1)Pr(Y=1) / [Pr(X=C | Y=1)Pr(Y=1) + Pr(X=C | Y=0)Pr(Y=0)]
× Pr(X=C | Y=1)Pr(Y=1) / [Pr(X=C | Y=1)Pr(Y=1) + Pr(X=C | Y=0)Pr(Y=0)]
× Pr(X=A | Y=1)Pr(Y=1) / [Pr(X=A | Y=1)Pr(Y=1) + Pr(X=A | Y=0)Pr(Y=0)]
= (0.01/8.002)(0.011/11.999)(0.012/10.002)(0.012/10.002)(0.007/9.997)
(each numerator and denominator multiplied by 40, with Pr(X=T | Y=1) = 10/40 and Pr(X=T | Y=0) = 8/40)
= 1.15 × 10^-15
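The product can be verified numerically. This sketch re-estimates the emission probabilities from the example sequences and applies Bayes' rule per nucleotide; the names `PRIOR`, `emission` and `pr_promoter` are illustrative.

```python
from collections import Counter
from math import prod

promoters = ["ACCGCGTATA", "ATCGCTCCGT", "CGCTACGGTG", "TGGCGCATTA"]
background = ["GTCAAGCTAG", "TACGGACTGC", "GCGATTGACG", "AATGCTCGAC"]
PRIOR = 0.001  # Pr(Y=1): promoters occupy 0.1% of the genome

def emission(seqs):
    """Pr(X = base | class), estimated by pooling all positions of the class."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

e1, e0 = emission(promoters), emission(background)

def pr_promoter(base):
    """Pr(Y=1 | X=base) by Bayes' rule."""
    num = e1[base] * PRIOR
    return num / (num + e0[base] * (1 - PRIOR))

p = prod(pr_promoter(b) for b in "TGCCA")
print(f"{p:.2e}")  # 1.15e-15
```

`math.prod` requires Python 3.8 or later.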

Check List
• What is a sequence motif?
• How can we represent sequence motifs?
• How can we associate statistical models with
problems in bioinformatics?
• Why do we learn different modeling methods?
• What is classification? Also, what is regression?
• What are the differences between generative and
discriminative modeling?
