
Tutorial Note 8

Sequence Motif
Basic Statistical Modeling

The Chinese University of Hong Kong


CSCI3220 Algorithms for Bioinformatics

TA: Chenyang HONG


06/11/2018
Agenda
• Sequence Motif and its representations
• Introduction to Statistical Modeling
• Naive Bayes Classifiers

CSCI3220 Algorithms for Bioinformatics Tutorial Notes | Fall 2018 2


Sequence Motif and its representations
• Definition
– Small recurrent patterns
– In biological sequences
– With particular functions
• Models to represent the motifs
– Position Weight Matrix
– K-mer representation



Position Weight Matrix (PWM)
• Initially: Empirical probability for specific letters to
occur in given position
• Pseudo-counts: Add a small count to every letter at
every position, so that letters unseen in a small
sample are not given probability zero
• Sequence logo: Visualization
• Matching score of specific sequence: Likelihood in
PWM vs. Background probability
Position 1 2 3 4 5 6 7 8
A 0.9 0.0 0.0 0.1 0.0 0.8 0.0 0.0
C 0.0 0.1 0.1 0.1 0.7 0.0 0.3 0.0
G 0.0 0.2 0.7 0.8 0.1 0.2 0.0 0.8
T 0.1 0.7 0.2 0.0 0.2 0.0 0.7 0.2



K-mer representation and counting
• Extract all k-mers which could occur in the
sequence motifs as “Features”
• Enriched k-mers = k-mers that occur frequently
• g-gapped k-mers: Error tolerance
• Similarity in terms of k-mers co-occurrence on both
sequences
• Contributions of k-mers in sequence similarity:
Depends on mis-matches (By co-occurrence of g-
gapped k-mers within the sequences)



Advantages of gapped k-mer
• Better representation of flexible motifs.
- Wildcards bring flexibility compared to regular k-mers.
- Flexible occurrence location and count of each k-mer, as
compared to fairly rigid PWMs.
• Longer regular k-mers would generate extremely
sparse feature vectors, which can easily cause
overfitting during the training step.
• Can be used to more robustly estimate k-mer
frequencies in real biological sequences.



Question 1(a)
• Suppose you are given the following DNA
sequences:
• s1 = ACCGCTAC
• s2 = ACCGCTTC
• s3 = CGCGATAC
• s4 = ACTCGCAC
• Construct a position weight matrix for these four
sequences with a pseudo-count of 1 for each
nucleotide at each position.



Solution
s1= ACCGCTAC
s2= ACCGCTTC
s3= CGCGATAC
s4= ACTCGCAC
Position  1 2 3 4 5 6 7 8

Nucleotide
A 4/8 1/8 1/8 1/8 2/8 1/8 4/8 1/8
C 2/8 4/8 4/8 2/8 3/8 2/8 1/8 5/8
G 1/8 2/8 1/8 4/8 2/8 1/8 1/8 1/8
T 1/8 1/8 2/8 1/8 1/8 4/8 2/8 1/8
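
The table above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `build_pwm` and its parameters are illustrative and do not come from the course material. Each entry is (count + pseudo-count) / (number of sequences + 4 × pseudo-count).

```python
from fractions import Fraction

def build_pwm(seqs, pseudo=1, alphabet="ACGT"):
    """Return {nucleotide: [probability at each position]} with pseudo-counts."""
    length = len(seqs[0])
    total = len(seqs) + pseudo * len(alphabet)  # denominator of every column
    return {
        n: [Fraction(sum(s[i] == n for s in seqs) + pseudo, total)
            for i in range(length)]
        for n in alphabet
    }

seqs = ["ACCGCTAC", "ACCGCTTC", "CGCGATAC", "ACTCGCAC"]
pwm = build_pwm(seqs)
for n, row in pwm.items():
    # Print in eighths to match the table (Fraction would reduce 4/8 to 1/2).
    print(n, " ".join(f"{int(p * 8)}/8" for p in row))
```

Exact fractions are used instead of floats so that the entries can be compared directly with the hand-computed table.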



Question 1(b)
• If all four nucleotides are equally likely in the
background, compute the odds of the sequence
ACGGCCAC for the motif in Part a against the
background (i.e., the data likelihood based on the
motif divided by the data likelihood based on the
background)
• The odds are (4/8 × 4/8 × 1/8 × 4/8 × 3/8 × 2/8 × 4/8 × 5/8) / (1/4)^8 = 30.
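
As a check on this value, the odds can be computed position by position. The sketch below assumes the four sequences and the pseudo-count of 1 from Part a; the helper names `motif_prob` and `odds` are made up for illustration.

```python
from fractions import Fraction

seqs = ["ACCGCTAC", "ACCGCTTC", "CGCGATAC", "ACTCGCAC"]

def motif_prob(nuc, pos, pseudo=1):
    """PWM entry with pseudo-counts: (count + pseudo) / (N + 4 * pseudo)."""
    count = sum(s[pos] == nuc for s in seqs)
    return Fraction(count + pseudo, len(seqs) + 4 * pseudo)

def odds(query, background=Fraction(1, 4)):
    """Likelihood under the motif divided by likelihood under the background."""
    ratio = Fraction(1)
    for pos, nuc in enumerate(query):
        ratio *= motif_prob(nuc, pos) / background
    return ratio

result = odds("ACGGCCAC")
print(result)  # 30
```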



Question 2(a)
• This question is about sequence similarity. You are given
the following DNA sequences r=TGCAAGCAC and
s=GCCATAGCAC.
• Each sequence is represented by the occurrence counts of
all possible k-mers, and the similarity between two
sequences is the inner product of their k-mer count
vectors. Complete the following vectors for k=2, and
compute the similarity between r and s.
k-mer AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
r
s



Solution
r=TGCAAGCAC
s=GCCATAGCAC
k-mer AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
r      1  1  1  0  2  0  0  0  0  2  0  0  0  0  1  0
s      0  1  1  1  2  1  0  0  0  2  0  0  1  0  0  0

The similarity of r and s is 1+1+4+4=10

Normalize to [0,1] (optional):
$$\|r\| = \sqrt{1+1+1+4+4+1} = \sqrt{12}, \qquad \|s\| = \sqrt{1+1+1+4+1+4+1} = \sqrt{13}$$
$$\frac{r \cdot s}{\|r\|\,\|s\|} = \frac{10}{\sqrt{12} \times \sqrt{13}}$$
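
The count vectors and the similarity can also be obtained programmatically. The following is a rough sketch rather than a reference implementation; `kmer_counts` and `similarity` are illustrative names.

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def similarity(a, b, k):
    """Inner product of the k-mer count vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[m] * cb[m] for m in ca)  # Counter gives 0 for absent k-mers

r, s = "TGCAAGCAC", "GCCATAGCAC"
kmers = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT
r_vec = [kmer_counts(r, 2)[m] for m in kmers]
s_vec = [kmer_counts(s, 2)[m] for m in kmers]
sim = similarity(r, s, 2)
cosine = sim / (sqrt(similarity(r, r, 2)) * sqrt(similarity(s, s, 2)))
print(r_vec, s_vec, sim, round(cosine, 3))
```

Using a `Counter` avoids building the full 4^k vector, which matters once k grows beyond small values.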
Question 2(b)
• Is the similarity computed based on k-mers always
higher than the similarity based on (k+1)-mers, for
any k ≥ 1?
• Solution:
• If the similarity based on k-mers is zero:
• Both the similarity based on k-mers and the similarity
based on (k+1)-mers are zero, since the two sequences
cannot have any (k+1)-mer match. In this case the two
similarities are equal, so one is not higher than the other.



Solution
• If the similarity based on k-mers is non-zero:
• The similarity based on k-mers is always higher
than the similarity based on (k+1)-mers.
• Proof:
• (1) Any (k+1)-mer match between the two
sequences contains one k-mer match.
• (2) At least one (k+1)-mer match incurs more than
one k-mer match.
• The similarity based on k-mers is therefore always
higher than the similarity based on (k+1)-mers.

Solution
Illustration with 4-mers as the (k+1)-mers (k = 3):
• Any (k+1)-mer match between the two sequences contains one k-mer match.
• At least one (k+1)-mer match incurs more than one k-mer match:
– (a) CCAGACT vs. GGAGACA: the 4-mer AGAC matches AGAC, but the next 4-mer GACT does not match GACA.
– (b) First sequence CCAGAC: the matching 4-mer AGAC ends at the last position, so one more 4-mer cannot be generated in the first sequence.

Solution in detail
• If the similarity based on k-mer is non-zero:
• For any (k+1)-mer match between the two sequences, say between r[i..i+k]
and s[j..j+k], the heads of them must incur a k-mer match, namely between
r[i..i+k-1] and s[j..j+k-1].
• The tails of them must also incur a k-mer match, but to avoid double
counting, it should be counted only if it does not involve the beginning of
another (k+1)-mer match, namely between r[i+1..i+k+1] and s[j+1..j+k+1].
There must be at least one (k+1)-mer match not satisfying this situation,
either because r[i+1..i+k+1] and s[j+1..j+k+1] are not identical, or i+k is the
last position of r, or j+k is the last position of s. In that case, the tails of this
(k+1)-mer match would be a k-mer match not counted towards other (k+1)-
mer matches, and thus this (k+1)-mer match incurs more than one k-mer
match.
• The similarity based on k-mers is therefore always higher than the similarity
based on (k+1)-mers.
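
The argument can be sanity-checked empirically on random sequences. The sketch below is not part of the original solution; the helper names, the sequence lengths and the number of trials are arbitrary choices made for illustration.

```python
from collections import Counter
import random

def similarity(a, b, k):
    """Inner product of the k-mer count vectors of a and b."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum(ca[m] * cb[m] for m in ca)

def claim_holds(a, b, max_k=10):
    """True if sim_k >= sim_(k+1) for all k, strictly whenever sim_k > 0."""
    for k in range(1, max_k + 1):
        sk, sk1 = similarity(a, b, k), similarity(a, b, k + 1)
        if sk1 > sk or (sk > 0 and sk1 == sk):
            return False
    return True

def random_dna(rng):
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(2, 12)))

rng = random.Random(0)  # fixed seed so the run is repeatable
all_ok = all(claim_holds(random_dna(rng), random_dna(rng)) for _ in range(2000))
print(all_ok)
# The sequences from Question 2(a): similarity drops from k=2 to k=3.
print(similarity("TGCAAGCAC", "GCCATAGCAC", 2), similarity("TGCAAGCAC", "GCCATAGCAC", 3))
```

A random test like this does not replace the proof; it only helps catch mistakes in how the claim is stated.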



Question 3
• This question is about g-gapped k-mers.
• (a) For DNA sequences, how many possible 2-
gapped 5-mers are there?
• There are 7C2 × 4^5 = 21 × 1024 = 21504 possible 2-gapped 5-mers.
• (b) For DNA sequence s1=GCAACGCATC, what is the
number of 2-gapped 5-mer occurrences? If a 2-gapped 5-
mer appears n times in the sequence, it is counted as n
occurrences.
• Since s1 has 10 nucleotides, it has 4 length-7 sub-sequences. Each of
them supports 7C2 = 21 2-gapped 5-mers. Therefore, the total
number of occurrences is 4(21) = 84.
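
Both counts follow from simple formulas, sketched below. The sketch assumes, as the 7C2 factor above implies, that the g wildcards may sit at any of the k+g positions; the function names are illustrative.

```python
from math import comb

def num_possible_gapped_kmers(k, g, alphabet_size=4):
    """Choose the g wildcard positions, then fill the other k positions."""
    return comb(k + g, g) * alphabet_size ** k

def num_gapped_kmer_occurrences(seq, k, g):
    """Each length-(k+g) window supports C(k+g, g) gapped k-mers."""
    windows = max(len(seq) - (k + g) + 1, 0)
    return windows * comb(k + g, g)

possible = num_possible_gapped_kmers(5, 2)
occurrences = num_gapped_kmer_occurrences("GCAACGCATC", 5, 2)
print(possible, occurrences)  # 21504 84
```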



Question 3(c)
• (c) What is the similarity between s1=GCAACGCATC and s2=
TGCAACGACA, defined as the inner product of their 2-
gapped 5-mer counts?
• The numbers of mismatches between the length-7 sub-sequences of
the two sequences are as follows:
s1 \ s2   TGCAACG  GCAACGA  CAACGAC  AACGACA
GCAACGC   6        1        5        7
CAACGCA   6        5        2        4
AACGCAT   6        6        5        3
ACGCATC   6        6        5        5

• 1 mismatch: 6C1 = 6 commonly supported 2-gapped 5-mers
• 2 mismatches: 5C0 = 1 commonly supported 2-gapped 5-mer
• 3 or more mismatches: 0 commonly supported 2-gapped 5-mers
• Similarity: 6+1 = 7
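
The mismatch-counting shortcut can be compared against a brute-force inner product of gapped k-mer counts. This is a sketch under the assumption that a gapped k-mer is written as a length-(k+g) string with `*` at the wildcard positions; all names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import comb

def windows(seq, length):
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

def similarity_by_mismatches(a, b, k, g):
    """A window pair with m <= g mismatches shares C(k+g-m, g-m) gapped k-mers."""
    total = 0
    for wa in windows(a, k + g):
        for wb in windows(b, k + g):
            m = sum(x != y for x, y in zip(wa, wb))
            if m <= g:
                total += comb(k + g - m, g - m)
    return total

def gapped_counts(seq, k, g):
    """Brute force: count every gapped k-mer supported by every window."""
    counts = Counter()
    for w in windows(seq, k + g):
        for gaps in combinations(range(k + g), g):
            counts["".join("*" if i in gaps else c for i, c in enumerate(w))] += 1
    return counts

s1, s2 = "GCAACGCATC", "TGCAACGACA"
fast = similarity_by_mismatches(s1, s2, 5, 2)
c1, c2 = gapped_counts(s1, 5, 2), gapped_counts(s2, 5, 2)
brute = sum(c1[p] * c2[p] for p in c1)
print(fast, brute)  # 7 7
```

The mismatch-based version never enumerates the gapped k-mers themselves, which is the reason the table of mismatches is enough to answer the question by hand.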



Question 3(d)
• (d) For DNA sequence s1=GCAACGCATC, how many unique
1-gapped 3-mers does it support? If a 1-gapped 3-mer
appears n times in the sequence, it is counted as one
unique 1-gapped 3-mer.
• The number of matching nucleotides for each pair of length-4
sub-sequences of s1 is as follows ("-" marks pairs already covered):
              [2..5]=CAAC [3..6]=AACG [4..7]=ACGC [5..8]=CGCA [6..9]=GCAT [7..10]=CATC
[1..4]=GCAA   1           0           1           1           3           0
[2..5]=CAAC   -           1           1           1           1           3
[3..6]=AACG   -           -           1           1           0           1
[4..7]=ACGC   -           -           -           0           1           1
[5..8]=CGCA   -           -           -           -           0           1
[6..9]=GCAT   -           -           -           -           -           0

• In order for two sub-sequences to support the same 1-gapped 3-mer,
they need to have at least 3 matching nucleotides. Therefore,
only two pairs have commonly supported 1-gapped 3-mers.
• In both cases, since only 3 of the 4 positions match, the two sub-
sequences only commonly support one 1-gapped 3-mer.
– GCAA, GCAT: GCA*
– CAAC, CATC: CA*C
• Therefore, the total number of unique 1-gapped 3-mers supported
is 7(4C1) – 2 = 26.
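
The count of 26 can be confirmed by enumerating the patterns directly. As before, this sketch assumes that a gapped k-mer is written with `*` at its wildcard position, and the function name is illustrative.

```python
from itertools import combinations

def unique_gapped_kmers(seq, k, g):
    """Set of distinct g-gapped k-mers supported by seq."""
    length = k + g
    patterns = set()
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        for gaps in combinations(range(length), g):
            patterns.add("".join("*" if j in gaps else c
                                 for j, c in enumerate(window)))
    return patterns

patterns = unique_gapped_kmers("GCAACGCATC", 3, 1)
print(len(patterns))  # 26
```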



Introduction to Statistical Modeling
• Reasons for statistical modeling
– Provide a description of concepts based on some
observations
• “A description of concepts” refers to
– Some rules
– Some information
• Related problems in bioinformatics
– Genes
– Transcription Factor Binding Motif
– Protein Domain
– Protein Families
– … etc



Example of an Intuitive Model
• Background: we have two types of observations,
fever X(1) and cough X(2), and one concept, disease Y
• Data: we have a list of observations and the corresponding concept
Index Fever X(1) Cough X(2) Disease Y
1 Yes Yes Yes
2 Yes Yes Yes
3 Yes No Yes
4 Yes No Yes
5 Yes No No
6 No Yes No
7 No Yes No
8 No No No
9 No No No
10 No No Yes



Example of an Intuitive Model
• Problem: we are interested in the relations between the
observations and the corresponding concept
– How likely is a person to have a fever given that he or she has disease Y?
– How likely is a person to have disease Y given the observations
of fever (X(1)) and cough (X(2))?
• Goal: we try to evaluate the likelihood (probability)
based on the observations



Technical Problems related to Modeling
1. Given a model, what is the likelihood of the
observation?
2. Given an observation, what is the probability that a
concept is true?
3. Given some observations, what is the likelihood of a
parameter value?
4. Maximum likelihood estimation: Given a model with
unknown parameter values, what parameter values
can maximize the data likelihood?
5. Prediction of concept: Given a model and an
observation, what is the concept most likely to be true?



Question 4
• State the three probability rules required for the
basic statistical models we studied, using
mathematical equations involving random
variables X and Y.
• Rule 1 (conditional probability):
– Pr(X, Y) = Pr(X | Y) Pr(Y)
• Rule 2 (total probability):
– Pr(Y) = Σ_X Pr(X, Y), where the summation sums over all possible values of X
• Rule 3 (Bayes’ rule):
– Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y), when Pr(Y) ≠ 0
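
The three rules, in their usual textbook form, can be checked numerically on the fever/cough data from "Example of an Intuitive Model". The sketch below takes X to be fever and Y to be the disease, which is a choice made only for this illustration, and estimates every probability by simple counting.

```python
from fractions import Fraction

# (fever, cough, disease) for the ten observations; 1 = Yes, 0 = No.
data = [(1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0),
        (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1)]

def pr(event, given=lambda row: True):
    """Empirical probability of event among the rows satisfying given."""
    pool = [row for row in data if given(row)]
    return Fraction(sum(1 for row in pool if event(row)), len(pool))

def fever(row):
    return row[0] == 1

def sick(row):
    return row[2] == 1

def fever_and_sick(row):
    return fever(row) and sick(row)

def no_fever_and_sick(row):
    return not fever(row) and sick(row)

# Rule 1: Pr(X, Y) = Pr(X | Y) Pr(Y)
rule1 = pr(fever_and_sick) == pr(fever, given=sick) * pr(sick)
# Rule 2: Pr(Y) = sum over the values of X of Pr(X, Y)
rule2 = pr(sick) == pr(fever_and_sick) + pr(no_fever_and_sick)
# Rule 3: Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y)
rule3 = pr(fever, given=sick) == pr(sick, given=fever) * pr(fever) / pr(sick)
print(rule1, rule2, rule3)
```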



Naive Bayes Classifier
• A naive Bayes classifier is a simple probabilistic
classifier based on applying Bayes' theorem with
strong (naive) independence assumptions.
• Bayes’ theorem
$$
\begin{aligned}
\Pr(Y \mid X^{(1)}, \ldots, X^{(n)})
&= \frac{\Pr(Y)\,\Pr(X^{(1)}, \ldots, X^{(n)} \mid Y)}{\Pr(X^{(1)}, \ldots, X^{(n)})} \\
&= \frac{\Pr(Y)\,\Pr(X^{(1)} \mid Y, X^{(2)}, \ldots, X^{(n)}) \cdots \Pr(X^{(n)} \mid Y)}{\Pr(X^{(1)}, \ldots, X^{(n)})}
\end{aligned}
$$
• Naive conditional independence assumption
$$
\begin{aligned}
\Pr(X^{(1)} \mid Y, X^{(2)}) &= \Pr(X^{(1)} \mid Y) \\
\Pr(X^{(1)} \mid Y, X^{(2)}, X^{(3)}) &= \Pr(X^{(1)} \mid Y) \\
\Pr(X^{(1)} \mid Y, X^{(2)}, X^{(3)}, X^{(4)}) &= \Pr(X^{(1)} \mid Y)
\end{aligned}
$$



Example: Naive Bayes Classifier
• The doctors believe that the features fever X(1) and cough X(2)
are consequences of a new disease Y.
• Data: the ten observations listed under "Example of an Intuitive Model"
• First, estimate
– Pr(X(1) = 1 | Y = 1) and
– Pr(X(2) = 0 | Y = 1)
based on the given data
• Using naive Bayes classifier, estimate
– Pr(Y = 1 | X(1) = 1, X(2) = 0) and
– Pr(Y = 1 | X(1) = 1, X(2) = 1)
• In addition, write down all parameters and a set of
independent parameters sufficient for deducing all
parameters

Answer: Naive Bayes Classifier
• All parameters:
– Pr(Y = 0), Pr(Y = 1), Pr(X(1) = 0 | Y = 0), Pr(X(1) = 1 | Y = 0),
Pr(X(1) = 0 | Y = 1), Pr(X(1) = 1 | Y = 1), Pr(X(2) = 0 | Y = 0),
Pr(X(2) = 1 | Y = 0), Pr(X(2) = 0 | Y = 1), Pr(X(2) = 1 | Y = 1)
• A set of independent parameters sufficient for
inferring all parameters:
– E.g. {Pr(Y = 0), Pr(X(1) = 0 | Y = 0), Pr(X(1) = 0 | Y = 1),
Pr(X(2) = 0 | Y = 0), Pr(X(2) = 0 | Y = 1)}
– Pr(Y = 1) = 1 - Pr(Y = 0)
– Pr(X(1) = 1 | Y = 0) = 1 - Pr(X(1) = 0 | Y = 0)
– Pr(X(1) = 1 | Y = 1) = 1 - Pr(X(1) = 0 | Y = 1)
– Pr(X(2) = 1 | Y = 0) = 1 - Pr(X(2) = 0 | Y = 0)
– Pr(X(2) = 1 | Y = 1) = 1 - Pr(X(2) = 0 | Y = 1)
• Think about: How to calculate Pr(X(1) = 0),
Pr(X(1) = 0, X(2) = 0 | Y = 0), and Pr(Y = 0 | X(1) = 0, X(2) = 0)?

Answer: Naive Bayes Classifier
• Pr(X(1) = 1 | Y = 1) = 4/5
• Pr(X(2) = 0 | Y = 1) = 3/5
$$
\begin{aligned}
\Pr(Y=1 \mid X^{(1)}=1, X^{(2)}=0)
&= \frac{\Pr(X^{(1)}=1 \mid Y=1)\,\Pr(X^{(2)}=0 \mid Y=1)\,\Pr(Y=1)}{\sum_{i=0}^{1} \Pr(X^{(1)}=1 \mid Y=i)\,\Pr(X^{(2)}=0 \mid Y=i)\,\Pr(Y=i)} \\
&= \frac{\frac{4}{5} \times \frac{3}{5} \times \frac{5}{10}}{\frac{1}{5} \times \frac{3}{5} \times \frac{5}{10} + \frac{4}{5} \times \frac{3}{5} \times \frac{5}{10}} \\
&= \frac{4}{5}
\end{aligned}
$$



Answer: Naive Bayes Classifier
• Similarly,
$$
\begin{aligned}
\Pr(Y=1 \mid X^{(1)}=1, X^{(2)}=1)
&= \frac{\Pr(X^{(1)}=1 \mid Y=1)\,\Pr(X^{(2)}=1 \mid Y=1)\,\Pr(Y=1)}{\sum_{i=0}^{1} \Pr(X^{(1)}=1 \mid Y=i)\,\Pr(X^{(2)}=1 \mid Y=i)\,\Pr(Y=i)} \\
&= \frac{\frac{4}{5} \times \frac{2}{5} \times \frac{5}{10}}{\frac{1}{5} \times \frac{2}{5} \times \frac{5}{10} + \frac{4}{5} \times \frac{2}{5} \times \frac{5}{10}} \\
&= \frac{4}{5}
\end{aligned}
$$
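
The two posteriors can be reproduced with a small naive Bayes sketch over the ten observations. The function names below are illustrative; the parameters are estimated by counting, with no pseudo-counts, to match the hand calculation.

```python
from fractions import Fraction

# (fever, cough, disease) for the ten observations; 1 = Yes, 0 = No.
data = [(1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0),
        (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1)]

def prior(y):
    """Pr(Y = y)."""
    return Fraction(sum(row[2] == y for row in data), len(data))

def likelihood(feature, value, y):
    """Pr(X(feature+1) = value | Y = y), estimated from rows with Y = y."""
    rows = [row for row in data if row[2] == y]
    return Fraction(sum(row[feature] == value for row in rows), len(rows))

def posterior(y, fever, cough):
    """Pr(Y = y | fever, cough) under the naive independence assumption."""
    def joint(label):
        return prior(label) * likelihood(0, fever, label) * likelihood(1, cough, label)
    return joint(y) / (joint(0) + joint(1))

print(posterior(1, fever=1, cough=0))  # 4/5
print(posterior(1, fever=1, cough=1))  # 4/5
```

Both posteriors come out equal because the cough likelihoods are the same for Y = 0 and Y = 1 in this data set, so cough carries no information under the model.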



Think More
• Can we estimate the term Pr(Y = 1 | X(1) = 1, X(2) = 0)
similar to estimating Pr(X(1) = 1 | Y = 1) based on
the data?
• Example: among the three observations with X(1) = 1 and
X(2) = 0 (indices 3, 4 and 5), two have Y = 1, so
$$\Pr(Y=1 \mid X^{(1)}=1, X^{(2)}=0) = \frac{2}{3}$$
• Note: The value (2/3) is different from the previous
one (4/5)!



Think More
• Why are these two estimated values different?
• It is the difference between modeling Pr(X|Y) and
modeling Pr(Y|X).
• When a generative model is used to infer Pr(Y|X),
it uses information from other examples to learn the
related parameters.
• A discriminative model uses only the relevant
examples to estimate Pr(Y|X) directly.
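
The contrast can be made concrete by computing both estimates from the same ten observations. This is a sketch only: "direct" here means a plain conditional frequency, used as a stand-in for estimating Pr(Y|X) directly, and the function names are illustrative.

```python
from fractions import Fraction

# (fever, cough, disease) for the ten observations; 1 = Yes, 0 = No.
data = [(1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0),
        (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1)]

def direct_estimate(y, fever, cough):
    """Pr(Y = y | fever, cough) from the rows with exactly these observations."""
    rows = [row for row in data if row[0] == fever and row[1] == cough]
    return Fraction(sum(row[2] == y for row in rows), len(rows))

def naive_bayes_estimate(y, fever, cough):
    """Pr(Y = y | fever, cough) via Pr(Y) Pr(X(1) | Y) Pr(X(2) | Y), all rows used."""
    def joint(label):
        rows = [row for row in data if row[2] == label]
        p_fever = Fraction(sum(row[0] == fever for row in rows), len(rows))
        p_cough = Fraction(sum(row[1] == cough for row in rows), len(rows))
        return Fraction(len(rows), len(data)) * p_fever * p_cough
    return joint(y) / (joint(0) + joint(1))

print(direct_estimate(1, 1, 0))       # 2/3, from rows 3, 4 and 5 only
print(naive_bayes_estimate(1, 1, 0))  # 4/5, from parameters learned on all rows
```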



Question 5
• Suppose you want to model transcriptional promoters. You are given the
following examples of promoters and background sequences:
• Promoters:
• ACCGCGTATA
• ATCGCTCCGT
• CGCTACGGTG
• TGGCGCATTA

• Background sequences:
• GTCAAGCTAG
• TACGGACTGC
• GCGATTGACG
• AATGCTCGAC
• You are also told that promoters occupy 0.1% of the whole genome.
Question 5(a)
• Assuming all nucleotides are independent, construct
a Naïve Bayes model for classifying whether a
nucleotide is within a promoter or not, by listing all
its parameters and the corresponding values
estimated from the examples. Define all symbols
used clearly. The parameters listed should all be
independent.



Answer
• Let Y be a binary variable indicating whether a nucleotide is
within a promoter (Y=1) or not (Y=0). Let X be a discrete
variable indicating the type of the nucleotide (X=A, C, G or
T). One possible set of independent parameters is as follows:
• Pr(Y=1) = 0.001
• Pr(X=A | Y=1) = 7/40 = 0.175
• Pr(X=C | Y=1) = 12/40 = 0.3
• Pr(X=G | Y=1) = 11/40 = 0.275
• Pr(X=A | Y=0) = 10/40 = 0.25
• Pr(X=C | Y=0) = 10/40 = 0.25
• Pr(X=G | Y=0) = 12/40 = 0.3
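
These estimates are plain nucleotide frequencies over the 40 promoter and 40 background positions, which the sketch below recomputes. The variable and function names are illustrative. Pr(X=T | Y) is not an independent parameter, since it equals one minus the other three.

```python
from collections import Counter
from fractions import Fraction

promoters = ["ACCGCGTATA", "ATCGCTCCGT", "CGCTACGGTG", "TGGCGCATTA"]
background = ["GTCAAGCTAG", "TACGGACTGC", "GCGATTGACG", "AATGCTCGAC"]

def nucleotide_freqs(seqs):
    """Pr(X = n | class), pooling every position of every sequence."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {n: Fraction(counts[n], total) for n in "ACGT"}

p_promoter = nucleotide_freqs(promoters)     # Pr(X | Y=1)
p_background = nucleotide_freqs(background)  # Pr(X | Y=0)
for n in "ACGT":
    print(n, float(p_promoter[n]), float(p_background[n]))
```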



Question 5(b)
• Use the Naïve Bayes model you constructed in Part
a to compute the probability that the whole
sequence TGCCA is within a promoter, again
assuming each nucleotide is independent of each
other.



Answer
• In order for the whole sequence to be within a
promoter, every nucleotide in the sequence should
be within a promoter. The probability is
Pr(Y=1 | X=T) Pr(Y=1 | X=G) Pr(Y=1 | X=C) Pr(Y=1 | X=C) Pr(Y=1 | X=A)
=
Pr(X=T | Y=1)Pr(Y=1) / [Pr(X=T | Y=1)Pr(Y=1) + Pr(X=T | Y=0)Pr(Y=0)]
× Pr(X=G | Y=1)Pr(Y=1) / [Pr(X=G | Y=1)Pr(Y=1) + Pr(X=G | Y=0)Pr(Y=0)]
× Pr(X=C | Y=1)Pr(Y=1) / [Pr(X=C | Y=1)Pr(Y=1) + Pr(X=C | Y=0)Pr(Y=0)]
× Pr(X=C | Y=1)Pr(Y=1) / [Pr(X=C | Y=1)Pr(Y=1) + Pr(X=C | Y=0)Pr(Y=0)]
× Pr(X=A | Y=1)Pr(Y=1) / [Pr(X=A | Y=1)Pr(Y=1) + Pr(X=A | Y=0)Pr(Y=0)]

Here Pr(X=T | Y=1) = 10/40 and Pr(X=T | Y=0) = 8/40 (one minus the other
three probabilities), and Pr(Y=0) = 0.999. Multiplying the numerator and
denominator of each ratio by 40 gives
= (0.01/8.002)(0.011/11.999)(0.012/10.002)(0.012/10.002)(0.007/9.997)
= 1.15 × 10^-15
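
The arithmetic can be checked with a short script. This sketch hard-codes the nucleotide counts from Part a, with T obtained as 40 minus the other three; the names are illustrative.

```python
from math import prod

prior_promoter = 0.001  # promoters occupy 0.1% of the genome
counts_promoter = {"A": 7, "C": 12, "G": 11, "T": 10}     # out of 40
counts_background = {"A": 10, "C": 10, "G": 12, "T": 8}   # out of 40

def posterior_promoter(nuc):
    """Pr(Y=1 | X=nuc) by Bayes' rule."""
    in_promoter = counts_promoter[nuc] / 40 * prior_promoter
    in_background = counts_background[nuc] / 40 * (1 - prior_promoter)
    return in_promoter / (in_promoter + in_background)

# Every nucleotide must be within a promoter, and they are assumed independent.
probability = prod(posterior_promoter(n) for n in "TGCCA")
print(f"{probability:.3g}")  # about 1.15e-15
```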



Check List
• What is a sequence motif?
• How can we express sequence motifs?
• How can we associate statistical models with
problems in bioinformatics?
• Why do we learn different modeling methods?
• What is classification? Also, what is regression?
• What are the differences between generative and
discriminative modeling?
