Gene Expression Networks
Gene co-expression networks
Gene Expression Data
• Identify co-expressed genes
• Classify new datasets
• Discover regulatory networks
Distance Metrics
[Figure: expression vs. time profiles for several genes]
Expression data as multidimensional vectors
[Figure: each gene's expression profile plotted as a vector, one axis per experiment]
Euclidean Distance
• XA,k = expression of gene A in condition k
• d(A,B) = sqrt( Σk (XA,k − XB,k)² )
[Figure: genes A and B plotted as points with coordinates (XA,1, XA,2) and (XB,1, XB,2) in the plane of experiments 1 and 2]
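As a quick sketch (with made-up expression values), the Euclidean distance between two genes' expression vectors can be computed as:

```python
import numpy as np

# Expression of genes A and B across 4 conditions (made-up values).
x_a = np.array([2.0, 3.5, 1.0, 0.5])
x_b = np.array([1.5, 3.0, 1.5, 1.0])

# Euclidean distance: square root of summed squared differences over conditions.
d = np.sqrt(np.sum((x_a - x_b) ** 2))
print(d)  # 1.0
```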
Distance
• Metrics for distance have a formal definition:
– d(x, y) ≥ 0
– d(x, y) = 0 if and only if x = y
– d(x, y) = d(y, x)
– Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
• The triangle inequality need not hold for a measure of “similarity.”
• Distance ~ dissimilarity = 1 − similarity
Distance Metrics
Can we capture the similarity of these patterns?
[Figure: expression vs. time profiles whose shapes are similar]
Pearson Correlation
Pearson Correlation
• Xi,j = expression of gene i in condition j
• Zi,j = z-score of gene i in experiment j:
  Zi,j = (Xi,j − mean(Xi)) / std(Xi)
• Pearson correlation over all N experiments:
  RA,B = (1/N) Σj ZA,j ZB,j
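A minimal sketch of this computation on made-up expression vectors (the z-score here uses the population standard deviation, ddof = 0):

```python
import numpy as np

# Expression of two genes across N = 5 conditions (made-up values).
x_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_b = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

def zscore(x):
    # Population z-score: subtract the mean, divide by the std (ddof=0).
    return (x - x.mean()) / x.std()

# Pearson correlation = mean of the product of z-scores.
r = np.mean(zscore(x_a) * zscore(x_b))
print(round(r, 3))  # 1.0
```

Since x_b is a scaled copy of x_a, the profiles have identical z-scores and the correlation is 1 even though their magnitudes differ.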
[Figure: expression of genes A and C vs. time (left), and the corresponding z-scores ZA and ZC vs. time (right)]
[Figure: expression of genes A, B, and C vs. time, and the corresponding z-scores ZA, ZB, and ZC]
RA,B = −0.01
RA,C = 0.999
RB,C = −0.03
[Figure: expression of genes A, B, and D vs. time, and the corresponding z-scores ZA, ZB, and ZD]
RA,B = −0.01
RA,D = −1.0
RB,D = 0.007
Distance Metrics
• Distance = 1 − correlation
[Figure: expression vs. time profiles compared using 1 − correlation as the distance]
Missing Data
• What if a particular data point is missing? (Back in the old days: there was a bubble or a hair on the array.)
– Ignore that gene in all samples
– Ignore that sample for all genes
– Replace the missing value with a constant
– “Impute” a value
• Example: compute the K most similar genes (arrays) using the available data; set the missing value to the mean of the value for these K genes (arrays)
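A minimal sketch of the K-nearest-neighbor imputation idea above, on a toy expression matrix (the data, the helper name impute_knn, and k = 2 are illustrative choices):

```python
import numpy as np

# Rows = genes, columns = samples; np.nan marks a missing value (toy data).
X = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, np.nan, 4.1],
    [0.9, 1.9, 3.1, 3.9],
    [5.0, 1.0, 0.5, 2.0],
])

def impute_knn(X, k=2):
    """Fill each missing entry with the mean of that column over the k genes
    most similar to the incomplete gene (Euclidean distance on the columns
    observed in both genes)."""
    X = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        # Distance to every other complete-on-these-columns gene.
        dists = []
        for j, other in enumerate(X):
            if j == i or np.isnan(other[obs]).any():
                continue
            dists.append((np.linalg.norm(row[obs] - other[obs]), j))
        neighbors = [j for _, j in sorted(dists)[:k]]
        for c in np.where(miss)[0]:
            X[i, c] = X[neighbors, c].mean()
    return X

X_filled = impute_knn(X, k=2)
print(round(X_filled[1, 2], 2))  # mean of X[0, 2] and X[2, 2] -> 3.05
```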
Clustering
• Multiple mechanisms could lead to up-regulation in any one condition
• Goal: find genes that have “similar” expression over many conditions.
• How do you define “similar”?
Clustering
• Intuitive idea that we want to find an underlying grouping
• In practice, this can be hard to define and implement.
• An example of unsupervised learning
Clustering 8600 human genes based on time course
of expression following
serum stimulation of fibroblasts
© American Association for the Advancement of Science. All rights reserved. This content is excluded
from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Iyer, Vishwanath R., Michael B. Eisen, et al. "The Transcriptional Program in the Response
of Human Fibroblasts to Serum." Science 283, no. 5398 (1999): 83-7.
Why cluster?
Hierarchical clustering
Two types of approaches:
•Agglomerative
•Divisive
© source unknown. All rights reserved. This content is excluded from our Creative
Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Agglomerative Clustering Algorithm
• Step 1: Initialization: each data point is in its own cluster.
• Step 2: Merge the two most similar clusters.
• Step 3: Repeat step 2 until there is only one cluster.
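The three steps above can be sketched in a few lines; this toy version uses 1-D points and single linkage (both illustrative choices):

```python
# A minimal sketch of agglomerative (bottom-up) clustering with
# single linkage, on 1-D points for clarity.
def agglomerate(points):
    """Merge clusters until one remains; return the merge order."""
    clusters = [[p] for p in points]           # Step 1: one point per cluster
    merges = []
    while len(clusters) > 1:                   # Step 3: repeat until one cluster
        # Step 2: find the pair of clusters with the smallest
        # single-linkage distance (closest pair of members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Two obvious groups: {1.0, 1.2} and {5.0, 5.1}.
for left, right, d in agglomerate([1.0, 1.2, 5.0, 5.1]):
    print(left, right, round(d, 2))
```

The closest pairs merge first, and the final merge joins the two natural groups at a much larger distance, exactly the heights a dendrogram would record.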
Agglomerative Clustering Algorithm
• Clusters Y, Z with A in Y and B in Z
• Single linkage: dY,Z = min{ d(A,B) : A ∈ Y, B ∈ Z }
• Complete linkage: dY,Z = max{ d(A,B) : A ∈ Y, B ∈ Z }
• UPGMC (Unweighted Pair Group Method using Centroids)
– Define distance as the distance between the cluster centroids
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean): average of pairwise distances:
  dY,Z = (1 / (|Y||Z|)) ΣA∈Y ΣB∈Z d(A,B)
[Figure: clusters Y (containing A) and Z (containing B)]
• Single linkage = min{dY,Z}
• Complete linkage = max{dY,Z}
• If clusters exist and are compact, it should not
matter.
• Single linkage will “chain” together groups
with one intermediate point.
• Complete linkage will not combine two groups
if even one point is distant.
Interpreting the Dendrogram
• This produces a binary tree or dendrogram
• The final cluster is the root and each data item is a leaf
• The heights of the bars indicate how close the items are
• Can ‘slice’ the tree at any distance cutoff to produce discrete clusters
• The dendrogram represents the results of the clustering; its usefulness in representing the data is mixed.
• The results will always be hierarchical, even if the data are not.
[Figure: dendrogram with distance on the vertical axis and data items (genes, etc.) as leaves]
K-means clustering
• Advantage: gives sharp partitions of the data
• Disadvantage: need to specify the number of clusters (K).
• Goal: find a set of K clusters that minimizes the distance of each point in a cluster to the cluster mean:
  Σk Σi∈Ck ||Xi − μk||²
K-means clustering algorithm
• Initialize: choose k points as cluster means
• Repeat until convergence:
– Assignment: place each point Xi in the cluster with the closest mean.
– Update: recalculate the mean for each cluster
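A minimal sketch of this iteration (Lloyd's algorithm) on a toy 2-D dataset; the data and seed are made up:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest mean,
    then recompute each mean, until the means stop moving."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # initialize
    for _ in range(iters):
        # Assignment step: index of the closest mean for each point.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the points in each cluster.
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

# Two well-separated blobs in 2-D.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, means = kmeans(X, k=2)
print(labels)  # points 0,1 share one label; points 2,3 share the other
```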
What if you choose the wrong K?
Big steps occur when we are dividing
data into natural clusters
What if we choose pathologically
bad initial positions?
What if we choose pathologically
bad initial positions?
Often, the algorithm gets a
reasonable answer, but not always!
Convergence
• K-means always converges.
• The assignment and update steps always
either reduce the objective function or leave it
unchanged.
Convergence
• However, it doesn’t always find the same solution.
[Figure: two different K-means solutions with K = 2]
Fuzzy K-means
[Figure: fuzzy cluster assignments with K = 2]
K-means
• Initialize: choose k points as cluster means
• Repeat until convergence:
– Assignment: place each point Xi in the cluster with the closest mean.
– Update: recalculate the mean for each cluster

Fuzzy k-means
• Initialize: choose k points as cluster means
• Repeat until convergence:
– Assignment: calculate the probability of each point belonging to each cluster.
– Update: recalculate the mean for each cluster using these probabilities
K-means vs. fuzzy k-means
[Figure: cluster assignments under each method]
K-medoids clustering
• Initialize: choose k data points as cluster medoids
• Repeat until convergence:
– Assignment: place each point Xi in the cluster with the closest medoid.
– Update: recalculate the medoid for each cluster
So What?
• Clusters could reveal underlying biological
processes not evident from complete list of
differentially expressed genes
• Clusters could be co-regulated. How could we
find upstream factors?
Reconstructing Regulatory Networks
Clustering vs. “modules”
• Clusters are purely phenomenological – no claim of causality
• The term “module” is used to imply a more mechanistic connection
[Figure: transcription factor A and transcription factor B each driving a set of genes with correlated expression]
Courtesy of Macmillan Publishers Limited. Used with permission.
Source: Marbach, Daniel, James C. Costello, et al. "Wisdom of Crowds for
Robust Gene Network Inference." Nature Methods 9, no. 8 (2012): 796-804.
Bayesian Networks
Bayesian Networks
• Bayesian Networks are a tool for reasoning
with probabilities
• Consist of a graph (network) and a set of
probabilities
• These can be “learned” from the data
Bayes’ theorem
• Bayes’ theorem (Bayes’ law or Bayes’ rule)
describes the probability of an event based
on prior knowledge of conditions that might
be related to the event.
Bayes’ theorem
• P(A|B) = P(B|A) P(A) / P(B)
• or
• P(A,B) = P(A|B) P(B) = P(B|A) P(A)
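A quick numeric sanity check of Bayes' theorem, with made-up probabilities:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with made-up numbers.
p_a = 0.3                      # prior P(A)
p_b_given_a = 0.8              # likelihood P(B|A)
p_b_given_not_a = 0.2          # P(B|not A)

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_b, 2), round(p_a_given_b, 3))  # 0.38 0.632
```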
An example of a DAG.
Bayesian Network (BN)
• A class of graphical models that consist of two parts, <G, P> or <G, ϴ>
• P is a set of conditional probability distributions (CPDs), one for each node conditional on its parents.
• A BN satisfies the Markov property: each node is conditionally independent of its non-descendants given its parents.
Bayesian Network (BN)
• A class of graphical models that consist of two parts, <G, P> or <G, ϴ>
• P is a set of conditional probability distributions (CPDs), one for each node conditional on its parents.
• CPD for node A: P(A|B,C)
• CPD for node B: P(B|F)
• CPD for node C: P(C)
• What are the CPDs for nodes D, E, F, and G?
Bayesian Network (BN)
• A class of graphical models that consist of two parts, <G, P> or <G, ϴ>
• P is a set of conditional probability distributions (CPDs), one for each node conditional on its parents.
• Given the Markov property, the joint probability distribution of the random variables in G can be decomposed into a product of conditional probability distributions across all nodes:
  P(A,B,C,D,E,F,G,H) = P(A|B,C) P(B|F) P(C) P(D|A,G) P(E|A) P(F) P(G) P(H|E)
• ϴ is a set of parameters in P.
Bayesian Network (BN)
• A class of graphical models that consist of two parts, <G, P> or <G, ϴ>
• ϴ is a set of parameters in P.
• The number of parameters in a Bayesian network is the number of free conditional probability entries in the network.
Bayesian Network (BN)
• A class of graphical models that consist of two parts, <G, P> or <G, ϴ>
• ϴ is a set of parameters in P
• Number of parameters equals Σi (ri − 1) qi
• n: number of nodes
• ri is the number of levels (possible values) of node i
• qi is the number of value combinations of node i’s parents
• Suppose all nodes have binary values, that is, their levels are all 2.
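Assuming the count is Σi (ri − 1) qi, i.e., the free entries per conditional table (a common convention; a count of all table entries would instead use ri · qi), the total for the example DAG with all-binary nodes can be checked in a few lines. The parent sets follow the joint factorization above:

```python
# Parameter count for the example DAG (A|B,C; B|F; C; D|A,G; E|A; F; G; H|E),
# assuming every node is binary (r_i = 2). Free parameters per node:
# (r_i - 1) * q_i, where q_i is the number of parent-value combinations.
parents = {
    "A": ["B", "C"], "B": ["F"], "C": [], "D": ["A", "G"],
    "E": ["A"], "F": [], "G": [], "H": ["E"],
}
r = {node: 2 for node in parents}  # binary levels

def n_parameters(parents, r):
    total = 0
    for node, pa in parents.items():
        q = 1
        for p in pa:               # parent-value combinations
            q *= r[p]
        total += (r[node] - 1) * q
    return total

print(n_parameters(parents, r))  # 17
```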
Parameter Learning
• Example: a BN with all nodes taking YES or NO values
Parameter Learning
• Example: a BN with all nodes taking YES or NO values
Suppose we observe the event (B=Yes, C=Yes, D=Yes, E=No, F=Yes, G=No, H=Yes).
What is P(A=Yes)?
Parameter Learning
• Example: a BN with all nodes taking YES or NO values
P(A=Y | B=Y, C=Y, D=Y, E=N, F=Y, G=N, H=Y)
∝ P(A=Y, B=Y, C=Y, D=Y, E=N, F=Y, G=N, H=Y)   (product of the conditional probability for each node)
= P(A=Y|B=Y,C=Y) P(B=Y|F=Y) P(C=Y) P(D=Y|A=Y,G=N) P(E=N|A=Y) P(F=Y) P(G=N) P(H=Y|E=N)
= 0.5 × 0.43 × 0.59 × 0.8 × (1−0.67) × 0.68 × (1−0.59) × 0.23
= 0.002;
P(A=N | B=Y, C=Y, D=Y, E=N, F=Y, G=N, H=Y)
∝ P(A=N, B=Y, C=Y, D=Y, E=N, F=Y, G=N, H=Y)
= P(A=N|B=Y,C=Y) P(B=Y|F=Y) P(C=Y) P(D=Y|A=N,G=N) P(E=N|A=N) P(F=Y) P(G=N) P(H=Y|E=N)
= (1−0.5) × 0.43 × 0.59 × 0.5 × (1−0.1) × 0.68 × (1−0.59) × 0.23
= 0.004.
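Normalizing the two unnormalized products gives the posterior probability of A; the numbers below are the conditional probabilities from the example:

```python
# Unnormalized products for A=Yes and A=No, using the example's
# conditional probability values.
p_a_yes = 0.5 * 0.43 * 0.59 * 0.8 * (1 - 0.67) * 0.68 * (1 - 0.59) * 0.23
p_a_no = (1 - 0.5) * 0.43 * 0.59 * 0.5 * (1 - 0.1) * 0.68 * (1 - 0.59) * 0.23

# Posterior: divide by the sum so the two probabilities add to 1.
posterior = p_a_yes / (p_a_yes + p_a_no)
print(round(p_a_yes, 4), round(p_a_no, 4), round(posterior, 2))  # 0.0021 0.0037 0.37
```

So given this evidence, A=Yes has posterior probability of roughly 0.37.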
Parameter Learning
• Example: a BN with all nodes taking YES or NO values
• Ancestor (sub)graph: a subgraph of the Bayes net where only the variables of interest and their ancestors are drawn
• Ancestor graph shortcut: any variable that is not in the ancestor graph for the set of variables of interest can be removed from the summation.
Parameter Learning
• Example: a BN with all nodes taking YES or NO values
P(A=Yes | F=Yes) = ΣB,C P(A=Yes | B, C) P(C) P(B | F=Yes)
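A sketch of this marginalization. Only P(A=Y|B=Y,C=Y) = 0.5, P(B=Y|F=Y) = 0.43, and P(C=Y) = 0.59 come from the example's tables; the remaining P(A=Yes|B,C) entries are made-up placeholders:

```python
# Ancestor-graph shortcut: to get P(A=Yes | F=Yes) we only need A's
# ancestors (B, C, F), so the sum runs over B and C alone.
p_b_yes = 0.43                 # P(B=Yes | F=Yes), from the example
p_c_yes = 0.59                 # P(C=Yes), from the example
p_a_given_bc = {               # P(A=Yes | B, C)
    (True, True): 0.5,         # from the example
    (True, False): 0.7,        # hypothetical
    (False, True): 0.2,        # hypothetical
    (False, False): 0.1,       # hypothetical
}

p_a_yes = sum(
    p_a_given_bc[(b, c)]
    * (p_b_yes if b else 1 - p_b_yes)
    * (p_c_yes if c else 1 - p_c_yes)
    for b in (True, False)
    for c in (True, False)
)
print(round(p_a_yes, 4))  # 0.3409
```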
Is the p53 pathway activated?
Possible Evidence
• Known p53 targets are up-regulated
– Could another pathway also cause this?
• Genes for members of the signaling pathway are expressed (ATM, ATR, CHK1, …)
– Might be true under many conditions where the pathway has not yet been activated
• Genes for members of the signaling pathway are differentially expressed
– Still does not prove a change in activity
Courtesy of Looso et al. License: CC-BY.
Source: Looso, Mario, Jens Preussner, et al. "A De Novo Assembly of the Newt Transcriptome Combined with Proteomic
Validation Identifies New Protein Families Expressed During Tissue Regeneration." Genome Biology 14, no. 2 (2013): R16.
Is the p53 pathway activated?
• Formulate the problem probabilistically
• Compute
– P(p53 pathway activated | data)
• How?
– Relatively easy to compute P(X up | TF up)
– How?
[Figure: TF with targets X1, X2, X3]
Is the p53 pathway activated?
• Formulate the problem probabilistically
• Compute
– P(p53 pathway activated | data)
• How?
– Relatively easy to compute P(X up | TF up)
– Look over lots of experiments and tabulate:
• X1 up & TF up
• X1 up & TF not up
• X1 not up & TF not up
• X1 not up & TF up
[Figure: TF with targets X1, X2, X3]
Is the p53 pathway activated?
• Formulate the problem probabilistically
• Compute
– P(p53 pathway activated | data)
• How?
– Relatively easy to compute P(X up | TF up)
– P(TF up | X up) = P(X up | TF up) P(TF up) / P(X up)
[Figure: TF with targets X1, X2, X3]
Is the p53 pathway activated?
• Formulate the problem probabilistically
• Compute
– P(p53 pathway activated | data)
• How?
– Even with P(TF up | X up), how do we compare this explanation of the data to other possible explanations?
– Can we include upstream data?
[Figure: TF with targets X1, X2, X3]
Application to Gene Networks
• Which pathway activated this set of genes?
• Either A or B or both would produce similar but not identical results.
• Bayes nets estimate conditional probability tables from lots of gene expression data.
• How often is TF B2 expressed when TF B1 is expressed, etc.
[Figure: candidate pathways involving TF A1, TF A2, TF B1, and TF B2, converging on the same set of genes]
Learning Models from Data
• Searching for the BN structure: NP-complete
– Too many possible structures to evaluate all of them, even for very small networks.
– Many algorithms have been proposed
– Incorporating some prior knowledge can reduce the search space.
• Which nodes should regulate transcription?
• Which should cause changes in phosphorylation?
– Intervention experiments help
© American Association for the Advancement of Science. All rights reserved. This content is excluded
from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Sachs, Karen, Omar Perez, et al. "Causal Protein-signaling Networks Derived From
Multiparameter Single-cell Data." Science 308, no. 5721 (2005): 523-9.
Fig. 1. Bayesian network modeling with single-cell data
© American Association for the Advancement of Science. All rights reserved. This content is excluded
from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Source: Sachs, Karen, Omar Perez, et al. "Causal Protein-signaling Networks Derived From
Multiparameter Single-cell Data." Science 308, no. 5721 (2005): 523-9.
Regression-based models
© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Predicted expression: Yg = fg(XTg) + ε
Assume that the expression of gene g is some function of the expression of its transcription factors, XTg = {Xt : t ∈ Tg}
• Xg = measured expression of gene g
• XTg = measured expression of the set of TFs potentially regulating gene g
• fg is an arbitrary function
• ε is noise
Regression-based models
© source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
• Linear model: fg(XTg) = Σt∈Tg βt,g Xt
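A minimal sketch of fitting such a linear model by ordinary least squares, on synthetic data (two hypothetical TFs with true weights 2 and −1):

```python
import numpy as np

# Fit the expression of a target gene as a weighted sum of its candidate
# TFs' expression. Data are synthetic: y = 2*tf1 - 1*tf2 + noise.
rng = np.random.default_rng(0)
n = 50
tf = rng.normal(size=(n, 2))                   # expression of two TFs
y = 2.0 * tf[:, 0] - 1.0 * tf[:, 1] + 0.01 * rng.normal(size=n)

# Ordinary least squares estimate of the coefficients beta_{t,g}.
beta, *_ = np.linalg.lstsq(tf, y, rcond=None)
print(np.round(beta, 2))  # approximately [ 2. -1.]
```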
Regression-based models
Quick Review of Information Theory
Information content of an event E: I(E) = log2(1 / P(E))
© Califon Productions, Inc. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.
Quick Review of Information Theory
Information content of an event E: I(E) = log2(1 / P(E))
Entropy, evaluated over all possible outcomes: H(S) = Σi pi I(si) = Σi pi log2(1 / pi)
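The entropy formula can be checked directly; a fair coin gives exactly one bit:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = sum_i p_i * log2(1 / p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# A fair coin carries one bit of uncertainty; a biased coin carries less.
print(entropy([0.5, 0.5]))            # 1.0
print(round(entropy([0.9, 0.1]), 3))  # 0.469
```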
Mutual Information
• Does knowing variable X reduce the uncertainty in variable Y?
• Example:
– P(Rain) depends on P(Clouds)
– P(target expressed) depends on P(TF expressed)
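One way to make this concrete is to compute mutual information from a discrete joint distribution; the joint probabilities below are made up:

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    for a joint distribution given as a 2-D table of probabilities."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Rain and clouds (made-up joint probabilities): knowing "clouds"
# reduces uncertainty about "rain", so I > 0.
joint_dependent = [[0.4, 0.1],    # clouds: rain / no rain
                   [0.05, 0.45]]  # clear:  rain / no rain
# Independent variables give I = 0.
joint_independent = [[0.25, 0.25],
                     [0.25, 0.25]]
print(round(mutual_information(joint_dependent), 3))  # 0.397
print(mutual_information(joint_independent))          # 0.0
```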
Mutual information detects
non-linear relationships
ARACNe
• Find TF-target relationships using mutual
information
ARACNe
• Data processing inequality
– Eliminate indirect interactions
– If G2 regulates G1 and G3, I(G1,G3) > 0 but adds no insight
– Remove the edge with the smallest mutual information in each triple
[Figure: triangle of genes G1, G2, G3 with the indirect edge G1–G3 removed]
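A toy sketch of the pruning rule: for each fully connected triple, drop the edge with the smallest mutual information (the MI values are made up):

```python
from itertools import combinations

# Data processing inequality step: for every fully connected triple of
# genes, drop the edge with the smallest mutual information.
mi = {
    ("G1", "G2"): 0.8,   # direct: G2 regulates G1
    ("G2", "G3"): 0.7,   # direct: G2 regulates G3
    ("G1", "G3"): 0.3,   # indirect, should be pruned
}

def edge(a, b):
    return tuple(sorted((a, b)))

def dpi_prune(mi):
    edges = set(mi)
    genes = sorted({g for e in mi for g in e})
    for trio in combinations(genes, 3):
        trio_edges = [edge(a, b) for a, b in combinations(trio, 2)]
        if all(e in edges for e in trio_edges):       # fully connected triple
            weakest = min(trio_edges, key=lambda e: mi[e])
            edges.discard(weakest)
    return edges

print(sorted(dpi_prune(mi)))  # [('G1', 'G2'), ('G2', 'G3')]
```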
Limitations
• Need huge expression datasets
• Can’t find:
– modulators that do not change in expression
– modulators that are highly correlated with the target
– modulators that both activate and repress
Huge networks!
This is just the nearest neighbors of one node of interest from ARACNe!
Nature Medicine 18, 436–440 (2012). doi:10.1038/nm.2610
Outline
• Gene expression
– Distance metrics
– Clustering
– Modules
• Bayesian networks
• Regression
• Mutual Information
• Evaluation on real and simulated data
AUPR = area under the precision-recall curve
Approach
mRNA levels do not predict protein levels
[Figure: protein expression levels (molecules/cell, log-scale base 10) vs. mRNA levels]
Raquel de Sousa Abreu, Luiz Penalva, Edward Marcotte, and Christine Vogel, Mol. BioSyst., 2009. doi:10.1039/b908315d
Source: Ning, Kang, Damian Fermin, et al. "Comparative Analysis of Different Label-free Mass Spectrometry Based Protein Abundance Estimates and Their Correlation with RNA-Seq Gene Expression Data." Journal of Proteome Research 11, no. 4 (2012): 2261-71.