Information-theoretic analysis of -omics data

An introduction

David R. Bickel

University of Ottawa

17 November 2008

David Bickel (uOttawa) Information theory 17 November 2008 1 / 11
Today’s class

Differential gene/protein/metabolite expression
Which genes express differently between treatment and control?
Examples of "treatments"
Medical: drug or chemotherapy applied to some patients
Basic: hormone or other chemical added to some cell cultures
Other examples?

How much information or evidence is in the measurements
for differential expression?
for equivalent expression?

Pick the differentially expressed genes

What is differential gene/protein/metabolite expression?
An average expression ratio of 1 indicates equivalent expression
Two types of differential expression
An average expression ratio less than 1 indicates under-expression
An average expression ratio greater than 1 indicates over-expression
"Average expression" is over the population, not just the observed data
The histogram of a large expression data set resembles the true
distribution

Gene expression ratios measured by microarrays
A sample from the treatment group and a sample from the control
group are hybridized to the same microarray slide
Each gene’s expression ratio is a measurement of its expression in the
treatment group relative to its expression in the control group

Based on the expression data, which genes are differentially expressed?

                    data set #1    data set #2    data set #4    data set #6
data (n = 3)
data (n = 6)
model (n = 3)
model (n = 6)
evidence (n = 3)
evidence (n = 6)

For each data set, indicate whether the gene is equivalently expressed (E)
or differentially expressed (D) according to the plot of the data, according
to the model, and according to the evidence for each number of
observations (3 or 6). Equivalent expression means the average expression
ratio is 1.

Statistical models

p stands for the number of unknown parameters in a model
Equivalent expression model
Unknown variability of expression
Expression ratio known to be 1
One unknown parameter (p = 1)

Differential expression model
Unknown variability of expression
Unknown expression ratio
Two unknown parameters (p = 2)

How do the model plots change your initial assessments?

Balancing complexity and fit

The differential expression model (p = 2) is more complex than the
equivalent expression model (p = 1)
More complex models tend to fit data better than simple models,
even if the simple models are better
Overly complex models make poor generalizations
A sample of patients may not represent the population
A single experiment may not reflect typical biological processes
$$\text{Evidence} = \frac{\text{Fit}}{\text{Complexity}}$$
How does balancing fit with complexity change your assessments?

Quality of model fit to the data

n = sample size
number of measured expression ratios

MSE = mean of squared errors of the model
degree to which the model disagrees with the observed data (log scale)
$$\text{Fit} = \left(\frac{1}{\sqrt{\text{MSE}}}\right)^{n}$$
degree to which the model fits the observed data (assuming a normal
distribution)
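
The Fit formula can be tried on a handful of numbers. The sketch below is only an illustration under stated assumptions: the slides do not spell out how the errors are computed, so it assumes natural-log expression ratios, an error of "log ratio minus 0" for the equivalent expression model, and "log ratio minus the sample mean log ratio" for the differential expression model. The function names and the six example ratios are hypothetical.

```python
import math

def mse_equivalent(ratios):
    """MSE assuming the true ratio is 1, i.e. the true log ratio is 0 (assumption)."""
    logs = [math.log(r) for r in ratios]
    return sum(x * x for x in logs) / len(logs)

def mse_differential(ratios):
    """MSE assuming the log ratio is estimated by the sample mean (assumption)."""
    logs = [math.log(r) for r in ratios]
    mean = sum(logs) / len(logs)
    return sum((x - mean) ** 2 for x in logs) / len(logs)

def fit(mse, n):
    """Fit = (1 / sqrt(MSE))^n, as defined on the slide."""
    return (1.0 / math.sqrt(mse)) ** n

# Hypothetical expression ratios, invented for illustration only.
ratios = [1.3, 0.9, 1.6, 1.1, 1.4, 1.2]
n = len(ratios)

print("Fit, equivalent model:  ", fit(mse_equivalent(ratios), n))
print("Fit, differential model:", fit(mse_differential(ratios), n))
```

Under these assumptions the differential model never has a larger MSE than the equivalent model on the same data, so its Fit is never smaller. That is why Fit alone cannot decide between the two models.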

Model complexity

n = sample size
number of measured expression ratios

p = model dimension
number of unknown parameters in the model
p = 1 for the equivalent expression model
p = 2 for the differential expression model

$$p_c = p + \frac{p\,(p + 1)}{2\,(n - p + 1)}$$
effective number of parameters in the model (corrected for small n)

$$\text{Complexity} = 2.718^{p_c}$$
$$\text{Evidence} = \frac{\text{Fit}}{\text{Complexity}}$$
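
To see how the complexity penalty offsets a better fit, the quantities on this slide can be computed directly. This is a minimal sketch, not the author's software. It assumes the denominator of the correction reads 2(n − p + 1) and treats 2.718 as the constant e. The function names and the two MSE values are hypothetical.

```python
import math

def corrected_dimension(p, n):
    """p_c = p + p(p + 1) / (2(n - p + 1)); the minus sign is an assumed reading."""
    return p + p * (p + 1) / (2 * (n - p + 1))

def complexity(p, n):
    """Complexity = e^(p_c), taking 2.718 on the slide to mean e."""
    return math.exp(corrected_dimension(p, n))

def evidence(mse, p, n):
    """Evidence = Fit / Complexity, with Fit = (1 / sqrt(MSE))^n."""
    fit = (1.0 / math.sqrt(mse)) ** n
    return fit / complexity(p, n)

# Hypothetical MSE values for one gene measured n = 6 times.
n = 6
mse_equiv = 0.09   # equivalent expression model, p = 1
mse_diff = 0.04    # differential expression model, p = 2

weight = evidence(mse_diff, 2, n) / evidence(mse_equiv, 1, n)
print("p_c (p = 1):", corrected_dimension(1, n))
print("p_c (p = 2):", corrected_dimension(2, n))
print("weight of evidence for differential expression:", weight)
```

A weight above 1 favors differential expression and a weight below 1 favors equivalent expression. Because p_c grows as n shrinks, the two-parameter model pays a larger penalty when there are few measurements.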

Answers

How do our analyses compare to the truth?
If a statistical method says an equivalently expressed gene is
differentially expressed, is the method useless?
If a statistical method says a differentially expressed gene is
equivalently expressed, is the method useless?
The advantage of obtaining more data

The best possible assessment given the available data
How confident should you be in your assessments?
Should you obtain more data before making an assessment?

The expression data sets
              data set #1    data set #2        data set #4    data set #6
ratio         1              2                  1              1.4
expression    equivalent     differential       equivalent     differential
n = 10        0.44/1.38      0.14/0.09          0.14/0.17      0.19/0.37
n = 25        0.29/0.71      0.03/0.002         4.77/1.00      0.05/0.04
n = 100       36/69          1×10^-4/2×10^-7    16/32          0.03/0.01

Key
n is the number of observed expression ratios.
Each ratio is
(Evidence differentially expressed) / (Evidence equivalently expressed),
the weight of evidence
favoring differential expression over equivalent expression.
* misleading evidence for differential expression
** misleading evidence for equivalent expression
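
The table entries can be read mechanically: divide the evidence for differential expression by the evidence for equivalent expression and compare the favored model with the true status of the gene. The sketch below does this for the values in the table. It assumes that the n = 100 entry with powers of ten belongs to data set #2 and has negative exponents. The variable names are hypothetical.

```python
# Each entry: (evidence for differential, evidence for equivalent), copied from the table.
# The n = 100 value for data set #2 is an assumed reading of the table.
table = [
    ("data set #1", "equivalent",   {10: (0.44, 1.38), 25: (0.29, 0.71),  100: (36, 69)}),
    ("data set #2", "differential", {10: (0.14, 0.09), 25: (0.03, 0.002), 100: (1e-4, 2e-7)}),
    ("data set #4", "equivalent",   {10: (0.14, 0.17), 25: (4.77, 1.00),  100: (16, 32)}),
    ("data set #6", "differential", {10: (0.19, 0.37), 25: (0.05, 0.04),  100: (0.03, 0.01)}),
]

misleading = []
for name, truth, rows in table:
    for n in sorted(rows):
        ev_diff, ev_equiv = rows[n]
        weight = ev_diff / ev_equiv          # weight of evidence for differential expression
        favored = "differential" if weight > 1 else "equivalent"
        if favored != truth:
            misleading.append((name, n))
        print(f"{name}, n = {n}: weight = {weight:.3g}, favors {favored}, truth {truth}")

print("misleading evidence:", misleading)
```

Under this reading, every data set has evidence favoring its true status at n = 100, which illustrates the advantage of obtaining more data.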
Further study

The method presented is based on the Akaike information criterion
(AIC) after correcting it for small numbers of measurements
AICc = −2 ln(Evidence)
Software packages with the AIC but without the correction may be
unreliable for small numbers of observations (n < 40)
Kenneth Burnham and David Anderson, Model Selection and
Multi-Model Inference
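
The link between Evidence and an information criterion can be made explicit with a short derivation. It uses only the definitions given earlier, Fit = (1/√MSE)^n and Complexity = e^{p_c}, and assumes that the 2.718 on the complexity slide stands for e.

```latex
\begin{align*}
\ln(\text{Evidence}) &= \ln(\text{Fit}) - \ln(\text{Complexity})
                      = -\tfrac{n}{2}\,\ln(\text{MSE}) - p_c \\
-2\,\ln(\text{Evidence}) &= n\,\ln(\text{MSE}) + 2\,p_c
\end{align*}
```

The first term grows as the model fits worse and the second grows with the effective number of parameters, so a smaller value of −2 ln(Evidence) corresponds to stronger evidence for the model.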

These slides and figures will be on the lab website
www.statomics.com
