The slides of the lecture "Information-theoretic analysis of -omics data," delivered 17 November 2008 in BIO 5106 (BIOL 5506) BIOINFORMATICS.

Dec 07, 2008

An introduction

David R. Bickel

University of Ottawa

17 November 2008

Today’s class

Which genes express differently between treatment and control?

Examples of "treatments"

Medical: drug or chemotherapy applied to some patients

Basic: hormone or other chemical added to some cell cultures

Other examples?

for differential expression?

for equivalent expression?

Pick the differentially expressed genes

An average expression ratio of 1 indicates equivalent expression

Two types of differential expression

An average expression ratio less than 1 indicates under-expression

An average expression ratio greater than 1 indicates over-expression

"Average expression" is over the population, not just the observed data

The histogram of a large expression data set resembles the true distribution

A sample from the treatment group and a sample from the control group are hybridized to the same microarray slide

Each gene’s expression ratio is a measurement of its expression in the treatment group relative to its expression in the control group

                   data set #1   data set #2   data set #4   data set #6
data (n = 3)
data (n = 6)
model (n = 3)
model (n = 6)
evidence (n = 3)
evidence (n = 6)

For each data set, indicate whether the gene is equivalently expressed (E) or differentially expressed (D) according to the plot of the data, according to the model, and according to the evidence for each number of observations (3 or 6). Equivalent expression means the average expression ratio is 1.

Statistical models

Equivalent expression model

Unknown variability of expression

Expression ratio known to be 1

One unknown parameter (p = 1)

Differential expression model

Unknown variability of expression

Unknown expression ratio

Two unknown parameters (p = 2)
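One way to write the two models down is sketched below. It assumes normally distributed measurements on the log scale (the later slide on fit mentions both a log scale and a normal distribution) and treats an expression ratio of 1 as a log ratio of 0. The symbols $x_i$, $\mu$, and $\sigma^2$ are notation introduced here for illustration; they do not appear on the slides.

```latex
% x_i: log of the i-th measured expression ratio, i = 1, ..., n
% Equivalent expression model (p = 1): mean log ratio fixed at 0
x_i \sim \mathcal{N}(0, \sigma^2), \qquad \text{unknown: } \sigma^2
% Differential expression model (p = 2): mean log ratio free
x_i \sim \mathcal{N}(\mu, \sigma^2), \qquad \text{unknown: } \mu,\ \sigma^2
```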

Balancing complexity and fit

equivalent expression model (p = 1)

More complex models tend to fit data better than simple models, even if the simple models are better

Overly complex models make poor generalizations

A sample of patients may not represent the population

A single experiment may not reflect typical biological processes

$$\text{Evidence} = \frac{\text{Fit}}{\text{Complexity}}$$

How does balancing fit with complexity change your assessments?

Quality of model fit to the data

n = sample size

number of measured expression ratios

MSE = mean squared error

degree to which the model disagrees with the observed data (log scale)

$$\text{Fit} = \left(\frac{1}{\sqrt{\text{MSE}}}\right)^{n}$$

degree to which the model fits the observed data (assuming a normal distribution)
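A minimal sketch of the fit calculation for both models follows. It assumes the MSE is computed on log expression ratios, with the mean fixed at 0 for the equivalent expression model and set to the sample mean for the differential expression model. The function names and the example ratios are hypothetical, not taken from the lecture.

```python
import math

def mse_equivalent(log_ratios):
    """MSE when the mean log ratio is fixed at 0 (expression ratio of 1)."""
    return sum(x * x for x in log_ratios) / len(log_ratios)

def mse_differential(log_ratios):
    """MSE around the sample mean, the best-fitting mean log ratio."""
    mean = sum(log_ratios) / len(log_ratios)
    return sum((x - mean) ** 2 for x in log_ratios) / len(log_ratios)

def fit(mse, n):
    """Fit = (1 / sqrt(MSE)) ** n."""
    return (1.0 / math.sqrt(mse)) ** n

# Hypothetical measured expression ratios for one gene (illustration only).
ratios = [1.5, 2.0, 2.5, 1.8, 2.2, 2.0]
log_ratios = [math.log(r) for r in ratios]
n = len(log_ratios)

fit_equivalent = fit(mse_equivalent(log_ratios), n)
fit_differential = fit(mse_differential(log_ratios), n)
print(fit_equivalent, fit_differential)
```

Because the sample mean minimizes the mean squared deviation, the differential expression model never fits worse than the equivalent expression model under these assumptions, which is why fit alone cannot decide between them.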

Model complexity

n = sample size

number of measured expression ratios

p = model dimension

number of unknown parameters in the model

p = 1 for the equivalent expression model

p = 2 for the differential expression model

$$p_c = p + \frac{p(p+1)}{2(n - p + 1)}$$

effective number of parameters in the model (corrected for small n)

$$\text{Complexity} = 2.718^{p_c}$$

$$\text{Evidence} = \frac{\text{Fit}}{\text{Complexity}}$$
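The complexity penalty can be tabulated for a few sample sizes to show how the small-sample correction shrinks as n grows. This sketch reads the slide's formula as p_c = p + p(p + 1) / (2(n - p + 1)), assuming the operator between n and p is a minus sign lost in extraction, and uses e in place of 2.718. The function names are hypothetical.

```python
import math

def corrected_params(p, n):
    """Effective number of parameters: p + p(p + 1) / (2(n - p + 1))."""
    return p + p * (p + 1) / (2 * (n - p + 1))

def complexity(p, n):
    """Complexity = e ** p_c (math.e is approximately 2.718)."""
    return math.exp(corrected_params(p, n))

def evidence(fit, p, n):
    """Evidence = Fit / Complexity."""
    return fit / complexity(p, n)

# Corrected parameter counts for the two models at several sample sizes.
for n in (3, 6, 10, 25, 100):
    print(n, round(corrected_params(1, n), 3), round(corrected_params(2, n), 3))
```

Under this reading the correction stays finite at n = 3 for both models, and for large n the corrected count approaches p itself.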

Answers

If a statistical method says an equivalently expressed gene is differentially expressed, is the method useless?

If a statistical method says a differentially expressed gene is equivalently expressed, is the method useless?

The advantage of obtaining more data

How confident should you be in your assessments?

Should you obtain more data before making an assessment?

The expression data sets

             data set #1   data set #2    data set #4   data set #6
ratio        1             2              1             1.4
expression   equivalent    differential   equivalent    differential
n = 10       0.44/1.38     0.14/0.09      0.14/0.17     0.19/0.37
n = 25       0.29/0.71     0.03/0.002     4.77/1.00     0.05/0.04

1 10 4

n = 100 36/69 16/32 0.03/0.01

2 10 7

Key

n is the number of observed expression ratios.

Each ratio is $\frac{\text{Evidence differentially expressed}}{\text{Evidence equivalently expressed}}$, the weight of evidence favoring differential expression over equivalent expression.

* misleading evidence for differential expression

** misleading evidence for equivalent expression
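The weight of evidence defined in the key can be sketched end to end for a single gene. The sketch assumes normal models on log expression ratios, reads the corrected parameter count as p_c = p + p(p + 1) / (2(n - p + 1)) with Complexity = e ** p_c, and works in logarithms to avoid overflow for large n. The function names and both example data sets are hypothetical, not the lecture's data.

```python
import math

def log_evidence(log_ratios, mean, p):
    """ln(Fit / Complexity) for a normal model with the given mean log ratio."""
    n = len(log_ratios)
    mse = sum((x - mean) ** 2 for x in log_ratios) / n
    p_c = p + p * (p + 1) / (2 * (n - p + 1))
    # ln Fit = -(n / 2) ln MSE and ln Complexity = p_c
    return -0.5 * n * math.log(mse) - p_c

def evidence_ratio(ratios):
    """Evidence(differential) / Evidence(equivalent) for one gene."""
    x = [math.log(r) for r in ratios]
    mean = sum(x) / len(x)
    return math.exp(log_evidence(x, mean, p=2) - log_evidence(x, 0.0, p=1))

# Hypothetical measurements, not the lecture's data sets.
near_one = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0]
near_two = [1.9, 2.1, 2.0, 1.95, 2.05, 2.0]
print(evidence_ratio(near_one))  # a value below 1 favors equivalent expression
print(evidence_ratio(near_two))  # a value above 1 favors differential expression
```

Treating 1 as the dividing line between the two conclusions is an assumption of this sketch; the slides leave open how large a ratio should be before it is acted on.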

David Bickel (uOttawa) Information theory 17 November 2008 10 / 11

(AIC) after correcting it for small numbers of measurements

$$\text{AICc} = -2 \ln(\text{Evidence})$$

Software packages with the AIC but without the correction may be unreliable for small numbers of observations (n < 40)

Kenneth Burnham and David Anderson, Model Selection and Multi-Model Inference

www.statomics.com
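Combining the definitions from the earlier slides expresses the criterion directly in terms of the MSE. The short derivation below assumes Fit = (1/√MSE)^n and Complexity = e^{p_c}, reading 2.718 as e, with p_c the corrected number of parameters.

```latex
\begin{aligned}
\mathrm{AICc} &= -2 \ln(\mathrm{Evidence})
              = -2 \ln\!\left(\frac{\mathrm{Fit}}{\mathrm{Complexity}}\right) \\
              &= -2 \ln\!\left(\mathrm{MSE}^{-n/2}\right) + 2 \ln\!\left(e^{p_c}\right) \\
              &= n \ln(\mathrm{MSE}) + 2\, p_c
\end{aligned}
```

A lower AICc therefore corresponds to higher Evidence, so comparing two models by their evidence ratio is the same as comparing them by the difference in AICc.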

