
:: Microarray analysis ::

•Data pre-processing
•Normalization
•Molecular diagnosis
•Statistical classification
Florian Markowetz
florian@genomics.princeton.edu
From experiment to data
Raw data are not mRNA concentrations
• tissue contamination
• RNA degradation
• amplification efficiency
• reverse transcription efficiency
• hybridization efficiency and specificity
• clone identification and mapping
• PCR yield, contamination
• spotting efficiency
• DNA support binding
• other array manufacturing related issues
• image segmentation
• signal quantification
• "background" correction
Quality control:
Noise and reliable signal
Three levels of quality control (arrays 1 ... n): probe level, array level, gene level
Probe level: quality of the expression measurement of one spot on one particular array
Array level: quality of the expression measurement on one particular glass slide
Gene level: quality of the expression measurement of one probe across all arrays
Probe-level quality control
• Individual spots printed on the slide
• Sources:
– faulty printing, uneven distribution, contamination with debris, magnitude
of signal relative to noise, poorly measured spots;
• Visual inspection:
– hairs, dust, scratches, air bubbles, dark regions, regions with haze
• Spot quality:
– Brightness: foreground/background ratio
– Uniformity: variation in pixel intensities and ratios of intensities within a
spot
– Morphology: area, perimeter, circularity.
– Spot Size: number of foreground pixels
• Action:
– set measurements to NA (missing values)
– local normalization procedures which account for regional idiosyncrasies.
– use weights for measurements to indicate reliability in later analysis.
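A minimal sketch of the first action (setting unreliable measurements to NA), using the brightness criterion from the list above. The function name `mask_weak_spots`, the use of `None` for NA, and the cutoff `min_ratio=1.5` are illustrative assumptions, not values from the slides.

```python
def mask_weak_spots(foreground, background, min_ratio=1.5):
    """Replace dim spots by None (standing in for NA).

    A spot is kept only if its foreground/background ratio reaches
    min_ratio. The cutoff 1.5 is an arbitrary illustrative value.
    """
    masked = []
    for fg, bg in zip(foreground, background):
        if bg > 0 and fg / bg >= min_ratio:
            masked.append(fg)
        else:
            masked.append(None)  # unreliable spot: treat as missing
    return masked


# Three spots on one array: the second is barely above background.
print(mask_weak_spots([500, 120, 900], [100, 100, 300]))
```

Instead of discarding a spot, the same ratio could be kept as a weight for later analysis, which is the third action in the list.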
Spot identification
Individual spots are recognized; size and shape might be adjusted per spot (automatically, with fine adjustments by hand).
Additional manual flagging of bad (X) or non-present (NA) spots.
[Figure: example spots flagged NA and X; poor vs. good spot quality]
Different spot identification methods: fixed circles, circles with variable size, arbitrary spot shape (morphological opening)
Spot identification
• The signal of the spots is quantified.
• Summary of the pixel intensities of a single spot: mean / median / mode / 75% quantile
• Local background
[Figure: histogram of pixel intensities of a single spot; "donuts"]
• Software: GenePix, QuantArray, ScanAlyze
Array level quality control
• Problems:
– array fabrication defect
– problem with RNA extraction
– failed labeling reaction
– poor hybridization conditions
– faulty scanner

• Quality measures:
– Percentage of spots with no signal (~30% excluded spots)
– Range of intensities
– (Av. Foreground)/(Av. Background) > 3 in both channels
– Distribution of spot signal area
– Amount of adjustment needed: how substantially the signals have to be changed to make slides comparable.
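Two of the quality measures above can be sketched in a few lines. The function names and the use of `None` for spots without signal are assumptions made for illustration; the threshold of 3 is the one given on the slide, and the check would be run once per channel.

```python
from statistics import mean


def foreground_background_ratio(foreground, background):
    """(Average foreground) / (average background) for one channel."""
    return mean(foreground) / mean(background)


def passes_ratio_check(foreground, background, min_ratio=3.0):
    """True if the channel exceeds the ratio threshold (3 on the slide)."""
    return foreground_background_ratio(foreground, background) > min_ratio


def fraction_without_signal(values):
    """Fraction of spots on the array that carry no signal (None = NA)."""
    return sum(v is None for v in values) / len(values)


red_fg, red_bg = [900, 1100, 1000], [200, 300, 250]
print(foreground_background_ratio(red_fg, red_bg))
print(passes_ratio_check(red_fg, red_bg))
print(fraction_without_signal([1.2, None, 0.4, None]))
```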
Gene-level quality control
• Poor hybridization in the reference channel may introduce bias on the fold-change
• Some probes will not hybridize well to the target RNA
• Printing problems: all spots of a given inventory well have poor quality
• A well may be of bad quality – contamination
• Genes with a consistently low signal in the reference channel are suspicious
[Figure: measurements of one gene g across all arrays]
Gene expression data
Rows: genes. Columns: mRNA samples.

         sample1  sample2  sample3  sample4  sample5  …
gene 1      0.46     0.30     0.80     1.51     0.90  ...
gene 2     -0.10     0.49     0.24     0.06     0.46  ...
gene 3      0.15     0.74     0.04     0.10     0.20  ...
gene 4     -0.45    -1.03    -0.79    -0.56    -0.32  ...
gene 5     -0.06     1.06     1.35     1.09    -1.09  ...

Entry (i, j): gene-expression level or ratio for gene i in mRNA sample j


M = log2(red intensity / green intensity), or a function of (PM, MM) from MAS, dChip or RMA
A = average of log2(red intensity) and log2(green intensity), or a function of (PM, MM) from MAS, dChip or RMA
Scatterplot
[Figure: scatterplot of the data on the raw scale and on the log scale]

Message: look at your data on log-scale!


MA Plot
M = log2(R/G)

A = 1/2 log2(RG)
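As a small worked example of the two formulas, the sketch below computes M and A for a few spots. The function name `ma_values` and the toy intensities are assumptions for illustration; intensities are assumed to be positive (e.g. after background correction).

```python
import math


def ma_values(red, green):
    """Return (M, A) for paired red/green spot intensities.

    M = log2(R/G) is the log ratio, A = 1/2 * log2(R*G) is the
    average log intensity. All intensities must be positive.
    """
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


# Spot 1 is four-fold up in red, spot 2 unchanged, spot 3 four-fold down.
m, a = ma_values([1024, 256, 64], [256, 256, 256])
print(m)
print(a)
```

On the log scale, four-fold up and four-fold down are symmetric around 0 (M = 2 and M = -2), which is one reason to look at the data on the log scale.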
Median centering
One of the simplest strategies is to bring all "centers" of the array data to the same level.
Assumption: the majority of genes are unchanged between conditions.
The median is more robust to outliers than the mean.
Divide all expression measurements of each array by the median. On the log scale this means subtracting the median, so that the log signal of each array is centered at 0.
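A minimal sketch of median centering on the log scale, applied to one array at a time. The function name `median_center` and the handling of missing values (`None` for NA, ignored when computing the median) are assumptions for illustration.

```python
from statistics import median


def median_center(log_values):
    """Subtract the array median from every log-scale measurement.

    Missing values (None) are ignored for the median and passed through.
    """
    present = [v for v in log_values if v is not None]
    center = median(present)
    return [None if v is None else v - center for v in log_values]


# One array with a missing spot; after centering the median is 0.
print(median_center([1.0, None, 2.0, 4.0]))
```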
Problem of median-centering
Median-centering is a global method. It does not adjust for local effects, intensity-dependent effects, print-tip effects, etc.
[Figure: scatterplot of log-signals (Log Red vs. Log Green) and M-A plot of the same data after median-centering, with M = Log Red - Log Green and A = (Log Green + Log Red) / 2]

Lowess normalization

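Compute a local estimate of M as a function of A, then use the estimate to bend the banana straight (subtract it from M).
[Figure: M-A plot with local estimate; M = Log Red - Log Green, A = (Log Green + Log Red) / 2]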
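The idea can be sketched in plain Python. This is a simplified lowess (local linear fits with tricube weights, no robustness iterations), not the implementation of any particular package; the names `lowess_fit` and `lowess_normalize` and the default `frac=0.5` are assumptions for illustration.

```python
import math


def lowess_fit(x, y, frac=0.5):
    """Local estimate of y as a function of x, evaluated at every x[i].

    For each point, a weighted straight line is fitted to its nearest
    neighbours (a fraction `frac` of all points) with tricube weights.
    """
    n = len(x)
    k = max(2, math.ceil(frac * n))
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th neighbour
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalize(m, a, frac=0.5):
    """Subtract the local estimate of M given A from every M value."""
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# A toy "banana": M depends on A although no gene is really changed.
a = [float(i) for i in range(10)]
m = [0.05 * (ai - 5.0) ** 2 for ai in a]
print(max(abs(v) for v in lowess_normalize(m, a)))
```

After normalization the intensity-dependent trend is largely gone, so the remaining M values are much closer to 0 than before.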


Summary I
• Raw data are not mRNA concentrations
• We need to check data quality on different levels
– Probe level
– Array level (all probes on one array)
– Gene level (one gene on many arrays)
• Always log your data
• Normalize your data to avoid systematic (non-biological) effects
• Lowess normalization straightens the banana
From data to knowledge
OK, we have now made sure that our data are of high quality and that systematic, non-biological effects are removed.

The result is a gene expression matrix


Rows: genes. Columns: mRNA samples.

         sample1  sample2  sample3  sample4  sample5  …
gene 1      0.46     0.30     0.80     1.51     0.90  ...
gene 2     -0.10     0.49     0.24     0.06     0.46  ...
gene 3      0.15     0.74     0.04     0.10     0.20  ...
gene 4     -0.45    -1.03    -0.79    -0.56    -0.32  ...
gene 5     -0.06     1.06     1.35     1.09    -1.09  ...

Is that already a result? No! It’s just data, not knowledge.


We need to use this data to answer a scientific question.
Supervised analysis
= learning from examples, classification
– We have already seen groups of healthy and sick
people. Now let’s diagnose the next person walking
into the hospital.
– We know that these genes have function X (and these
others don’t). Let’s find more genes with function X.
– We know many gene-pairs that are functionally
related (and many more that are not). Let’s extend
the number of known related gene pairs.
Known structure in the data needs to be
generalized to new data.
Un-supervised analysis
= clustering
– Are there groups of genes that behave similarly in all
conditions?
– Disease X is very heterogeneous. Can we identify
more specific sub-classes for more targeted
treatment?
No structure is known. We first need to find it.
Exploratory analysis.
Supervised analysis
"Calvin, I still don't know the difference between cats and dogs …"
"Don't worry! I'll show you once more:"
[Figure: labeled examples. Class 1: cats, Class 2: dogs]
"Oh, now I get it!!"


Un-supervised analysis
"Calvin, I still don't know the difference between cats and dogs …"
"I don't know it either. Let's try to figure it out together …"
Supervised analysis: setup
• Training set
– Data: microarrays
– Labels: for each one we know if it falls into our class
of interest or not (binary classification)
• New data (test data)
– Data for which we don’t have labels.
– E.g. genes without known function
• Goal: Generalization ability
– Build a classifier from the training data that is good at
predicting the right class for the new data.
One microarray, one dot
Think of a space with #genes dimensions (yes, it's hard for more than 3).
Each microarray corresponds to a point in this space.
If gene expression is similar under some conditions, the points will be close to each other.
If gene expression overall is very different, the points will be far away.
[Figure: arrays as points; axes: expression of gene 1, expression of gene 2]
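To make "close" and "far away" concrete, the sketch below treats each array as a vector of gene expression values and measures the distance between two arrays. Euclidean distance is an assumption here (the slides do not name a distance measure), and the function name is illustrative. The values are the first two sample columns of the expression matrix shown earlier.

```python
import math


def euclidean_distance(array_a, array_b):
    """Distance between two arrays seen as points in gene space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(array_a, array_b)))


# Expression of genes 1..5 in sample1 and sample2.
sample1 = [0.46, -0.10, 0.15, -0.45, -0.06]
sample2 = [0.30, 0.49, 0.74, -1.03, 1.06]

print(euclidean_distance(sample1, sample2))
print(euclidean_distance(sample1, sample1))  # identical arrays: distance 0
```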
Which line separates best?
[Figure: four candidate separating lines, panels A to D]
No sharp knife, but a …
FAT PLANE
Support Vector Machines

Maximal margin separating hyperplane
Datapoints closest to the separating hyperplane = support vectors
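Written as a formula, the maximal margin hyperplane is the solution of the optimization problem below. This is the standard hard-margin formulation and assumes the two classes can be separated by a line. The notation is not from the slides: x_i is array i as a point in gene space, y_i is its class label (+1 or -1), w is the normal vector of the hyperplane and b its offset.

```latex
\min_{w,\,b} \; \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,( w \cdot x_i + b ) \ge 1, \qquad i = 1, \dots, n
```

The margin has width 2 / ||w||, so minimizing ||w|| makes the "fat plane" as fat as possible. The support vectors are the points for which the constraint holds with equality.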
How well did we do?
Training error: how well do we do on the data we trained the classifier on?
But how well will we do in the future, on new data?
Test error: how well does the classifier generalize?
[Figure: same classifier (= line), new data from same classes]
The classifier will usually perform worse than before: test error > training error
Cross-validation

Training error: train the classifier and test it on the same data.
Test error: split the data into a training part and a test part.

K-fold cross-validation (here for K=3):
Step 1.  Train  Train  Test
Step 2.  Train  Test   Train
Step 3.  Test   Train  Train
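The K-fold scheme above can be sketched as follows. The classifier is a deliberately simple stand-in (nearest class mean), not a support vector machine, and all function names are assumptions for illustration. The folds here are contiguous blocks in the given order, so in practice the samples should be shuffled first so that every training part contains both classes.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nearest_centroid_predict(train_x, train_y, point):
    """Stand-in classifier: predict the class whose mean is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(center):
        return sum((p - q) ** 2 for p, q in zip(point, center))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def cross_validated_error(xs, ys, k=3):
    """Fraction of samples misclassified when each fold is held out once."""
    errors = 0
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        for i in test_idx:
            if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
                errors += 1
    return errors / len(xs)


# Six arrays, two genes each, two well-separated classes.
xs = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
ys = ["healthy", "sick", "healthy", "sick", "healthy", "sick"]
print(cross_validated_error(xs, ys, k=3))
```

Every sample is used for testing exactly once and is never part of the training data that predicts it, which is what makes the resulting error an estimate of the test error rather than the training error.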


Summary II
• Supervised and un-supervised learning
… are needed everywhere in biology and medicine
• Microarrays = points in high-dimensional spaces
• Classifiers = lines (hyperplanes) in these spaces
• Support Vector Machines use maximal margin
hyperplanes as classifiers
• Classifier performance: Test error > training error
• Cross-validation is the right way to evaluate
classifier performance
Experimental cycle
• Biological question (hypothesis-driven or explorative)
• Experimental design
• Microarray experiment
• Image analysis
• Pre-processing, normalization (quality measurement: pass / failed)
• Analysis: estimation, testing, clustering, discrimination
• Biological verification and interpretation

"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."
Ronald Fisher
Books
• Terry Speed, "Statistical Analysis of Gene Expression Microarray Data", Chapman & Hall/CRC
• David W. Mount, "Bioinformatics", Cold Spring Harbor
• Giovanni Parmigiani et al., "The Analysis of Gene Expression Data", Springer
• Gentleman, Carey, Huber, "Bioinformatics and Computational Biology Solutions Using R and Bioconductor", Springer
• Pierre Baldi & G. Wesley Hatfield, "DNA Microarrays and Gene Expression", Cambridge
And how do I analyze my own data?

www.r-project.org
www.bioconductor.org
•Open source
•Free
•Easy installation
•Helpful community
•High quality standards
•Regularly maintained and updated
•Tons of documentation
•Every package comes with example
vignettes to walk you through standard
tasks.
Acknowledgements
• I ‘borrowed’ slides from:
Tim Beissbarth, Achim Tresch, Wolfgang Huber,
Ulrich Mansmann, Terry Speed, Jean Yang, Benedikt
Brors, Anja von Heydebreck, Rainer König

• More info on microarray analysis, lectures, tutorials:
http://compdiag.molgen.mpg.de/ngfn/
