Microarray Analysis
•Data pre-processing
•Normalization
•Molecular diagnosis
•Statistical classification
Florian Markowetz
florian@genomics.princeton.edu
From experiment to data
Raw data are not mRNA concentrations
• tissue contamination
• RNA degradation
• amplification efficiency
• reverse transcription efficiency
• hybridization efficiency and specificity
• clone identification and mapping
• PCR yield, contamination
• spotting efficiency
• DNA support binding
• other array manufacturing related issues
• image segmentation
• signal quantification
• “background” correction
Quality control:
Noise and reliable signal
Probe level, array level, gene level
Spot artifacts: "Donuts"
Image analysis software: GenePix, QuantArray, ScanAlyze
Array level quality control
• Problems:
– array fabrication defect
– problem with RNA extraction
– failed labeling reaction
– poor hybridization conditions
– faulty scanner
• Quality measures:
– Percentage of spots with no signal (~30% excluded spots)
– Range of intensities
– (Av. Foreground)/(Av. Background) > 3 in both channels
– Distribution of spot signal area
– Amount of adjustment needed: how much the signals have to be changed to make slides comparable.
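Two of the quality measures above can be computed directly from per-spot intensities. The following is a minimal sketch in Python, assuming one two-color slide given as plain lists of foreground and background intensities per channel plus a per-spot "no signal" flag. The function name, argument layout and dictionary keys are illustrative assumptions, not the output format of any particular image analysis program.

```python
from statistics import mean


def array_quality(fg_red, bg_red, fg_green, bg_green, no_signal_flags):
    """Compute simple array-level quality measures for one two-color slide.

    All arguments are lists with one entry per spot (hypothetical layout).
    """
    # (Av. Foreground) / (Av. Background), separately for each channel
    ratio_red = mean(fg_red) / mean(bg_red)
    ratio_green = mean(fg_green) / mean(bg_green)
    # Fraction of spots flagged as having no signal
    fraction_no_signal = sum(no_signal_flags) / len(no_signal_flags)
    return {
        "fg_bg_ratio_red": ratio_red,
        "fg_bg_ratio_green": ratio_green,
        "fraction_no_signal": fraction_no_signal,
        # Rule of thumb from the list above: ratio > 3 in both channels
        "passes_ratio_check": ratio_red > 3 and ratio_green > 3,
    }
```

A slide whose green channel sits exactly at a ratio of 3 fails the check, because the criterion is a strict inequality.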
Gene-level quality control
• Poor hybridization in the reference channel may introduce bias on the fold-change
• Some probes will not hybridize well to the target RNA
• Printing problems, such that all spots of a given inventory well have poor quality
A = 1/2 log2(RG)
Median centering
One of the simplest strategies is to bring all "centers" of the array data to the same level.
Assumption: the majority of genes are unchanged between conditions.
The median is more robust to outliers than the mean.
Divide all expression measurements of each array by the median.
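Median centering as described above can be sketched in a few lines of Python. The sketch assumes each array is a plain list of positive raw-scale intensities. On log-transformed data the equivalent step would be to subtract the median instead of dividing by it. The function name is an illustrative assumption.

```python
from statistics import median


def median_center(arrays):
    """Divide every measurement of each array by that array's median."""
    centered = []
    for values in arrays:
        m = median(values)
        if m == 0:
            raise ValueError("cannot center an array whose median is zero")
        centered.append([v / m for v in values])
    return centered
```

After centering, every array has median 1, so the "centers" agree while outliers within an array keep their relative size.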
Problem of median-centering
Median-centering is a global method. It does not adjust for local effects, intensity-dependent effects, print-tip effects, etc.
Scatterplot of log-signals
M-A plot of the same data after median-centering
M = log Red - log Green
Local estimate: use the estimate to bend the banana straight
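The idea of estimating the trend locally and subtracting it can be sketched as follows. This is a simplified illustration, not the method used on the slides: it uses a running median of M over spots that are neighbors in A as the local estimate, whereas practical analyses usually fit a smoother such as loess. The function names and the `window` parameter are assumptions made for this example.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return M = log2(R) - log2(G) and A = 1/2 log2(R * G) per spot."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def local_trend(m, a, window=5):
    """Running median of M over the spots closest in rank of A."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    half = window // 2
    trend = [0.0] * len(m)
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        trend[i] = median(m[j] for j in order[lo:hi])
    return trend


def normalize_intensity_dependent(red, green, window=5):
    """Subtract the local trend from M so the M-A cloud centers on M = 0."""
    m, a = ma_values(red, green)
    trend = local_trend(m, a, window)
    return [mi - ti for mi, ti in zip(m, trend)], a
```

Because the trend is estimated separately at each intensity level, a bias that grows or shrinks with A is removed, which a single global median cannot do.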
No sharp knife, but a … FAT PLANE
Support Vector Machines
Maximal margin separating hyperplane
Data points closest to the separating hyperplane = support vectors
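The geometry can be made concrete with a small Python sketch. It does not train a support vector machine. It only takes a given hyperplane w·x + b = 0, computes each point's distance to it, and reports the margin and the closest points. All function names and the tolerance `tol` are illustrative assumptions.

```python
import math


def signed_distance(w, b, x):
    """Signed Euclidean distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_and_closest_points(w, b, points, labels, tol=1e-9):
    """Return the margin and the indices of the points that attain it.

    labels are +1 or -1; a negative margin means some point is misclassified.
    """
    dists = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    margin = min(dists)
    closest = [i for i, d in enumerate(dists) if d - margin <= tol]
    return margin, closest
```

For the maximal margin hyperplane, the points returned as closest are the support vectors, and moving any other point slightly would not change the hyperplane.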
How well did we do?
Training error: how well do we do on the data we trained the classifier on?
But how well will we do in the future, on new data?
Test error: how well does the classifier generalize?
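The gap between the two errors is easy to demonstrate. The sketch below uses a 1-nearest-neighbor classifier, chosen here only because it is short. On distinct training points it always has zero training error, yet it can still misclassify new data. The function names are assumptions made for this example.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Predict the label of x as the label of its closest training point."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[best]


def error_rate(train_x, train_y, eval_x, eval_y):
    """Fraction of evaluation points that the classifier gets wrong."""
    wrong = sum(
        nearest_neighbor_predict(train_x, train_y, x) != y
        for x, y in zip(eval_x, eval_y)
    )
    return wrong / len(eval_x)
```

Calling `error_rate` with the training data as evaluation data gives the training error. Calling it with held-out data gives an estimate of the test error.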
K-fold Cross-validation (here for K=3)
Step 1. Train Train Test
Step 2. Train Test Train
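The rotation of training and test parts can be sketched as below. The example assumes a nearest-centroid classifier as a stand-in for whatever classifier is being evaluated, and it splits the samples into interleaved folds without shuffling or stratification, which a real analysis would usually add. All function names are illustrative assumptions.

```python
def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k interleaved folds."""
    return [list(range(fold, n, k)) for fold in range(k)]


def fit_centroids(xs, ys):
    """Compute the mean vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


def cross_validated_error(xs, ys, k=3):
    """Each fold serves once as the test set; the other folds are for training."""
    wrong = 0
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        centroids = fit_centroids(train_x, train_y)
        wrong += sum(predict(centroids, xs[i]) != ys[i] for i in test_idx)
    return wrong / len(xs)
```

Every sample is used for testing exactly once and never appears in the training data of its own fold, so the returned error estimates how the classifier does on new data.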
Workflow: image analysis, quality measurement, pre-processing, biological verification and interpretation
Books
• Terry Speed, "Statistical Analysis of Gene Expression Microarray Data", Chapman & Hall/CRC
• David W. Mount, "Bioinformatics", Cold Spring Harbor
• Giovanni Parmigiani et al., "The Analysis of Gene Expression Data", Springer
• Gentleman, Carey, Huber, "Bioinformatics and Computational Biology Solutions Using R and Bioconductor", Springer
• Pierre Baldi & G. Wesley Hatfield, "DNA Microarrays and Gene Expression", Cambridge
And how do I analyze my own data?
www.r-project.org
www.bioconductor.org
•Open source
•Free
•Easy installation
•Helpful community
•High quality standards
•Regularly maintained and updated
•Tons of documentation
•Every package comes with example vignettes to walk you through standard tasks.
Acknowledgements
• I ‘borrowed’ slides from: Tim Beissbarth, Achim Tresch, Wolfgang Huber, Ulrich Mansmann, Terry Speed, Jean Yang, Benedikt Brors, Anja von Heydebreck, Rainer König