 
 Journal of Biotechnology 140 (2009) 18–26
An evaluation framework for statistical tests on microarray data
Michael Dondrup, Andrea T. Hüser, Dominik Mertens, Alexander Goesmann
Center for Biotechnology, Bielefeld University, Universitätsstrasse 15, D-33594 Bielefeld, Germany
Article info

Article history:
Received 21 August 2008
Received in revised form 22 December 2008
Accepted 15 January 2009

Keywords:
Microarray; Statistical test; Inference; Evaluation; Comparative study

Abstract

Microarray analysis has become a popular and routine method in functional genomics. It is typical for such experiments to involve a small number of replicates, which causes unreliable estimates of the sample variance. Microarrays have fostered the development of new statistical methods to analyze data resulting from experiments with small sample sizes. In this study, we tackle the problem of evaluating the performance of statistical tests for generating ranked gene lists from two-channel direct comparisons. We propose an evaluation method based on an oligonucleotide microarray with a large number of replicate spots yielding a maximum of 400 replicates per gene. We apply Spearman's rank correlation coefficient to ranked gene lists generated by eight widely used microarray-specific test statistics, which are applied to small random samples. We could show that variance stabilizing methods such as Cyber-T, SAM, and LIMMA can be beneficial for very small sample sizes and that SAM and the t-test provide stronger control of the type I error rate than the other methods. Specifically, we report that for four replicates all methods reach a high to very high correlation with our reference standard.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction
Analysis of the whole transcriptome of an organism using high-density microarrays has become an established method in functional genomics. Routine pipelines to conduct large scale analyses have been set up, which aim to identify differential expression of thousands of genes in parallel for a multitude of experimental influences (Schena et al., 1995; Fodor et al., 1993).

The most basic question to pose within a microarray experiment is to identify a set of reliably affected genes under the influence of known experimental conditions. This question is also relevant for further data-mining steps such as unsupervised and supervised classification, either to restrict the number of genes to a manageable quantity of interesting candidates or as a feature selector.

Replication is an undoubtedly important aspect of microarray experiments, because it allows biological and technical variation to be assessed (Lee et al., 2000; Yang and Speed, 2002; Allison et al., 2006).

A major hurdle for a reliable generation of ranked gene lists is the small number of experimental replications common in gene expression studies, which is mainly due to the costs of arrays and reagents, and in particular the restricted availability of biological material. Small sample sizes can result in unstable estimates of the sample variance. The classical approach to this problem is Student's one-sample t-test, which requires that the data can be assumed to be normally distributed under the null hypothesis. This assumption is however often taken into doubt for microarrays (Hunter et al., 2001; Zhao and Pan, 2003). Wilcoxon's rank-sum test is a non-parametric alternative which makes no assumptions about the distribution, but as a tradeoff has less power to detect real differences (Thomas et al., 2001; Pan, 2002).

New methods such as Cyber-T, LIMMA, SAM, VarMixt, and rank products have been published recently. These methods have been developed with a focus on microarray data, and some in particular try to address the problem of small sample sizes due to limitations in the number of replicates.

Testing for significant expression also has a large impact on downstream analyses, for example when used as a filtering step prior to a cluster analysis or classification. Therefore, it is of primary importance to evaluate statistical inference methods. Several comparative studies exist which make use of ranked gene lists. Pan (2002) made a comparison between three two-sample statistics using the leukemia data of Golub et al. (1999). Troyanskaya et al. (2002) compared a non-parametric version of the t-test, Wilcoxon's rank-sum test, and an ideal discriminator method which involves an ad hoc definition of a theoretically ideal discriminator gene and a Pearson correlation score. In their study, synthetic data and several biological experiments were analyzed using a fixed p-value cutoff instead of ranking genes. Qin et al. (2004) investigated the effect of data transformation and ranking procedures using cDNA microarrays with 10 spike-in genes added to the sample material at defined concentrations. Delmar et al. (2005) introduced the VarMixt method and compared it with the t-test, SAM, and Cyber-T using several synthetic data sets and 63 control data sets from a yeast compendium experiment using two-channel cDNA arrays (Hughes et al., 2000). A spike-in data set with Affymetrix microarrays was also used in the same study to assess false positive and false negative rates. Jeffery et al. (2006) evaluated statistical tests on nine experimental data sets by comparing the content of gene lists of top-scoring candidates. They also applied the methods as a feature selector for classification methods. Their approach was based on the assumption that better feature selection methods yield better results in the consecutive binary classification process.

The major problem for constructing a validation experiment for test procedures that rank genes according to differential expression is the lack of a gold standard with which to compare the results. There is no method to acquire a 'true' ranking of genes in biological experiments, although biological knowledge was applied in the previous studies to identify potential true candidate genes a priori. There is also no benchmark data set against which the performance of test procedures can be evaluated for spotted two-channel microarrays using a direct comparison design. The comparative studies mentioned above are all focussed on single-channel or dual-channel two-sample comparisons.

To address this problem, a new approach to compare the relative merits of ranking methods will be presented, which is based on the application of a novel multi-replicate microarray. Three main aspects will be tested within this study. Firstly, all methods are cross-compared for rank-concordance between the generated gene lists on the full data set. Secondly, all methods will be tested for rank-concordance with a reference standard (RS) with randomly drawn small samples. The central idea behind this is to compare ranked gene lists computed from small samples randomly drawn from a very large data set with the ranks on the large sample. An efficient method should resemble the ranking of the large sample also when presented with a small random sample. Our approach follows the assumption that the common feature of all methods is the generation of p-values and that these p-values can be used to rank genes with respect to their probability of being truly differentially expressed. To compare rank stability, random samples of different sizes ranging from 2 to 20 are drawn and compared with a reference list by Spearman's rank correlation coefficient. Finally, we provide an estimation of the type I error rate for the different methods.

Corresponding author. Tel.: +49 521 106 4827; fax: +49 521 106 6419.
0168-1656/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.jbiotec.2009.01.009
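The rank-concordance comparison outlined in the Introduction can be illustrated with a minimal Python sketch. It computes Spearman's rank correlation coefficient between two lists of p-values, one standing in for the reference standard and one for a small random sample. The function names and the p-values are hypothetical and chosen only for illustration; the study itself used R.

```python
import math

def average_ranks(values):
    """Rank values in ascending order (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman(a, b):
    """Spearman's rho, computed as the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical p-values for five genes (not taken from the study)
reference_p = [0.001, 0.20, 0.03, 0.50, 0.004]      # large-sample reference list
small_sample_p = [0.010, 0.30, 0.02, 0.80, 0.150]   # same genes, small random sample
rho = spearman(reference_p, small_sample_p)
print(round(rho, 3))
```

A value of rho close to 1 means that the small sample reproduces the gene ranking of the reference list.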
2. Materials and methods
 2.1. Experiment design
We used a microarray design specifically developed for the purpose of this evaluation study. Our multi-replicate microarray has many more replicate spots per gene and therefore carries fewer different genes than a normal microarray. This type of array was spotted using 70-mer synthetic oligonucleotides (Operon Biotechnologies GmbH, Cologne, Germany) representing 92 unique coding sequences of Corynebacterium glutamicum and four external controls. The coding sequences mainly represent genes presumably involved in the sulfur metabolism of C. glutamicum (Rückert et al., 2005). The spotting process was managed by the MicroGrid II 600 spotter (BioRobotics, Cambridge, UK) equipped with 48 SMP3 MicroSpotting Pins (TeleChem International, Sunnyvale, CA) as described previously (Hüser et al., 2003).

Each oligonucleotide sequence is represented in 80 replicate spots on each array, comprising eight dilution steps, resulting in a total of eight identical groups with respect to sequence and DNA concentration of 10 replicate spots per gene. Five arrays of this design were available for this project, yielding a total of 400 replicates per gene.

The sample material generated for this experiment was taken from the C. glutamicum ATCC 13032 strains G306 and G304. Both carry triple mutations in the same regulatory genes, genetically engineered by promoter constructs and gene deletion. RNA from the two strains was directly compared in six hybridizations. Both strains were grown under identical conditions on minimal sulfurless medium (MMS) with addition of 1 mmol cysteine and 1 mmol sulfate. Cells were harvested and RNA extracted following the protocols described in Hüser et al. (2003). Signal acquisition was performed with the ScanArray 4000 microarray scanner (PerkinElmer, Boston, MA).

To compare the approximate false positive rate of the methods, a whole-genome microarray design (Brune et al., 2006) was used and hybridized within a self–self comparison experiment. RNA was extracted from C. glutamicum cells in the logarithmic growth phase and the RNA was split and labelled with Cy3 and Cy5 dyes. The correlation between background corrected channel intensities was generally better than 0.95 for the self–self experiments.

Fig. 1. Density scatterplot of the background corrected raw intensities of one of the multi-replicate microarrays as delivered by the ImaGene software (Ch1 denotes Cy3 intensity and Ch2 denotes Cy5 intensity). The plot exposes the tri-partite nature of the data distribution, which is due to the fact that the array represents over-expressed (upper partition), repressed (lower partition), and unchanged genes (smaller central partition). The yellow line represents the main diagonal. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
 2.2. Pre-processing and normalization
All data were processed and normalized using the same methods: images were scanned per channel with equal settings per image; segmentation and quantification of spot intensities was performed with the ImaGene 5.0 software (BioDiscovery Inc., El Segundo, CA); automated flagging of low-quality or low-intensity spots was applied and flagged spots were excluded from the analysis; no manual flagging was applied. Density scatterplots of the raw data (see Fig. 1) exhibit three distinct areas of up-regulated spots, down-regulated spots, and an area of no differential regulation. Statistics such as mean and variance computed for each channel indicate that there were only slight differences in the global features of the self–self experiment and the multi-replicate arrays.

Intensity estimates for both channels were background corrected by subtracting the background intensities estimated by ImaGene. M-values as logarithmic transformed ratios were computed from the background corrected channel intensities for each spot:

$$M = \log_2(I_1 - B_1) - \log_2(I_2 - B_2),$$

and A-values as total intensity measurements:

$$A = \frac{1}{2}\left(\log_2(I_1 - B_1) + \log_2(I_2 - B_2)\right),$$

with $I_{1,2}$ and $B_{1,2}$ denoting the mean signal intensity and mean background intensity of both channels as delivered by ImaGene. The M-values were then normalized using global lowess normalization (Yang et al., 2002).

We used the Shapiro–Wilk test (Shapiro and Wilk, 1965) to test for gene-wise normality of the data. The test was performed separately for each step of dilution, yielding 752 reporters with up to 50 replicates. The statistical ranking methods evaluated in this study are described in the following sections. We used the statistical environment R (R Development Core Team, 2008) for the evaluation and statistical tests and the EMMA2 (Dondrup et al., 2009) software for storage and pre-processing.
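As a minimal illustration of the M- and A-value definitions, the following Python sketch computes both values for a single spot. The function name `ma_values` and the intensity values are hypothetical. Returning `None` for spots whose background corrected intensity is not positive is an assumption of this sketch, not a rule stated in the text, and the lowess normalization step is not included.

```python
import math

def ma_values(i1, b1, i2, b2):
    """Return (M, A) for one spot from mean signal (i1, i2) and mean
    background (b1, b2) intensities of channel 1 and channel 2."""
    s1, s2 = i1 - b1, i2 - b2          # background corrected intensities
    if s1 <= 0 or s2 <= 0:
        return None                    # assumption: log2 is undefined, skip the spot
    l1, l2 = math.log2(s1), math.log2(s2)
    m = l1 - l2                        # log-ratio of the two channels
    a = 0.5 * (l1 + l2)                # mean log-intensity of the two channels
    return m, a

# Hypothetical spot: corrected intensities are 1024 and 256
print(ma_values(1100, 76, 300, 44))
```

For this hypothetical spot the corrected intensities are 2^10 and 2^8, so M is 2 (a four-fold ratio) and A is 9.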
 2.3. Student’s t-test 
Student's t-test is a commonly used method for testing significant changes in small samples. For a direct comparison as in our case, the null hypothesis of the t-test is $H_0: \bar{x} = 0$ ($\bar{x}$ denotes the arithmetic mean of all M-values for a gene). The test statistic is given as

$$t = \frac{\bar{x}}{\sqrt{s^2/n}}, \qquad (1)$$

where $s^2$ denotes the empirical variance estimate given as

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,$$

and $n$ denotes the sample size. A two-sided alternative hypothesis is used, because we want to detect significant up- and down-regulation of genes. Some of the methods described in the following also rely on modified versions of this statistic.
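Eq. (1) can be written down directly. The following is a minimal Python sketch using only the standard library; the function name and the M-values are hypothetical, and the conversion of the statistic into a two-sided p-value is left out.

```python
import math
import statistics

def one_sample_t(m_values):
    """One-sample t-statistic of Eq. (1) for the M-values of a single gene."""
    n = len(m_values)
    mean = statistics.mean(m_values)
    s2 = statistics.variance(m_values)   # empirical variance with n - 1 in the denominator
    return mean / math.sqrt(s2 / n)

# Hypothetical M-values of one gene measured on three replicate spots
t_value = one_sample_t([1.0, 2.0, 3.0])
print(round(t_value, 4))
```

With only a few replicates, `s2` is itself a noisy estimate, which is the weakness that the variance stabilizing methods in the following sections try to address.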
 2.4. Wilcoxon’s rank-sum test 
Wilcoxon's rank-sum test is a non-parametric test (Siegel, 1956), and as such it does not rely on any assumption about the distribution of the data, which is a useful property for possibly non-normally distributed microarray data. To compute the test statistic, the absolute values of the observations are ranked. The sum of the ranks of the positive observations is computed:

$$W^{+} = \sum_{i=1}^{n} \operatorname{rank}(x_i : x_i > 0), \qquad (2)$$

where $n$ is the size of the sample. A p-value for the probability of observing a given rank-sum, or a more extreme value, can be calculated by counting all permutations of $1, \ldots, n$ which result in a rank-sum equal to or greater than $W^{+}$.
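The statistic of Eq. (2) and the counting argument for the p-value can be sketched as follows. This sketch assumes that there are no zero observations and no ties among the absolute values, and it reads the counting step as an enumeration of all 2^n ways of assigning the ranks 1, ..., n to positive or negative observations. The function names and data are hypothetical, and only the upper tail is computed.

```python
from itertools import product

def positive_rank_sum(x):
    """Sum of the ranks (taken on absolute values) of the positive observations.
    Assumes no zeros and no ties among the absolute values."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]))
    rank = {idx: r for r, idx in enumerate(order, start=1)}
    return sum(rank[i] for i in range(len(x)) if x[i] > 0)

def upper_tail_p(x):
    """Exact probability of a positive rank-sum at least as large as the observed one,
    found by enumerating all sign assignments to the ranks 1..n."""
    n = len(x)
    observed = positive_rank_sum(x)
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) >= observed
    )
    return hits / 2 ** n

# Hypothetical M-values of one gene on four replicate spots
m_values = [0.8, 1.1, -0.3, 1.5]
print(positive_rank_sum(m_values), upper_tail_p(m_values))
```

Full enumeration grows as 2^n, so it is only practical for the small sample sizes considered here. With n = 4 the smallest attainable upper-tail probability is 1/16, which illustrates the limited power of the test for very few replicates.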
 2.5. Cyber-T
Baldi and Long (2001) address the problem of small sample sizes and thereby the poor estimate of sample variances. They introduce a Bayesian framework for estimating the variance, which is then introduced into the standard t-test, turning it into a regularized t-test. The authors derive a modified variance estimate:

$$\sigma^2 = \frac{\nu_0 \sigma_0^2 + (n-1) s^2}{\nu_0 + n - 2}, \qquad (3)$$

with $\sigma_0^2$ defined as a background variance, $\nu_0$ as a weight parameter for $\sigma_0^2$, and $s^2$ as the empirical sample variance. The weight parameter can be interpreted as a measure of confidence in the Bayesian estimate of the variance in comparison to the sample variance. $\sigma^2$ can then be used in the standard t-test formula as in Eq. (1), resulting in a regularized t-test. The method requires setting two additional parameters, $\sigma_0^2$ and $\nu_0$. The background variance is computed from a fixed number of other feature measurements from all microarrays. The default implementation uses a window of size $w$ around the measured values with default $w = 101$.
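The effect of Eq. (3) can be seen in a short Python sketch. The background variance and the weight are passed in as plain numbers here. The window-based estimation of the background variance is not implemented, and the function names and all numeric values are hypothetical.

```python
import math
import statistics

def regularized_variance(x, sigma0_sq, nu0):
    """Variance estimate of Eq. (3): a weighted combination of the background
    variance sigma0_sq (weight nu0) and the empirical sample variance."""
    n = len(x)
    s2 = statistics.variance(x)
    return (nu0 * sigma0_sq + (n - 1) * s2) / (nu0 + n - 2)

def regularized_t(x, sigma0_sq, nu0):
    """t-statistic of Eq. (1) with the sample variance replaced by Eq. (3)."""
    n = len(x)
    return statistics.mean(x) / math.sqrt(regularized_variance(x, sigma0_sq, nu0) / n)

# Hypothetical M-values, background variance and weight
m_values = [1.0, 2.0, 3.0]
print(regularized_variance(m_values, sigma0_sq=0.5, nu0=10))
print(regularized_t(m_values, sigma0_sq=0.5, nu0=10))
```

In this example the sample variance is 1.0 and the background variance is 0.5, so the regularized estimate of 7/11 lies between the two and is pulled towards the background value by the large weight.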
 2.6. LIMMA
Smyth (2004) published an approach that combines an empirical Bayesian method with a moderated t-statistic and general linear models. General linear models have the advantage of allowing more complex experimental designs including dye-swaps. The model is not restricted to a simple replicate design or two-sample comparisons. A general linear model that has the form

$$y_g = b_g + X a_g$$

is fitted to expression data for each gene. $y_g = (y_{g1}, \ldots, y_{gn})$ denotes an $n$-dimensional response vector of log-ratios or intensity measurements from single channel microarrays. $X$ is the experiment design matrix, $a_g = (a_{g1}, \ldots, a_{gn})$ denotes the vector of regression coefficients, and $b_g$ the intercept vector. The fitted model parameters are used for the subsequent analysis steps.
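To make the role of the design matrix concrete, the following Python sketch fits the simplest possible case by least squares: one coefficient per gene, a design with entries +1 and -1 for a dye-swap, and the intercept set to zero. These simplifications are assumptions of this sketch. It does not reproduce the LIMMA software or its moderated t-statistic, and the function name and data are hypothetical.

```python
def fit_single_coefficient(y, design):
    """Least-squares estimate of a in y = design * a (one coefficient, no intercept)."""
    sxx = sum(x * x for x in design)
    sxy = sum(x * v for x, v in zip(design, y))
    return sxy / sxx

# Hypothetical log-ratios of one gene on four arrays; arrays 2 and 4 are dye-swapped,
# so their log-ratios carry the opposite sign
log_ratios = [1.2, -0.8, 1.0, -1.0]
design = [1, -1, 1, -1]
coefficient = fit_single_coefficient(log_ratios, design)
residuals = [v - x * coefficient for v, x in zip(log_ratios, design)]
print(coefficient, residuals)
```

The design column flips the sign of the dye-swapped arrays, so all four measurements contribute to one estimate of the log-ratio.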
 2.7. SAM
Significance Analysis of Microarrays (SAM) has been developed to tackle the problem of the unknown distribution of the test statistic and also the problem of small sample sizes (Tusher et al., 2001). Resampling is a technique to estimate an empirical distribution from the data. SAM draws random samples from each group for each gene and re-assigns replicates randomly to groups 1 and 2 (for one-sample experiments, a random sample from the group is multiplied by −1). Under the assumption that only few genes are differentially expressed, a data set resembling an artificial background distribution of the test statistic can be achieved.

SAM uses a modified t-statistic of the form:

$$b = \frac{\bar{x}_1 - \bar{x}_2}{s + s_0}, \qquad (4)$$

where $\bar{x}_1, \bar{x}_2$ denote the group means, $s$ denotes the joint sample standard deviation of both samples, and $s_0$ is a small constant for stabilizing the standard deviation.
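The stabilizing effect of the constant in Eq. (4) can be shown with a small Python sketch. Here the joint standard deviation is taken to be the pooled standard deviation of the two groups, which is one reading of the description above. The published SAM procedure additionally scales this term by a factor that depends on the group sizes and chooses the constant from the data. The function name and all values are hypothetical, and the resampling step is not included.

```python
import math
import statistics

def sam_like_statistic(group1, group2, s0):
    """Statistic of the form of Eq. (4): difference of group means divided by
    (pooled standard deviation + s0). The pooled form is an assumption."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * statistics.variance(group1)
        + (n2 - 1) * statistics.variance(group2)
    ) / (n1 + n2 - 2)
    diff = statistics.mean(group1) - statistics.mean(group2)
    return diff / (math.sqrt(pooled_var) + s0)

# Hypothetical expression values of one gene in two groups
g1 = [2.0, 2.2, 1.8]
g2 = [0.1, -0.1, 0.0]
without_s0 = sam_like_statistic(g1, g2, s0=0.0)
with_s0 = sam_like_statistic(g1, g2, s0=0.1)
print(round(without_s0, 3), round(with_s0, 3))
```

Adding the constant to the denominator shrinks the statistic most strongly for genes with a very small standard deviation, which would otherwise rank highly because of an unstable variance estimate.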
 2.8. VarMixt 
Delmar et al. (2005) have developed a novel approach for getting an improved estimate of the variance by the use of mixtures of distributions. The gene-wise difference statistic is derived from a mixture of normal distributions. The variance is further modeled as a weighted mixture of Gamma distributions; the parameters of the mixture model are estimated from the observed data using an expectation maximization (EM) approach. VarMixt requires the number of variance classes for the given experiment to be defined a priori. In contrast to all other methods, the variance estimates are not deterministic, because they are derived from a mixture model fitted by an EM algorithm. This means that, unlike any other test, multiple test runs on the same data could yield different results for the test statistic and the rank order of genes. Therefore, we have repeated the evaluation for this method and also checked it with different numbers of variance classes.
 2.9. Rank products
Breitling et al. (2004) have proposed rank products as a non-parametric statistic to assess differential expression. The approach is based on the assumption that under the null hypothesis of no differential expression, the probability of observing a gene $g$ at the top ranking position just by chance in an ordered list of $n$ genes is $p^{\mathrm{up}}_{1,g} = 1/n$ for replicate experiment $i$. Given $k$ such replicates and
