M. Dondrup et al. / Journal of Biotechnology 140 (2009) 18–26
ground intensity of both channels as delivered by ImaGene. The M-values were then normalized using global lowess normalization (Yang et al., 2002). We used the Shapiro–Wilk test (Shapiro and Wilk, 1965) to test for gene-wise normality of the data. The test was performed separately for each step of dilution, yielding 752 reporters with up to 50 replicates. The statistical ranking methods evaluated in this study are described in the following sections. We used the statistical environment R (R Development Core Team, 2008) for the evaluation and statistical tests and the EMMA2 software (Dondrup et al., 2009) for storage and pre-processing.
2.3. Student’s t-test
Student’s t-test is a commonly used method for testing significant changes in small samples. For a direct comparison as in our case, the null hypothesis of the t-test is $H_0: \bar{x} = \mu$ ($\bar{x}$ denotes the arithmetic mean of all M-values for a gene). The test statistic is given as

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2/n}}, \tag{1}$$

where $s^2$ denotes the empirical variance estimate given as

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,$$

and $n$ denotes the sample size. A two-sided alternative hypothesis is used, because we want to detect significant up- and down-regulation of genes. Some of the methods described in the following also rely on modified versions of this statistic.
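The statistic of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration and not the R code used in the study; the function name `one_sample_t` and the default hypothesized mean of 0 (a natural choice for log-ratio M-values) are assumptions made here.

```python
import math
from statistics import mean, variance


def one_sample_t(m_values, mu=0.0):
    """t-statistic of Eq. (1) for the M-values of one gene.

    mu is the hypothesized mean; the default 0.0 is an assumption
    (no change in log-ratio), not a value taken from the text.
    """
    n = len(m_values)
    if n < 2:
        raise ValueError("need at least two replicates")
    x_bar = mean(m_values)
    s2 = variance(m_values)  # empirical variance with the 1/(n-1) factor
    if s2 == 0:
        raise ValueError("variance is zero; t is undefined")
    return (x_bar - mu) / math.sqrt(s2 / n)


# Hypothetical M-values of one gene across five replicates
example_m = [1.0, 2.0, 3.0, 4.0, 5.0]
t_example = one_sample_t(example_m)
```

A two-sided p-value would then be read from the t-distribution with n - 1 degrees of freedom, which is omitted here to keep the sketch free of external dependencies.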
2.4. Wilcoxon’s rank-sum test
Wilcoxon’s rank-sum test is a non-parametric test (Siegel, 1956), and as such it does not rely on any assumption about the distribution of the data, which is a useful property for possibly non-normally distributed microarray data. To compute the test statistic, the absolute values of the observations are ranked. The sum of the ranks of the positive observations is computed:

$$W^{+} = \sum_{i=1}^{n} \operatorname{rank}(x_i : x_i > 0), \tag{2}$$

where $n$ is the size of the sample. A p-value for the probability of observing a given rank-sum, or a more extreme value, can be calculated by counting all permutations of $1, \ldots, n$ which result in a rank-sum equal to or greater than $W^{+}$.
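The computation of $W^{+}$ and of an exact p-value can be sketched as follows. This is an illustrative sketch under stated assumptions: the null distribution is enumerated over all $2^n$ assignments of signs to the ranks (the usual exact approach for a statistic of this form), tied absolute values receive average ranks, and no observation is exactly zero. The function names are hypothetical.

```python
from itertools import product


def positive_rank_sum(values):
    """Return (W+, ranks), where ranks are the ranks of |x_i|.

    Tied absolute values get the average of their ranks (an assumption;
    the text does not specify tie handling).
    """
    order = sorted((abs(v), i) for i, v in enumerate(values))
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and order[end + 1][0] == order[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k][1]] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, v in zip(ranks, values) if v > 0)
    return w_plus, ranks


def exact_upper_p(values):
    """Fraction of sign assignments with a rank-sum >= the observed W+."""
    w_plus, ranks = positive_rank_sum(values)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus:
            hits += 1
    return hits / 2 ** len(ranks)


# Hypothetical M-values of one gene
example_values = [0.8, -0.3, 1.2, 0.5, -0.1]
w_example, _ = positive_rank_sum(example_values)
p_example = exact_upper_p(example_values)
```

The enumeration grows as $2^n$, so it is only practical for small samples; larger samples would need a recursive count or a normal approximation.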
2.5. Cyber-T
Baldi and Long (2001) address the problem of small sample sizes and the resulting poor estimates of sample variances. They introduce a Bayesian framework for estimating the variance, which is then introduced into the standard t-test, turning it into a regularized t-test. The authors derive a modified variance estimate:

$$\sigma^2 = \frac{\nu_0 \sigma_0^2 + (n-1) s^2}{\nu_0 + n - 2}, \tag{3}$$

with $\sigma_0^2$ defined as a background variance, $\nu_0$ as a weight parameter for $\sigma_0^2$, and $s^2$ as the empirical sample variance. The weight parameter can be interpreted as a measure of confidence in the Bayesian estimate of the variance in comparison to the sample variance. $\sigma^2$ can then be used in the standard t-test formula as in Eq. (1), resulting in a regularized t-test. The method requires setting two additional parameters, $\sigma_0^2$ and $\nu_0$. The background variance is computed from a fixed number of other feature measurements from all microarrays. The default implementation uses a window of size $w$ around the measured values, with default $w = 101$.
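The regularized variance of Eq. (3) and its use in the t-statistic can be sketched as below. This is not the Cyber-T implementation: the background variance is simply passed in as a number rather than computed from a window of neighbouring features, and the names `sigma0_sq` and `nu0` are labels chosen here for the background variance and its weight.

```python
import math
from statistics import mean, variance


def regularized_variance(sample, sigma0_sq, nu0):
    """Variance estimate of Eq. (3).

    sigma0_sq: background variance (assumed to be supplied by the caller)
    nu0:       weight given to the background variance
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two replicates")
    s2 = variance(sample)  # empirical variance with the 1/(n-1) factor
    return (nu0 * sigma0_sq + (n - 1) * s2) / (nu0 + n - 2)


def regularized_t(sample, sigma0_sq, nu0, mu=0.0):
    """Eq. (1) with s^2 replaced by the regularized variance.

    mu = 0.0 is an assumed hypothesized mean for log-ratios.
    """
    n = len(sample)
    sigma_sq = regularized_variance(sample, sigma0_sq, nu0)
    return (mean(sample) - mu) / math.sqrt(sigma_sq / n)


# Hypothetical values: five replicates, background variance 1.0, weight 10
example_sample = [1.0, 2.0, 3.0, 4.0, 5.0]
var_reg = regularized_variance(example_sample, sigma0_sq=1.0, nu0=10)
t_reg = regularized_t(example_sample, sigma0_sq=1.0, nu0=10)
```

A larger `nu0` pulls the estimate towards the background variance, which is the sense in which the weight expresses confidence in the Bayesian estimate.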
2.6. LIMMA
Smyth (2004) published an approach that combines an empirical Bayesian method with a moderated t-statistic and general linear models. General linear models have the advantage of allowing more complex experimental designs including dye-swaps. The model is not restricted to a simple replicate design or two-sample comparisons. A general linear model of the form

$$y_g = b_g + X a_g$$

is fitted to the expression data for each gene. $y_g = (y_{g1}, \ldots, y_{gn})^T$ denotes an $n$-dimensional response vector of log-ratios or intensity measurements from single-channel microarrays. $X$ is the experiment design matrix, $a_g = (a_{g1}, \ldots, a_{gn})$ denotes the vector of regression coefficients, and $b_g$ the intercept vector. The fitted model parameters are used for the subsequent analysis steps.
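To make the model-fitting step concrete, the sketch below fits the linear model for a single gene by ordinary least squares in the simplest case of one design column plus an intercept. It illustrates only the linear model, not the empirical Bayes moderation of LIMMA, and the dye-swap coding (+1 for the original orientation, -1 for the swapped one) and the function name are assumptions made for this example.

```python
from statistics import mean


def fit_gene(y, x):
    """Least-squares fit of y_j = b + a * x_j for one gene.

    y: log-ratios of the gene on each array
    x: one design column (here a hypothetical +1/-1 dye-swap coding)
    Returns (b, a): intercept and regression coefficient.
    """
    if len(y) != len(x):
        raise ValueError("y and x must have the same length")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xj - x_bar) ** 2 for xj in x)
    if sxx == 0:
        raise ValueError("design column must not be constant")
    a = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(x, y)) / sxx
    b = y_bar - a * x_bar
    return b, a


# Hypothetical dye-swap experiment with four arrays
design_column = [1, -1, 1, -1]
log_ratios = [1.1, -0.9, 1.3, -0.7]
intercept, coefficient = fit_gene(log_ratios, design_column)
```

In this coding the coefficient estimates the expression change and the intercept absorbs a constant dye effect, which shows why a linear model can handle dye-swap designs that a plain replicate t-test cannot.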
2.7. SAM
Significance Analysis for Microarrays (SAM) has been developed to tackle the problem of the unknown distribution of the test statistic and also the problem of small sample sizes (Tusher et al., 2001). Resampling is a technique to estimate an empirical distribution from the data. SAM draws random samples from each group for each gene and re-assigns replicates randomly to groups 1 and 2 (for one-sample experiments, a random sample from the group is multiplied by $-1$). Under the assumption that only a few genes are differentially expressed, a data set resembling an artificial background distribution of the test statistic can be obtained. SAM uses a modified t-statistic of the form:

$$b = \frac{\bar{x}_1 - \bar{x}_2}{s + s_0}, \tag{4}$$

where $\bar{x}_1$, $\bar{x}_2$ denote the group means, $s$ denotes the joint sample standard deviation of both samples, and $s_0$ is a small constant for stabilizing the standard deviation.
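The statistic of Eq. (4) and the random re-assignment of replicates can be sketched as follows. This is a simplified illustration and not the SAM software: the joint standard deviation is taken here as the pooled standard deviation of the two groups (published SAM variants additionally scale it by a factor depending on the group sizes, which is omitted), `s0` is passed in by hand, and all names are hypothetical.

```python
import math
import random
from statistics import mean


def sam_statistic(group1, group2, s0):
    """Eq. (4): difference of group means divided by (s + s0).

    s is computed as the pooled standard deviation of both groups,
    which is an assumption about how the joint deviation is formed.
    """
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    ss = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    s = math.sqrt(ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)


def resampled_null(group1, group2, s0, n_perm, rng):
    """Statistics obtained after randomly re-assigning replicates to groups."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        null.append(sam_statistic(pooled[:n1], pooled[n1:], s0))
    return null


# Hypothetical expression values of one gene in two groups
g1 = [2.0, 2.2, 1.8]
g2 = [0.1, -0.1, 0.0]
b_observed = sam_statistic(g1, g2, s0=0.1)
null_values = resampled_null(g1, g2, s0=0.1, n_perm=200, rng=random.Random(0))
```

Comparing the observed statistic with the resampled values gives an empirical background distribution; with $s_0 > 0$ the denominator stays away from zero even for genes with very small variance.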
2.8. VarMixt
Delmar et al. (2005) have developed a novel approach for getting an improved estimate of the variance by the use of mixtures of distributions. The gene-wise difference statistic for gene $g$ is derived from a mixture of normal distributions. The variance is further modeled as a weighted mixture of Gamma distributions; the parameters of the mixture model are estimated from the observed data using an expectation maximization (EM) approach. VarMixt requires the number of variance classes for the given experiment to be defined a priori. In contrast to all other methods, the variance estimates are not deterministic, because they are derived from a mixture model fitted by an EM algorithm. This means that, unlike for any other test, multiple test runs on the same data could yield different values of the test statistic and a different rank order of genes. Therefore, we have repeated the evaluation for this method and also checked it with different numbers of variance classes.
2.9. Rank products
Breitling et al. (2004) have proposed rank products as a non-parametric statistic to assess differential expression. The approach is based on the assumption that under the null hypothesis of no differential expression, the probability of observing a gene $g$ at the top ranking position just by chance in an ordered list of $n$ genes is $p^{\mathrm{up}}_{1,g} = 1/n$ for replicate experiment $i$. Given $k$ such replicates and