
Vincent A. Voelz

E-mail: vvoelz@gmail.com

Math Bio Boot Camp 2006

University of California at San Francisco, San Francisco, CA 94143

August 30, 2006

Contents

1 Introduction to Statistical tests
  1.1 Hypothesis Testing
  1.2 p-values

2 Parametric Tests
  2.1 Parametric versus Non-parametric Tests
  2.2 The χ²-test
  2.3 The ‘Student’ t-test
  2.4 The two-sample t-test
  2.5 Likelihood Ratios
  2.6 The Kolmogorov-Smirnov test

3 Non-parametric tests
  3.1 Wilcoxon rank-sum test
  3.2 Permutation tests
  3.3 Bootstrapping: resampling with replacement
  3.4 Jackknifing

1 Introduction to Statistical tests

The underlying concept in statistics is quite easy to understand. A statistic is simply a computed value, θ({x_1, x_2, ...}), that characterizes a particular data sample {x_1, x_2, ...}. If we have some model of the theoretical probability distribution of this statistic, P(θ), we can assess its significance by considering the likelihood of our statistic's value.

This process of assessment is surprisingly arbitrary. We can set up any number of conditions to evaluate whether our statistic is significant, and by how much. For example, it is common to declare that if a statistic's value is ±2σ away from the hypothetical mean value expected by random chance, then it is significant. Perhaps ±3σ is a safer tolerance; it's all in how we define "significance".

In classical statistics, the hypothesized distribution of a statistic has usually been derived under the de facto assumption that the underlying variations are normally distributed (a good assumption, it turns out, according to the Central Limit Theorem). The assumption of normally distributed variables leads to the celebrated χ²- and t-distributions. Other kinds of statistics rest on different assumptions, which lead to other kinds of distributions. We may also want to consider the case where we have no model of the underlying distribution of our statistic, and instead wish to empirically construct a probability distribution directly from the sampled data.

1.1 Hypothesis Testing

Hypothesis testing is the rational framework for applying statistical tests.

The main question we usually wish to extract from a statistic is whether the sample data is significant or not. For example, let's say we have a hat with two kinds of numbers in it: some of the numbers are drawn from a standard normal distribution (i.e. σ² = 1) with mean µ = 0, and some of the numbers are drawn from a normal distribution with unit variance but unknown mean.

Now let's say we take a number out of the hat. There are two possible hypotheses:

• H_0: the null hypothesis. The number is from a standard normal distribution with µ = 0.

• H_A: the alternative hypothesis. The number is not from a standard normal distribution with µ = 0.

The art of statistics is in finding good ways of formulating criteria, based on the values of one or more statistics, to either accept or reject the null hypothesis H_0.


Figure 1: Numbers drawn from two different normal distributions, N(µ = 0, σ = 1) and N(?, σ = 1), are thrown into John Chodera's hat.

It should be noted that H_0 and H_A can be almost anything, and as complicated or as simple as we wish. If a hypothesis is stated such that it specifies the entire distribution, we call it a simple hypothesis. Otherwise, we call it a composite hypothesis. As you might imagine, more rigorous tests can be done with simple hypotheses, because they specify the entire distribution, from which probability values can be computed.

Type I and Type II errors. In any testing situation, two kinds of error can occur:

• Type I (false positive). We reject the null hypothesis when it is actually true.

• Type II (false negative). We accept the null hypothesis when it is actually false.

There is no way to completely eliminate both kinds of error. For instance, if we draw the number 4 out of the hat, we may reject the null hypothesis when actually the number was generated from the N(0, 1) distribution hypothesized in H_0, and just happens to be +4σ away from the mean. This would be a Type I error. Similarly, we might draw the number 0 out of the hat and incorrectly accept the null hypothesis when actually it was generated from an N(µ = 3, 1) distribution, and just happens to be −3σ away from its mean. This would be a Type II error.

The probability of committing a Type I error is typically denoted α, and the probability of a Type II error is denoted β:

• α: the probability of making a Type I error (false positive).

• β: the probability of making a Type II error (false negative).

α is often called the significance level of the test. Typically, we fix an accepted level α of Type I error, and go on to find ways of minimizing the level of Type II error, β.


The statistical power of a test is defined as (1 − β). We usually want to maximize the power of our test in order to detect as many significant signals from our data as we possibly can.

1.2 p-values

Let's say we have a statistic x, and the null hypothesis is that x comes from a standard normal probability distribution, P(x) = N(µ = 0, σ² = 1). We want to choose a critical value of x, call it x*, that we can use as a criterion for rejecting the null hypothesis:

• CRITERION: Reject the null hypothesis if x ≥ x*.

To reach a target significance level α that we have chosen beforehand, we need to choose x* such that the probability of observing our statistic's value, or anything greater, is α:

    ∫_{x*}^{∞} P(x) dx = α

For a standard normal distribution, it turns out that if you choose a critical value of x* ≈ +2σ, α ≈ 0.05 will result (strictly, this one-sided criterion gives α = 0.05 at x* ≈ 1.645σ; x* = 1.96σ gives α = 0.05 for the two-sided test).

Figure 2: For a null hypothesis that assumes that our statistic x is drawn from a standard normal distribution, choosing a threshold value of x* ≈ +2σ gives a significance level of about α = 0.05: roughly 95% of the area under P(x) lies to the left of the threshold x*, and the remaining 5% of the area (α = 0.05) lies to the right.

The p-value is the way to test against this significance level α. Suppose you measure your statistic to be x†. Then the p-value for your statistic is:

    p_{x†} = P(x ≥ x† | H_0) = ∫_{x†}^{∞} P(x) dx


If the p-value is less than α, then you might want to consider rejecting the null hypothesis. If the p-value is considerably less than α, then the statistic is highly significant (at least compared to the significance level of the hypothesis test).

(Note: the p-value is not the probability of observing this data, P(x†); rather, it is the probability of observing a value of x that is at least as large as x†, P(x ≥ x†).)

This is the general way most statistical tests are applied. Each statistical test consists of a statistic that you compute, and (usually) a known distribution of this statistic under the null hypothesis. (I say "usually" because in some cases you have to build this distribution empirically from the data; more about this in the next section.) You:

1. pick an appropriate significance level, α, and then

2. calculate (or look up from a table) the p-value of your observed statistic.

3. If the p-value is less than α, then reject the null hypothesis.

In the sections below we will describe a number of different kinds of statistical tests. Keep in mind that they are all used according to this same general recipe.
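To make the recipe concrete, here is a minimal Python sketch for a statistic assumed to be standard-normal under H_0. The helper name `normal_p_value` and the example values are our own, not part of any particular package:

```python
import math

def normal_p_value(x):
    """One-sided p-value P(X >= x) for a standard-normal statistic.

    Uses the identity P(X >= x) = 0.5 * erfc(x / sqrt(2)).
    """
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# The general recipe:
alpha = 0.05                      # 1. pick a significance level
x_observed = 2.3                  # suppose this is our measured statistic
p = normal_p_value(x_observed)    # 2. compute the p-value

# 3. reject H0 if p < alpha
print(f"p = {p:.4f}, reject H0: {p < alpha}")
```

The same three steps apply to every test below; only the statistic and its null distribution change.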

2 Parametric Tests

2.1 Parametric versus Non-parametric Tests

In general, there are two kinds of statistical tests. Classical statistics mostly deals with parametric tests. These are tests which assume some sort of model for the underlying distribution that the sample data is drawn from. Many of the statistical distributions used in these tests assume that the data is drawn from a normal (Gaussian) distribution. Given this assumption, much can be derived about the distribution of the observations themselves.

Non-parametric tests do not assume any kind of underlying probability distribution. This can be very useful in cases where it would be very hard to justify that the data are normally distributed (or if we know it's just plain not true). Many non-parametric tests can be quite powerful simply by considering the rank order of the observations.

2.2 The χ²-test

The χ²-test is a useful statistic for assessing "goodness of fit". Suppose we have N independent variables, x_i, each normally distributed with mean µ_i and variance σ_i². The random variable

    Z = Σ_{i=1}^{N} (x_i − µ_i)² / σ_i²


Parametric tests: assume a model for the underlying distribution of a statistic (often Gaussian); typically deal with distributions of the actual observations.

Non-parametric tests: make no assumptions about the underlying distribution; typically deal with the rank ordering of the observations.

Figure 3: Some differences between parametric and non-parametric tests.

is distributed according to a χ²-distribution with N degrees of freedom. (See the Statistics Primer notes for a detailed description of this.) If you have a set of observations O_i, and a set of expected values E_i for each of your observations, the following quantity also has a distribution that is a very good approximation to a χ²-distribution:

    Z = Σ_{i=1}^{N} (O_i − E_i)² / E_i

You can estimate the statistical significance of your results by looking up p-values for the χ²-distribution.
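As a minimal sketch, the goodness-of-fit statistic is a one-liner in Python; the p-value would still be looked up from a χ² table (or a statistics library, which we do not assume here). The function name and example counts are our own:

```python
def chi_squared_statistic(observed, expected):
    """Goodness-of-fit statistic Z = sum_i (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example: a die rolled 60 times; under H0 each face is expected 10 times.
observed = [8, 9, 12, 11, 6, 14]
expected = [10] * 6
z = chi_squared_statistic(observed, expected)
print(f"Z = {z:.2f}")  # compare against a chi^2 table with 5 degrees of freedom
```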

2.3 The ‘Student’ t-test

The t-distribution is especially useful for testing the statistical significance of sample means and sample variances. Suppose we have a set of N independent, normally-distributed variables x_i with mean µ and variance σ².

Now, if we knew the true variance σ², then it is straightforward to show that the renormalized variable

    Z = (x̄ − µ) / (σ/√N)

is normally distributed with mean 0 and variance 1. But if all we have is a sample of data, the best guess we have for the variance is the sample variance, s²:

    s² = (1/(N − 1)) Σ_{i=1}^{N} (x_i − x̄)²        (1)

In this case, the quantity

    Z = (x̄ − µ) / (s/√N)

is not quite distributed as a standard normal. Instead, it is distributed according to a t-distribution with (N − 1) degrees of freedom. (Again, see the Statistics Primer notes for the functional form of this distribution.)


The t-test uses the sample mean and sample variance to test whether a given sample comes from a normal distribution with a hypothesized mean µ. To use the test, you calculate the statistic Z and find the p-value for it according to the t-distribution.
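The statistic Z (conventionally called t) can be computed directly from the data; here is a minimal sketch with a made-up sample. `one_sample_t` is a hypothetical helper name, and the p-value would still come from a t-table with N − 1 degrees of freedom:

```python
import math

def one_sample_t(data, mu):
    """t statistic Z = (xbar - mu) / (s / sqrt(N)), s the sample std. dev."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)  # eq. (1)
    return (xbar - mu) / math.sqrt(s2 / n)

# Example: does this sample plausibly have mean mu = 5?
sample = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2]
t = one_sample_t(sample, mu=5.0)
print(f"t = {t:.3f} with {len(sample) - 1} degrees of freedom")
```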

2.4 The two-sample t-test

Perhaps the most widely-used application of the t-distribution is to test whether or not two samples come from distributions with the same mean. A similar reasoning as above is used here:

Consider two sets of sample data: a set of N independent, normally-distributed variables x_i with mean µ_x and variance σ_x², and a set of M independent, normally-distributed variables y_i with mean µ_y and variance σ_y².

Again, if we knew the true variance σ² (assumed common to both sets), then it is straightforward to show that the renormalized variable

    Z = [(x̄ − µ_x) − (ȳ − µ_y)] / (σ √(1/N + 1/M))

is distributed as a standard normal. However, since we don't know σ, we must estimate it using the pooled sample variance, S_p²:

    S_p² = [(N − 1)s_x² + (M − 1)s_y²] / (N + M − 2)

where s_x² and s_y² are the sample variances for the x_i and y_i, respectively.

It can then be shown that the quantity

    Z = [(x̄ − µ_x) − (ȳ − µ_y)] / (S_p √(1/N + 1/M))

is distributed as a t-distribution with (N + M − 2) degrees of freedom. Like before, to use the test, you calculate the statistic Z and find the p-value for it according to the t-distribution.

Trivia. The ‘Student’ t-statistic is sometimes called the Guinness t-test, because its inventor, William Sealy Gosset, worked for the Guinness brewery in Dublin, and formulated it to measure the statistical consistency across batches of brewed beer. Because the brewery considered his methods trade secrets, he was forced to publish under a pen name, ‘Student’.
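The pooled two-sample statistic can likewise be sketched in a few lines of Python; the function name and the sample values are illustrative only:

```python
import math

def two_sample_t(x, y):
    """Pooled two-sample t statistic under H0: mu_x == mu_y."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    sx2 = sum((v - xbar) ** 2 for v in x) / (n - 1)
    sy2 = sum((v - ybar) ** 2 for v in y) / (m - 1)
    sp2 = ((n - 1) * sx2 + (m - 1) * sy2) / (n + m - 2)  # pooled variance
    return (xbar - ybar) / math.sqrt(sp2 * (1 / n + 1 / m))

x = [5.1, 4.9, 5.3, 5.0]
y = [4.6, 4.8, 4.5, 4.7, 4.4]
t = two_sample_t(x, y)
print(f"t = {t:.3f} with {len(x) + len(y) - 2} degrees of freedom")
```

Note that the pooled estimate assumes the two groups share a common variance, as in the derivation above.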

2.5 Likelihood Ratios

In the case that both the null hypothesis H_0 and the alternative hypothesis H_A are simple (i.e. they specify the entire probability distribution P(x)), it can be shown that hypothesis tests based on likelihood ratios give the greatest statistical power, (1 − β); this is the Neyman-Pearson lemma.


The likelihood ratio (LR) for a statistic x is defined as:

    LR = p(x | H_A) / p(x | H_0)

In the example shown in Figure 4, when this ratio exceeds 1 (LR > 1), we reject the null hypothesis. This threshold can be adjusted, depending on the desired level of significance, α.

Since the logarithm is a monotonic function, it is often more practical to consider the log-likelihood ratio (LLR). When combined with Bayesian statistics, LR tests can be quite powerful.

Figure 4: Here, H_0 is that a number picked from our hat is a standard normal N(0, 1) (blue), and H_A is that the number is from N(2, 1) (red). We've chosen the point where the likelihood ratio LR = p(x | H_A)/p(x | H_0) ≥ 1 as the best threshold for rejecting the null hypothesis.
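For two simple hypotheses like those of Figure 4 (N(0, 1) versus N(2, 1)), the log-likelihood ratio has a closed form, since the normalization constants cancel. A minimal sketch, with hypothetical function and parameter names:

```python
def log_likelihood_ratio(x, mu0=0.0, mu_a=2.0, sigma=1.0):
    """LLR = log p(x|HA) - log p(x|H0) for two normals with equal sigma.

    With equal variances only the quadratic terms survive:
    LLR = [(x - mu0)^2 - (x - mu_a)^2] / (2 sigma^2).
    """
    return ((x - mu0) ** 2 - (x - mu_a) ** 2) / (2.0 * sigma ** 2)

# Reject H0 when LLR > 0 (equivalently LR > 1); for N(0,1) vs N(2,1)
# this happens exactly at the midpoint x = 1.
for x in [0.5, 1.0, 1.5]:
    print(x, log_likelihood_ratio(x) > 0)
```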

2.6 The Kolmogorov-Smirnov test

The Kolmogorov-Smirnov (K-S) test is used to detect any difference between two distributions. It is somewhat limited, in that it can lack statistical power when sample sizes are very small.

In the K-S test, the null hypothesis H_0 is that a sample variable x comes from a parent distribution P(x), and the alternative hypothesis H_A is that it does not come from P(x).

The K-S statistic is defined as:

    D = max_{−∞ < x < ∞} |S_N(x) − P(x)|

where S_N(x) is the cumulative distribution function (CDF) built up from the sampled data (see Figure 5), and P(x) is here the cumulative distribution hypothesized for H_0.


Alternatively, one may wish to compare two samples directly, by calculating the maximum difference between the two cumulative distribution functions S_{N1}(x) and S_{N2}(x). The same statistic applies:

    D = max_{−∞ < x < ∞} |S_{N1}(x) − S_{N2}(x)|

To a useful approximation, the distribution of the D statistic under H_0 can be calculated, and the p-value for a measured D† is given by:

    P(D ≥ D†) = Q_KS( [√N_e + 0.12 + 0.11/√N_e] · D† )

where N_e = N for the case of a single sample, N_e = N_1 N_2 / (N_1 + N_2) for the two-sample case, and Q_KS is defined as:

    Q_KS(λ) = 2 Σ_{j=1}^{∞} (−1)^{j−1} e^{−2j²λ²}

Figure 5: The K-S statistic D measures the maximum difference between the sample's CDF, S_N(x), and a theoretical CDF, P(x).
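The one-sample D statistic and the Q_KS series above can be sketched directly; this is a minimal illustration (the empirical CDF is a step function, so we check the distance on both sides of each step), with our own function names and a made-up sample:

```python
import math

def ks_statistic(data, cdf):
    """One-sample K-S statistic: max distance between the empirical CDF
    (a step function) and a theoretical CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # just below the step the ECDF is i/n; just above it is (i+1)/n
        d = max(d, abs(cdf(x) - i / n), abs((i + 1) / n - cdf(x)))
    return d

def q_ks(lam, terms=100):
    """Q_KS(lambda) = 2 * sum_{j>=1} (-1)^(j-1) exp(-2 j^2 lambda^2)."""
    return 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                     for j in range(1, terms + 1))

# Example: test a small sample against the uniform CDF on [0, 1].
sample = [0.1, 0.2, 0.25, 0.3, 0.45]
d = ks_statistic(sample, cdf=lambda x: x)
n_e = len(sample)
p = q_ks((math.sqrt(n_e) + 0.12 + 0.11 / math.sqrt(n_e)) * d)
print(f"D = {d:.3f}, approximate p = {p:.3f}")
```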

3 Non-parametric tests

Suppose we have gene expression data for 1000 genes suspected to be involved in breast cancer, from two groups of patients. The first group is 41 patients who metastasized, and the second group is 59 patients who didn't. It would be nice to apply a parametric test like the two-sample t-test for each gene, in order to determine which genes are differentially expressed in the disease.

The problem is that gene expression data is notoriously non-normal and skewed, such that we can't trust the t-test – the assumption of normally-distributed data is not valid. This is especially true if the data set contains outliers (which greatly affect the sample variance).


This is the kind of problem that may benefit from non-parametric tests. We'll use this example to illustrate the following non-parametric tests.

3.1 Wilcoxon rank-sum test

The procedure for calculating the U-statistic for the Wilcoxon rank-sum test (also known as the Mann-Whitney U-test) is conceptually very simple. Consider two sets of data: N samples of x_i and M samples of y_i.

1. First, rank all of the values in the combined set of all N + M samples, from 1 to (N + M).

2. Choose the sample for which the ranks seem to be smaller (the choice is relevant only to ease of computation). Call this "sample 1", and call the other sample "sample 2".

3. Taking each observation in sample 1, count the number of observations in sample 2 that are smaller than it.

4. The total of these counts is the U-statistic.

The U-test is included in most modern statistical packages. Just like the other tests, you must look up the p-value of your measured U-statistic from a table compiled from the U-distribution.
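Steps 3 and 4 above can be sketched in one line of Python (a minimal illustration that ignores ties, which are conventionally counted as 1/2; the function name and data are our own):

```python
def u_statistic(sample1, sample2):
    """U-statistic: for each observation in sample1, count the
    observations in sample2 smaller than it, and sum the counts.
    Ties are ignored for simplicity."""
    return sum(sum(1 for y in sample2 if y < x) for x in sample1)

# Example: expression values for two (hypothetical) patient groups.
group1 = [1.2, 2.3, 2.9]
group2 = [3.1, 3.5, 2.5, 4.0]
u = u_statistic(group1, group2)
print(f"U = {u}")
```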

3.2 Permutation tests

Consider just one particular gene's expression data for the two groups of patients, {x_i}, 1 ≤ i ≤ N and {y_j}, 1 ≤ j ≤ M. Suppose we want to test the null hypothesis:

• H_0: the sample means of the expression data for the two patient groups are drawn from the same distribution.

One way to test this is to construct a probability distribution that corresponds to the null hypothesis, but which is built empirically from the data. This can be done by repeatedly permuting the category labels and compiling a histogram of the resulting sample means x̄ and ȳ.
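A minimal sketch of label permutation for a difference in sample means (the function name, sample values, and the two-sided |difference| convention are our own choices):

```python
import random

def permutation_p_value(x, y, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in sample means.

    Repeatedly shuffles the pooled values, reassigns group labels, and
    counts how often the permuted |mean difference| is at least as
    extreme as the observed one.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n = len(x)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        px, py = pooled[:n], pooled[n:]
        if abs(sum(px) / len(px) - sum(py) / len(py)) >= observed:
            count += 1
    return count / n_perm

x = [4.1, 5.2, 6.3, 5.8]
y = [3.0, 2.8, 3.5, 3.1, 2.9]
print(f"p = {permutation_p_value(x, y):.4f}")
```

Because the null distribution is built from the data itself, no normality assumption is needed.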

3.3 Bootstrapping: resampling with replacement

Sometimes we are very cautious about making any assumptions about the underlying distribution of sampled data. However, one recourse is to make a very weak assumption and construct a hypothetical distribution from the data itself.

Again, consider just one particular gene's expression data for the two groups of patients, {x_i}, 1 ≤ i ≤ N and {y_j}, 1 ≤ j ≤ M. Suppose we want to construct some statistical test based on the sample mean x̄ across the group of patients. A bootstrapped distribution P(x̄) can be constructed by directly sampling with replacement from the set of values {x_i}; likewise, P(ȳ) can be constructed directly by sampling with replacement from {y_j}. These can then be compared to null-hypothesis distributions P_0(x̄) and P_0(ȳ) which have been constructed by sampling with replacement from the combined set {x_i} ∪ {y_j}.

(You can see why they call it "bootstrapping" – it's almost like you're getting something for nothing, like you're "picking yourself up by your bootstraps".)
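A minimal sketch of constructing P(x̄) by resampling with replacement (the helper name, sample values, and the percentile-interval usage are illustrative):

```python
import random

def bootstrap_means(data, n_boot=5000, seed=0):
    """Build an empirical distribution of the sample mean by resampling
    the data with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_boot)]

x = [4.1, 5.2, 6.3, 5.8, 4.9, 5.5]
means = bootstrap_means(x)
# e.g. a simple 95% percentile interval for the mean:
means.sort()
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(f"bootstrap 95% interval for the mean: [{lo:.2f}, {hi:.2f}]")
```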

3.4 Jackknifing

Jackknifed statistics are similar to bootstrapping, but instead of resampling, various subsets of the data are systematically removed, and the distribution of the resulting variation in the statistic is studied. This is especially useful for characterizing how robust a given set of data is in supporting a conclusion. The robustness of expectation values calculated over time series is a good example of this.
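The simplest variant, leave-one-out jackknifing of the sample mean, can be sketched as follows (function name and data are our own):

```python
def jackknife_means(data):
    """Leave-one-out jackknife: recompute the sample mean with each
    observation removed in turn."""
    n = len(data)
    total = sum(data)
    return [(total - x) / (n - 1) for x in data]

x = [2.0, 4.0, 6.0, 8.0]
print(jackknife_means(x))  # the spread of these values reflects robustness
```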

. θ({x1 . . . and some of the numbers are drawn from a standard normal distribution with unknown mean. . it turns out. let’s say we have a hat with two kinds of numbers in it: some of the numbers are drawn from a standard normal distribution (i. The assumption of normally distributed variables leads to the celebrated χ2 .. . . . . . . 1. . . A statistic is simply a computed value. Jackkniﬁng . The main question we usually wish to extract from a statistic is whether the sample data is signiﬁcant or not. . . that characterizes a particular data sample {x1 . . . ..e. according to the Central Limit Theorem). . . Perhaps ±3σ is a safer tolerance. and by how much. . and instead wish to empirically construct a probability distribution directly from the sampled data.4 Bootstrapping: resampling with replacement . . . .}). . This process of assessment is surprisingly arbitrary. . we can assess its signiﬁcance by considering the likelihood of our statistic’s value. . . . We may also want to consider the case where we have no model of the underlying distribution of our statistics. . . . based on the value of one more statistics. We can set up any number of conditions to evaluate whether our statistic is signiﬁcant. . . x2 . .1 INTRODUCTION TO STATISTICAL TESTS 2 3. . . . The number is from a standard normal distribution with µ = 0. . . .}. . which lead to other kinds of distributions. . . it is common to declare that if a statistic’s value is ±2σ away from the hypothetical mean value expected by random chance. • HA : the alternative hypothesis. . . . . . . . For example. . There are two hypotheses that are possible: • H0 : the null hypothesis. . .. . Now let’s say we take a number out of the hat. . . it’s all in how we deﬁne “signiﬁcance”. the hypothesized distribution of a statistic has usually been derived using the de facto assumption that underlying variations are normally distributed (a good assumption. The art of statistics is in ﬁnding good ways of formulating criteria. 
. to either accept or reject the null hypothesis H0 .1 Hypothesis Testing Hypothesis testing is the rational framework for applying statistical tests. x2 . P (θ). If we have some model of the theoretical probability distribution of this statistic. . . Other kinds of statistics have diﬀerent assumptions. . The number is not from a standard normal distribution with µ = 0. . then it is signiﬁcant. .and t-distributions. . In classical statistics. . . . . . . 10 11 1 Introduction to Statistical tests The underlying concept in statistics is quite easy to understand.. . σ 2 = 1) with mean µ = 0. For example. . . .3 3. .

from which probability values can be computed. As you might imagine. we call it a simple hypothesis. In any testing situation. say we draw the number 4 out of the hat. • β: the probability of making a Type II error (false negative). α of Type I error. The probability of committing a Type I error is typically denoted α. This would be a Type II error. Otherwise. . This would be a Type I error. 1) distribution hypothesized in H0 . we might draw the number 0 out of the hat. For instance. and the probability of a Type II error is denoted β. and incorrectly accept the null hypothesis when actually it was generated from a N (µ = 3. Type I and Type II errors. α is often called a signiﬁcance level or sensitivity. more rigorous tests can be done with simple hypotheses. If a hypothesis is stated such that it speciﬁes the entire distribution.1 INTRODUCTION TO STATISTICAL TESTS 3 N (µ = 0. we may reject the null hypothesis when actually it was generated from the N (0. There is no way to completely eliminate both kinds of errors. σ = 1) N (?. we try to ﬁx an accepted level. because they specify the entire distribution. and just happens to be +4σ away from the mean. and as complicated or as simple as we wish. and go on to ﬁnd ways of minimizing the level of Type II error. Similarly. 1) distribution. we call it a composite hypothesis. and just happens to be −3σ away from the mean. We accept the null hypothesis when it’s actually false. Typically. We reject the null hypothesis when it’s actually true. It should be noted that H0 and HA can be almost anything. σ = 1) ? Figure 1: Numbers drawn from two diﬀerent standard normal distributions are thrown into John Chodera’s hat. • Type II (false negative). two kinds of error could occur: • Type I (false positive). • α: the probability of making a Type I error (false positive). β.

call it x∗ . or anything greater. We want to choose a critical value of x.05. and the null hypothesis is that x comes from a standard normal probability distribution P (x) = N (µ = 0.05 Remaining 5% of the area x∗ x Figure 2: For a null hypothesis that assumes that our statistic x is drawn from a standard normal distribution. σ 2 = 1). we need to choose x∗ such that the probability of observing our statistic’s value.1 INTRODUCTION TO STATISTICAL TESTS 4 The statistical power of a test is deﬁned as (1 − β). P (x) 95% of the area is to the left of the threshold ≈ +2σ α = 0. it turns out that if you choose a critical value of x∗ ≈ +2σ. a α = 0. 1. Then the p-value for your statistic is: ∞ px† = P (x ≥ x† |H0 ) = P (x)dx x† .2 p-values Let’s say we have a statistic x.05 will result. choosing a threshold value of x∗ = +2σ gives a signiﬁcance level of about α = 0. To reach a target signiﬁcance level α that we have chosen beforehand. We usually want to maximize the power of our test in order to detect as many signiﬁcant signals from our data as we possibly can. that we can use as a criterion rejecting the null hypothesis: • CRITERION: Reject the null hypothesis if x ≥ x∗. The p-value is the way to test against this signiﬁcance level α. is α: ∞ P (x)dx = α x∗ For a standard normal distribution. Suppose you measure your statistic to be x† .

Keep in mind that they are all used according to this same general recipe. then you might want to consider rejecting the null hypothesis. More about this in the next section. Given this assumption. and (usually) a known distribution of this statistic. there are two kinds of statistical tests. Suppose we have N independent variables. Classical statistics mostly deals with parametric tests. α. If the p-value is less than α.2 PARAMETRIC TESTS 5 If the p-value is less than α. These are tests which assume some sort of model for the underlying distribution that the sample data is drawn from. each normally distributed with mean µi and variance σi .) This is the general way most statistical tests are applied.). Many of the statistical distributions used in these tests assume that the data is drawn from a normal (Gaussian) distribution. pick an appropriate signiﬁcance level. then reject the null hypothesis. calculate (or look up from a table) the p-value of your observed statistic. This can be very useful in cases where it would be very hard to justify that the data are normally-distributed (or if we know it’s just plain not true). which is the null hypothesis. P (x† ). You ﬁrst: 1. much can be derived about the distribution of the observations themselves. (I say “usually” because in some case you have to build this distribution empirically from the data). The random variable N Z= i (xi − µi )2 2 σi . rather it is the probability of observing a value of x that is at least x† or larger. In the sections below we will describe oﬀ number of diﬀerent kinds of statistical tests. xi . 2.1 Parametric Tests Parametric versus Non-parametric Tests In general. P (x ≥ x† ). Each statistical test consists of a statistic that you compute.2 The χ2 -test The χ2 -test is a useful statistic for calculating “goodness of ﬁt”. 3. (Note: the p-value is not the probability of observing this data. Many non-parametric tests can quite powerful simply by considering the rank order of the observations. 
If the p-value is considerably less than α. then the statistic is highly signiﬁcant (at least compared to the signiﬁcance of the hypothesis test). and then 2. Non-parametric tests do not assume any kind of underlying probability distribution. 2 2.

the best guess we have for the variance is the sample variance.2 PARAMETRIC TESTS 6 Parametric Tests Non-Parametric Tests Assumes a model for the underlying distribution of a statistic (often Gaussian) No assumptions about underlying distribution Typically deals with distributions of actual observations Typically deals with the rank ordering of observations Figure 3: Some diﬀerences between parametric and non-parametric tests. normally-distributed variables xi with mean µ and variance σ 2 . s2 : 1 s = N −1 2 N (xi − µ)2 i=1 (1) In this case.) If you have a set of observations Oi .3 The ‘Student’ t-test The t-distribution is especially useful for testing the statistical signiﬁcance of sample means and sample variances. the following value also has a distribution that is a very good approximation to a χ2 -distribution: N (Oi − Ei )2 Z= Ei i You can estimate the statistical signiﬁcance of your results by looking up p-values for the χ2 -distribution. if we knew the true variance σ. (See the Statistics Primer notes for a detailed description of this. But if all we have is a sample of data. Suppose we have a set of N independent. then it is straightforward to show that the renormalized variable Z= x−µ ¯ √ σ/ N is normally distributed with mean 0 and variance 1. Instead. the quantity Z= x−µ ¯ √ s/ N is not quite distributed as a standard normal. 2. it is distributed according to a t-distribution with (N − 1) degrees of freedom (Again. see the Statistics Primer notes for the functional form of this distribution. Now. and a set of expected values for each of your observations Ei . is distributed according to a χ2 -distribution with N degrees of freedom.) .

It can then be shown that the quantity Z= (¯ − µx ) − (¯ − µy ) x y Sp 1/N + 1/M is distributed as a t-distribution with (n + m − 2) degrees of freedom. (1 − β). However. A similar reasoning as above is used here: Consider two sets of sample data: a set of N independent. respectively. 2.e. then it can be shown that hypothesis tests using likelihood ratios give the most amount of statistical power. you calculate the statistic Z and ﬁnd the p-value for it according to the t-distribution.4 The two-sample t-test Perhaps the most widely-used application of the t-distribution is to test whether or not two samples come from a distribution with the same mean. Because he was publishing trade secrets. Again. we must estimate it using the pooled sample variance. Trivia. ‘Student’. The ‘Student’ t-statistic is sometimes called the Guiness t-test. worked for the Guiness brewery in Dublin. Like before. and a set of M independent. . and formulated it to measure the statistical consistency across batches of brewed beer. normally-distributed variables yi with mean µy and variance 2 σy .2 PARAMETRIC TESTS 7 The t-test is a statistical test that uses the sample mean and sample variance to determine whether or not a given sample comes from a normal distribution. then it is straightforward to show that the renormalized variable Z= (¯ − µx ) − (¯ − µy ) x y σ 1/N + 1/M is distributed as a standard normal distribution. since we don’t know this. Sp 2 : Sp 2 = (N − 1)sx 2 + (M − 1)sy 2 n+m−2 where sx 2 and sy 2 are the sample variances for xi and yi . normally-distributed variables xi with mean µx 2 and variance σx . 2. he was forced to use a pen name. To use the test. because its inventor. to use the test. if we knew the true variance σ. William Sealy Gosset. they specify the entire probability distribution P (x)). 
you calculate the statistic Z and ﬁnd the p-value for it according to the t-distribution.5 Likelihood Ratios In the case that both the null hypothesis H0 and the alternative hypothesis HA are simple (i.

. when this ratio surpasses LR > 1. we reject the null hypothesis. LR = "'! p(x|HA ) p(x|H0 ) " #') #'( #'% N (µ = 0. In the K-S test. because it only has statistical power in the context of very small sample sizes.6 The Kolmogorov-Smirnov test The Kolmogorov-Smirnov (K-S) test is used to detect any diﬀerence between two distributions. Since the logarithm is a monotonic function. and HA is that the number is from N (2. 1) (blue). σ 2 = 1) #'! # !! !" # " ! $ % & x Figure 4: Here. the null hypothesis H0 is that a sample variable x comes from a parent distribution P (x).2 PARAMETRIC TESTS 8 The likelihood ratio (LR) for a statistic x is deﬁned as: LR = p(x|HA ) p(x|H0 ) In the example shown in Figure 4. depending on the desired level of signiﬁcance. 1) (red). LR tests can be quite powerful. The K-S statistic is deﬁned as: D= −∞<x<∞ max |SN (x) − P (x)| where SN1 (x) is the cumulative probability distribution (CDF) built up from the sampled data (see Figure 5). and P (x) is the cumulative distribution hypothesized for H0 . often times it is more practical to consider the log-likelihood ratio (LLR). 2. It is somewhat limited. H0 is that a number picked from our hat is a standard normal N (0. We’ve chosen the point where the likelihood ratio (LR) ≥ 1 as the best threshold for rejecting the null hypothesis. α. This can be adjusted. σ 2 = 1) N (µ = 2. and the alternative hypothesis HA is that it does not come from P (x). When combined with Bayesian statistics.

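The one-sample K-S statistic D can be computed directly from the empirical CDF. A minimal sketch in Python (assuming numpy/scipy; the sample is synthetic), with scipy's kstest as a cross-check:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=50)

# The empirical CDF S_N(x) steps at the sorted sample points; the K-S
# statistic D = max |S_N(x) - P(x)| must check both sides of each step.
xs = np.sort(sample)
n = len(xs)
cdf = stats.norm.cdf(xs)                        # hypothesized P(x) under H0
d_plus = np.max(np.arange(1, n + 1) / n - cdf)  # just after each step
d_minus = np.max(cdf - np.arange(0, n) / n)     # just before each step
D = max(d_plus, d_minus)

# scipy computes the same statistic, along with its p-value
D_scipy, p_scipy = stats.kstest(sample, "norm")
print(D, D_scipy, p_scipy)
```

Checking both sides of each CDF step is the detail most often missed when implementing D by hand; the manual value here matches scipy's exactly.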
Alternatively, one may wish to compare two samples directly, by calculating the maximum difference between their two cumulative distribution functions S_{N_1}(x) and S_{N_2}(x). The same statistic applies:

    D = \max_{-\infty < x < \infty} |S_{N_1}(x) - S_{N_2}(x)|

To a useful approximation, the K-S distribution of the D statistic can be calculated, and the p-value for a measured D† is given by:

    P(D \geq D^\dagger) = Q_{KS}\left( \left[ \sqrt{N_e} + 0.12 + 0.11/\sqrt{N_e} \right] D^\dagger \right)

where N_e = N for the case of a single sample, N_e = N_1 N_2 / (N_1 + N_2) for the two-sample case, and Q_KS is defined as:

    Q_{KS}(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 \lambda^2}

Figure 5: The K-S statistic D measures the maximum difference between the sample's CDF, S_N(x), and a theoretical CDF, P(x).

3 Non-parametric tests

Suppose we have gene expression data for 1000 genes suspected to be involved in breast cancer, taken from two groups of patients, in order to determine which genes are differentially expressed in the disease. The first group is 41 patients whose cancer metastasized, and the second group is 59 patients whose cancer did not. It would be nice to apply a parametric test like the two-sample t-test to each gene. The problem is that gene expression data is notoriously non-normal and skewed, such that we can't trust the t-test – the assumption of normally-distributed data is not valid. This is especially true if the data set contains outliers (which greatly affect the sample variance).

This is the kind of problem that may benefit from non-parametric tests. We'll use this example to illustrate the following non-parametric tests.

3.1 Wilcoxon rank-sum test

The procedure for calculating the U-statistic for the Wilcoxon rank-sum test (also known as the Mann-Whitney U-test) is conceptually very simple. Consider two sets of data: N samples {x_i}, 1 ≤ i ≤ N, and M samples {y_j}, 1 ≤ j ≤ M.

1. First, rank all of the values in the combined set of all N + M samples, from 1 to (N + M).

2. Choose the sample for which the ranks seem to be smaller (the choice is relevant only to ease of computation). Call this "sample 1", and call the other sample "sample 2".

3. Taking each observation in sample 1, count the number of observations in sample 2 that are smaller than it.

4. The total of these counts is the U-statistic.

Just like the other tests, you must look up the p-value of your measured U-statistic from a table compiled from the U-distribution. The U-test is included in most modern statistical packages.

3.2 Permutation tests

Consider just one particular gene's expression data for the two groups of patients. Suppose we want to test the null hypothesis:

• H_0: the sample means of the expression data for the two patient groups are drawn from the same distribution.

One way to test this is to construct a probability distribution that corresponds to the null hypothesis, but which is built empirically from the data. This can be done by repeatedly permuting the category labels and compiling a histogram of the sample means x̄ and ȳ.

3.3 Bootstrapping: resampling with replacement

Sometimes we are very cautious about making any assumptions about the underlying distribution of sampled data. However, one recourse is to make a very weak assumption and construct a hypothetical distribution from the data itself. Again, consider just one particular gene's expression data for the two groups of patients, and suppose we want to construct some statistical test based on the sample mean x̄ across the group of patients.

A bootstrapped distribution P(x̄) can be constructed by directly sampling with replacement from the set of values {x_i}; likewise, P(ȳ) can be constructed by sampling with replacement from {y_j}. These can then be compared to null-hypothesis distributions P_0(x̄) and P_0(ȳ), which are constructed by sampling with replacement from the pooled set {x_i} ∪ {y_j}. (You can see why they call it "bootstrapping" – it's almost like you're getting something for nothing, like you're "picking yourself up by your bootstraps".)

3.4 Jackknifing

Jackknifed statistics are similar to bootstrapping, but instead of resampling, various subsets of the data are systematically removed, and the distribution of the resulting variations is studied. This is especially useful in characterizing how robust a given set of data is in supporting a conclusion. The robustness of calculated expectation values over time series is a good example of this.

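The non-parametric recipes above (rank-sum, permutation, bootstrap, jackknife) can be sketched together in a few lines of Python. This assumes numpy/scipy, and the "expression" arrays below are invented, skewed stand-ins for the 41- and 59-patient groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Invented lognormal "expression" data for one gene: 41 metastasized, 59 not
x = rng.lognormal(mean=1.3, sigma=0.8, size=41)
y = rng.lognormal(mean=1.0, sigma=0.8, size=59)

# 3.1 Wilcoxon rank-sum / Mann-Whitney U-test (scipy does the table lookup)
u_stat, u_p = stats.mannwhitneyu(x, y, alternative="two-sided")

# 3.2 Permutation test: repeatedly shuffle the group labels to build the
# null distribution of the difference of sample means
observed = x.mean() - y.mean()
pooled = np.concatenate([x, y])
n_perm = 10_000
null_diffs = np.empty(n_perm)
for k in range(n_perm):
    rng.shuffle(pooled)
    null_diffs[k] = pooled[:41].mean() - pooled[41:].mean()
perm_p = np.mean(np.abs(null_diffs) >= abs(observed))

# 3.3 Bootstrap: distribution of the sample mean, sampling WITH replacement
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(5000)])
boot_se = boot_means.std(ddof=1)

# 3.4 Jackknife: systematically leave one observation out at a time
n = len(x)
jack_means = np.array([np.delete(x, i).mean() for i in range(n)])
jack_se = np.sqrt((n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2))

print(u_stat, u_p, perm_p, boot_se, jack_se)
```

For the sample mean, the jackknife standard error reduces analytically to s_x/√N, which makes a handy sanity check on the bootstrap estimate; the two should agree closely on the same data.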