
**SAMPLES-THE NEXT GENERATION OF MOLECULAR MARKERS**

Authors and affiliations

Andreas Futschik (1,∗) and Christian Schlötterer (2)

(1) : Department of Statistics, University of Vienna, Vienna, Austria

(2) : Institut für Populationsgenetik, Veterinärmedizinische Universität Wien, Wien, Austria.

(∗) : corresponding author.

Genetics: Published Articles Ahead of Print, published on May 10, 2010 as 10.1534/genetics.110.114397

2 NGS EXPERIMENTS WITH POOLED SAMPLES

Abstract

Next generation sequencing (NGS) is about to revolutionize genetic analysis.

Currently NGS techniques are mainly used to sequence individual genomes. Due

to the high sequence coverage required, the costs for population scale analyses

are still too high to allow an extension to non-model organisms. Here, we show

that NGS of pools of individuals is often more eﬀective in SNP discovery and

provides more accurate allele frequency estimates, even when taking sequencing

errors into account. We modify the population genetic estimators Tajima’s π and

Watterson’s θ to obtain unbiased estimates from NGS pooling data. Given the

same sequencing eﬀort, the resulting estimators often show a better performance

than those obtained from individual sequencing. Although our analysis also shows

that NGS of pools of individuals will not be preferable under all circumstances, it

provides a cost eﬀective approach to estimate allele frequencies on a genome-wide

scale.

Next generation sequencing (NGS) is about to revolutionize biology. Through

a massive parallelization, NGS provides an enormous number of reads, which per-

mits sequencing of entire genomes at a fraction of the costs for Sanger sequencing.

Hence, for the ﬁrst time it has become feasible to obtain the complete genomic

sequence for a large number of individuals. For several organisms, including hu-

man, D. melanogaster and A. thaliana, large re-sequencing projects are well on

their way. Nevertheless, despite the enormous cost reduction, genome sequencing

on a population scale is still out of reach for the budget of most laboratories.

The extraction of as much statistical information as possible at cost as low as

possible has therefore already attracted considerable interest. See for instance

Jiang et al. (2009) for the modeling of sequencing errors and Erlich et al. (2009)

for the eﬃcient tagging of sequences.

Current genome-wide re-sequencing projects collect the sequences individual

by individual. In order to obtain full coverage of the entire genome and to have

high conﬁdence that all heterozygous sites were discovered, it is required that

genomes are sequenced at a suﬃciently high coverage. As many of the reads

only provide redundant information, cost could be reduced by a more eﬀective

sampling strategy.

In this report, we explore the potential of DNA pooling to provide a more

cost-eﬀective approach for SNP discovery and genome wide population genetics.

Sequencing a large pool of individuals simultaneously keeps the number of redun-

dant DNA reads low, and provides thus an economic alternative to the sequencing

of individual genomes. On the other hand, more care has to be taken to establish

an appropriate control of sequencing errors. Obviously haplotype information is

not available from pooling experiments, but this will often be outweighed by the

increased accuracy in population genetic inference.

Focusing on biallelic loci, our analysis shows that with suﬃciently large pool

sizes, pooling usually outperforms the separate sequencing of individuals, both

for estimating allele frequencies and inference of population genetic parameters.

When sequencing errors are not too common, pooling seems also to be a good

choice for SNP detection experiments. To avoid the additional challenges en-

countered with individual sequencing of diploid individuals, we compare pooling

with individual sequencing of haploid individuals. See Lynch (2008, 2009) for

a discussion of next generation sequencing of diploid individuals. Our results

for the pooling experiments should be also applicable to a diploid setting, as we

are just merging pools of size 2 to a larger pool in this case, leading to a pool

size of $n = 2n_d$ for $n_d$ diploid individuals. In the methods section, we derive

several mathematical expressions that permit us to compare pooling with sep-

arate sequencing of individuals. These formulas are then applied in the results

section in order to illustrate the diﬀerences in accuracy between the approaches.

A reader who is only interested in the actual diﬀerences under several scenarios

might therefore want to move directly to the results section.

Methods

Throughout, we will consider an individual sequencing project where k indi-

viduals are sequenced each with an expected coverage λ, by which we mean that

any given locus is sequenced λ times on average. For a comparable pooling exper-

iment that involves the same amount of sequencing eﬀort, the expected coverage

will then be kλ, i.e. any particular locus will be read kλ times on average from

the pool consisting of n individuals. In practice, one might for instance sequence

each of the k individuals on a separate Illumina lane with coverage λ. With the

same sequencing eﬀort, the pool could be sequenced on k lanes simultaneously,

leading to a total coverage of kλ.

For the convenience of the reader, we summarize our notation in Table 1.

SNP detection. A SNP is detected at a site if the site is polymorphic, i.e. if

at least two alleles A and a are found in the sequenced sample. We will consider

SNP detection both in the context of pooling experiments and for individual

sequencing. To assess the performance of these two competing scenarios, we will

look both at the power and at the probability of falsely calling a SNP due to

sequencing errors.

Generally speaking, an experimental design that provides high power while

keeping the probability of incorrectly detecting a SNP small, will be preferable.

When individuals are sequenced separately, the probability of sequencing errors

being interpreted as true SNPs can be reduced by a suﬃciently high expected

coverage if the genotype of an individual is inferred by the majority of reads.

Note, that in the case of diploid individuals, the distinction between sequencing

errors and true SNPs is signiﬁcantly more complicated. In pooling experiments, a

simple way to control the probability of falsely detecting SNPs both in the haploid

and in the diploid case is to require a certain minimum number of reads for the

minor allele in order to call a SNP. We extend work by Eberle and Kruglyak

(2000) on SNP detection, and derive both the power and error rates for pooling

experiments and for separate sequencing.

Separate Sequencing of Individuals. Let $M_A$ ($M_a$) denote the number of times allele A (a) is sequenced. Given that exactly $L_A = l$ out of k individuals in the sample have an allele of type A, the probability of detecting polymorphism is equal to the probability of reading at least one of the A and one of the remaining a alleles in the sample. Assuming that for each individual the number of reads at a particular locus is Poisson distributed with parameter λ, the probability of not covering the SNP locus for an individual is exp(−λ). This leads to the probability

$$q_c(l, k, \lambda) := \left(1 - [\exp(-\lambda)]^{l}\right)\left(1 - [\exp(-\lambda)]^{k-l}\right)$$

for getting at least one “A” and one “a” read. Notice that for larger values of λ, this probability is nearly one, except for l = 0 or l = k, where $q_c(l, k, \lambda) = 0$.

Suppose now that our population size N is fairly large and that the relative

frequency of allele A is p in the population. Then, by conditioning on the number

l of A alleles in the sample, the probability of detecting a SNP is approximately

$$q(p, k, \lambda) = \sum_{l=1}^{k-1} q_c(l, k, \lambda) \binom{k}{l} p^{l} (1-p)^{k-l}. \tag{1}$$

For large values of λ, we obtain that

$$q(p, k, \lambda) \approx 1 - p^{k} - (1-p)^{k}. \tag{2}$$
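As a minimal sketch, equations (1) and (2) can be evaluated numerically as follows. The function names are illustrative choices and do not come from the paper; the code only assumes the Poisson read model stated above.

```python
import math

def q_c(l, k, lam):
    """Probability of reading at least one A and one a allele when
    l of the k separately sequenced individuals carry allele A."""
    miss = math.exp(-lam)  # probability that one individual is not covered
    return (1 - miss**l) * (1 - miss**(k - l))

def detect_prob_individual(p, k, lam):
    """Equation (1): SNP detection probability for separate sequencing."""
    return sum(
        q_c(l, k, lam) * math.comb(k, l) * p**l * (1 - p)**(k - l)
        for l in range(1, k)
    )

def detect_prob_limit(p, k):
    """Equation (2): large-coverage approximation."""
    return 1 - p**k - (1 - p)**k
```

For a high coverage such as λ = 30 the two functions agree to many digits, whereas for λ = 1 the value from (1) is visibly smaller than the limit (2).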

We will now derive the probability of wrongly detecting a SNP due to sequencing

errors. A natural way to proceed for individual sequencing is to assume that the

most frequently read base for an individual is the true one. The probability that

this leads to the wrong decision depends on the number of reads available for

the locus under investigation, as well as the probability ε that a single read for

a given base is incorrect and furthermore on whether the errors are independent.

Concerning the dependence of the reading errors, we consider two extreme sce-

narios. The ﬁrst, more pessimistic scenario, assumes complete dependence such

that sequencing errors at a given position always lead to the same incorrect base.

The second assumes independent errors such that each sequencing error leads to

an independently chosen wrong base. In this situation, we assume furthermore

that the three possible wrong bases are chosen with the same probability. We

expect the actual error probabilities to lie somewhere between these scenarios.

For the dependent case, we obtain by conditioning on the (Poisson) number of

reads for an individual at a locus

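$$q_e^{(d)}(k, \lambda, \epsilon) = 1 - \left[1 - \sum_{r \ge 1} \left[\sum_{i > r/2} \binom{r}{i} \epsilon^{i} (1-\epsilon)^{r-i}\right] \frac{\lambda^{r}}{r!} \exp(-\lambda)\right]^{k}. \tag{3}$$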

In the independent case, an error is made by choosing one of the three incorrect bases at random, each with probability ε/3. The probability of falsely detecting a SNP is

$$q_e^{(i)}(k, \lambda, \epsilon) = 1 - \left[1 - \sum_{r \ge 1} 3 \left[\sum_{i > r/2} \binom{r}{i} (\epsilon/3)^{i} (1-(\epsilon/3))^{r-i}\right] \frac{\lambda^{r}}{r!} \exp(-\lambda)\right]^{k}. \tag{4}$$

The resulting error probabilities can be made very small by ensuring a coverage

λ that is large enough. Obviously a more sophisticated rule will be needed when

sequencing diploid individuals.
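The two error probabilities for separate sequencing can be sketched numerically as below. This is an illustration under stated assumptions: the infinite sum over the number of reads r is truncated at a cutoff `r_max` (a hypothetical parameter, chosen here far in the Poisson tail), and the function names are not from the paper.

```python
import math

def _majority_wrong(r, err):
    """P(more than half of r reads show one given wrong base),
    i.e. the Binomial(r, err) tail over i > r/2."""
    return sum(
        math.comb(r, i) * err**i * (1 - err)**(r - i)
        for i in range(r // 2 + 1, r + 1)
    )

def false_snp_individual(k, lam, eps, independent=False, r_max=None):
    """Equations (3) (dependent errors) and (4) (independent errors).
    The sum over r is truncated at r_max; suited to moderate coverage."""
    if r_max is None:
        r_max = int(lam + 12 * math.sqrt(lam) + 30)  # assumption: tail beyond is negligible
    err, n_wrong = (eps / 3, 3) if independent else (eps, 1)
    pois = math.exp(-lam)  # Poisson probability of r = 0 reads
    p_ind = 0.0            # probability of a wrong call for one individual
    for r in range(1, r_max + 1):
        pois *= lam / r    # Poisson probability of exactly r reads
        p_ind += n_wrong * _majority_wrong(r, err) * pois
    return 1 - (1 - p_ind)**k
```

Calling `false_snp_individual(10, 30.0, 0.01)` and `false_snp_individual(10, 2.0, 0.01)` illustrates how strongly a larger coverage λ reduces the error probability.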

Pooling Experiment. We now assume that a pooled sample of size n is sequenced with the same expected total number kλ of reads per locus as for separate sequencing. Let $F^{(P)}(b, \gamma) = \sum_{i=0}^{b} \frac{\gamma^{i}}{i!} \exp(-\gamma)$ denote the probability that a Poisson random variable with parameter γ is at most b. Given a frequency $L_A = l$ of A alleles in the pool, we obtain the probability of reading at least one A and one a allele as

$$\left[1 - F^{(P)}\left(0, \frac{lk\lambda}{n}\right)\right]\left[1 - F^{(P)}\left(0, \frac{(n-l)k\lambda}{n}\right)\right]. \tag{5}$$

Now this leads to the probability of detecting a SNP

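$$\sum_{l=1}^{n-1} \left[1 - F^{(P)}\left(0, \frac{lk\lambda}{n}\right)\right]\left[1 - F^{(P)}\left(0, \frac{(n-l)k\lambda}{n}\right)\right] \binom{n}{l} p^{l} (1-p)^{n-l} \tag{6}$$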

which occurs with a proportion p in the population.

As sequencing errors are common in NGS, they are easily confounded with low

frequency alleles. A common strategy to reduce the high probability of sequencing

errors is to consider only SNPs that are detected in at least b reads. Requiring a

minimum number b of reads in our context, the probability of detecting a SNP

changes to

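$$\sum_{l=1}^{n-1} \left[1 - F^{(P)}\left(b-1, \frac{lk\lambda}{n}\right)\right]\left[1 - F^{(P)}\left(b-1, \frac{(n-l)k\lambda}{n}\right)\right] \binom{n}{l} p^{l} (1-p)^{n-l}. \tag{7}$$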
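A short numerical sketch of the pooled detection probability (7), which reduces to (6) for b = 1, is given below. The helper `poisson_cdf` plays the role of $F^{(P)}$; all names are illustrative, and the direct use of binomial coefficients limits the sketch to pool sizes of up to roughly a thousand.

```python
import math

def poisson_cdf(b, gamma):
    """F^(P)(b, gamma): probability that a Poisson(gamma) variable is at most b."""
    if b < 0:
        return 0.0
    term = math.exp(-gamma)
    total = term
    for i in range(1, b + 1):
        term *= gamma / i
        total += term
    return min(total, 1.0)

def detect_prob_pool(p, n, k, lam, b=1):
    """Equation (7): probability of seeing both alleles in at least b reads each,
    for a pool of size n sequenced at total expected coverage k * lam."""
    total = 0.0
    for l in range(1, n):
        mean_A = l * k * lam / n        # expected number of A reads
        mean_a = (n - l) * k * lam / n  # expected number of a reads
        both = (1 - poisson_cdf(b - 1, mean_A)) * (1 - poisson_cdf(b - 1, mean_a))
        total += both * math.comb(n, l) * p**l * (1 - p)**(n - l)
    return total
```

Comparing `detect_prob_pool(0.05, 100, 10, 3.0, b=1)` with the same call for `b=2` shows the loss in power caused by requiring a second supporting read.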

As with individual sequencing, we again derive the probability of wrongly detect-

ing a SNP under two scenarios for the sequencing errors.

In the dependent scenario, the probability of wrong SNP detection equals the

probability

$$p_e^{(d)}(k, \lambda, \epsilon, b) = \left(1 - F^{(P)}(b-1, \lambda k \epsilon)\right)\left[1 - F^{(P)}(0, \lambda k (1-\epsilon))\right] \tag{8}$$

of making at least b sequencing errors and getting at least one correct read. If the expected number of reads λk is fairly large, the term $1 - F^{(P)}(0, \lambda k (1-\epsilon))$ is very close to one and can be omitted without changing the results much. With independent sequencing errors, an upper bound for the probability of falsely detecting a SNP is given by

$$p_e^{(i)}(k, \lambda, \epsilon, b) = 3\left[1 - F^{(P)}(b-1, \lambda k \epsilon/3)\right]. \tag{9}$$
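The following sketch evaluates (8) and the bound (9) for a given minimum read count b. The function names are illustrative. The bound (9) is returned as stated and is not capped, so it can exceed one when b is small relative to the expected number of errors.

```python
import math

def poisson_cdf(b, gamma):
    """F^(P)(b, gamma): probability that a Poisson(gamma) variable is at most b."""
    if b < 0:
        return 0.0
    term = math.exp(-gamma)
    total = term
    for i in range(1, b + 1):
        term *= gamma / i
        total += term
    return min(total, 1.0)

def false_snp_pool_dependent(k, lam, eps, b):
    """Equation (8): at least b identical errors and at least one correct read."""
    return (1 - poisson_cdf(b - 1, lam * k * eps)) * (
        1 - poisson_cdf(0, lam * k * (1 - eps))
    )

def false_snp_pool_independent_bound(k, lam, eps, b):
    """Equation (9): upper bound under independent errors (may exceed 1)."""
    return 3 * (1 - poisson_cdf(b - 1, lam * k * eps / 3))
```

With k = 10, λ = 30 and ε = 0.01, about three erroneous reads are expected per site, so b = 1 gives almost no protection, while a larger b lowers both quantities sharply.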

Allele frequency inference. We consider a locus with expected relative fre-

quency p in the population. Suppose ﬁrst that the individuals are sequenced

separately with an expected coverage of λ. Then the probability that a speciﬁc

locus is read for J = j of the k individuals is

$$r_{j,k} := \binom{k}{j} (1 - e^{-\lambda})^{j} e^{-(k-j)\lambda}.$$

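Given that reads are available for J = j out of the k individuals, the relative frequency of A alleles is $R_c := M_A/j$. The variance of $R_c$ can be obtained as

$$\mathrm{Var}(R_c) = \mathrm{Var}\left(\frac{M_A}{J}\right) = E\left[\mathrm{Var}\left(\frac{M_A}{J} \Big| J\right)\right] + \mathrm{Var}\left(E\left[\frac{M_A}{J} \Big| J\right]\right).$$

Now given J, $M_A$ is binomial B(J, p) distributed and $\mathrm{Var}\left(\frac{M_A}{J} \big| J\right) = p(1-p)/J$. This leads to

$$E\left[\mathrm{Var}\left(\frac{M_A}{J} \Big| J\right)\right] = E(1/J)\, p(1-p).$$

Furthermore $E\left[\frac{M_A}{J} \big| J\right] = p$ and therefore $\mathrm{Var}\left(E\left[\frac{M_A}{J} \big| J\right]\right) = 0$. Together

$$\mathrm{Var}(R_c) = p(1-p) E(1/J) \ge p(1-p)/k.$$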

We now turn to the pooling experiment, assuming again a population proportion, p, of A alleles. With $L_A$ again denoting the number of A alleles in a pooled sample of size n, we assume $M_A$ ($M_a$) reads of the A (a) allele from this sample. This leads to $M = M_A + M_a$ reads for the site under investigation.

The relative frequency of the A allele estimated from the sample is then given as $R_p = M_A/M$. According to our model M is Poisson Pois(kλ), and with $U = (M, L_A)$, $M_A | U$ is binomial $B(M, L_A/n)$. We again decompose the variance into

$$\mathrm{Var}(R_p) = \mathrm{Var}\left(\frac{M_A}{M}\right) = E\left[\mathrm{Var}\left(\frac{M_A}{M} \Big| U\right)\right] + \mathrm{Var}\left(E\left[\frac{M_A}{M} \Big| U\right]\right).$$

Now $\mathrm{Var}\left(\frac{M_A}{M} \big| U\right) = \frac{1}{M} \frac{L_A}{n} \frac{n - L_A}{n}$ and $E\left[\frac{M_A}{M} \big| U\right] = \frac{L_A}{n}$. Together, we obtain

$$\mathrm{Var}(R_p) = E\left(\frac{1}{M}\right) \frac{n-1}{n}\, p(1-p) + \frac{p(1-p)}{n}.$$

In order to see which experimental setup leads to the smaller variance, we

consider the ratio

$$\frac{\mathrm{Var}(R_p)}{\mathrm{Var}(R_c)} = \frac{E\left(\frac{1}{M}\right) \frac{n-1}{n} + \frac{1}{n}}{E(1/J)}. \tag{10}$$

It is convenient that the ratio does not depend on the population proportion p of A

alleles anymore. For a large enough expected coverage λ we get E(1/J) ≈ 1/k and

E(1/M) ≈ 1/(kλ). Notice that the variance for the pooling experiment increases

when individuals contribute unequal amounts of probe material. According to

our simulations shown in the results section however, this variance component

can be kept small by choosing pools of large enough size.
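The ratio (10) can be evaluated numerically with the sketch below. Since 1/J and 1/M are undefined when no reads are available, the code conditions on at least one covered individual and at least one read; this conditioning is an assumption of the sketch and is not spelled out in the text. Function names are illustrative.

```python
import math

def mean_inverse_binomial(k, q):
    """E(1/J | J >= 1) for J ~ Binomial(k, q)."""
    probs = [math.comb(k, j) * q**j * (1 - q)**(k - j) for j in range(1, k + 1)]
    return sum(pr / j for j, pr in enumerate(probs, start=1)) / sum(probs)

def mean_inverse_poisson(mu, m_max=None):
    """E(1/M | M >= 1) for M ~ Poisson(mu), using a truncated sum."""
    if m_max is None:
        m_max = int(mu + 12 * math.sqrt(mu) + 30)  # assumption: tail beyond is negligible
    term = math.exp(-mu)
    num = den = 0.0
    for m in range(1, m_max + 1):
        term *= mu / m  # Poisson probability of exactly m reads
        num += term / m
        den += term
    return num / den

def variance_ratio(n, k, lam):
    """Equation (10): Var(R_p) / Var(R_c) for pool size n, k individuals, coverage lam."""
    e_inv_m = mean_inverse_poisson(k * lam)
    e_inv_j = mean_inverse_binomial(k, 1 - math.exp(-lam))
    return (e_inv_m * (n - 1) / n + 1 / n) / e_inv_j
```

For n = 100, k = 10 and λ = 30 the result is close to the large-coverage value (n − 1)/(nλ) + k/n, obtained by inserting E(1/J) ≈ 1/k and E(1/M) ≈ 1/(kλ) into (10).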

Allele frequency estimators for pooled samples that also take into account

quality scores of the individual reads have been discussed in Holt et al. (2009).

The computation of variances for these estimators would depend on the speciﬁc

assumptions of a probability model for the quality scores.

Estimating population genetic parameters. Two widely used summary sta-

tistics in population genetics are Tajima’s π and Watterson’s θ. We investigate

the inﬂuence of the two sequencing strategies on the accuracy of these summary

statistics. According to our simulations, both summary statistics show a sig-

niﬁcantly smaller variance for pooled samples. However, in particular for small

pools, the estimators show some bias. The reason for the bias is that multiple

reads of the same sequence are entering the normalizing constant as indepen-

dently sampled sequences, if the estimators are computed in a standard way for

pooled samples. Sequencing errors also lead to bias, and if a minimum minor

allele frequency is required in order to make sequencing errors rare, this needs to

be taken into account. For individual sequencing, the eﬀect of omitting single-

tons has been studied by Knudsen and Miyamoto (2009) as well as Achaz (2008).

Based on the expected values of Tajima’s π and Watterson’s θ, we introduce

modiﬁed normalizing constants that make the resulting estimators unbiased un-

der neutrality. These bias corrected estimators will then be compared with those

obtained from individual sequencing. (See the RESULTS section.)

We ﬁrst derive a bias correction for Tajima’s π and start by considering a

locus for which M reads are available. We do not consider sequencing errors for

the moment, and focus on the bias that is caused by possibly reading the same

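sequence more than once. Let $\Delta_{ij}$ denote the number of differences between the sequences i and j at this locus that are selected randomly with replacement from the pool of n individuals. Now for this locus

$$\begin{aligned} E\hat\theta_\pi &= E\left[\sum_{i \neq j} \Delta_{ij} \Big/ \binom{M}{2}\right] = E\Delta_{IJ} \\ &= E[\Delta_{IJ} \mid I = J]\, P(I = J) + E[\Delta_{IJ} \mid I \neq J]\, P(I \neq J) \\ &= 0 + \theta P(I \neq J) \\ &= \theta\, \frac{n-1}{n}. \end{aligned} \tag{11}$$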

Therefore $\frac{n}{n-1} \hat\theta_\pi$ will be unbiased, if we neglect sequencing errors. Since this bias correction only depends on the size n of the pool and not on the coverage by reads, a bias corrected version of Tajima’s π for the entire sequence can be obtained by adding up individual values of $\hat\theta_{\pi,l}$ for all loci and then multiplying by $\frac{n}{n-1}$, leading to $\hat\theta_\pi^{*} = \frac{n}{n-1} \sum_{l} \hat\theta_{\pi,l}$.

In order to also correct for sequencing errors, two approaches seem feasible. If

an unbiased estimate for the sequencing errors is available, such an estimate could

be used to correct $\hat\theta_\pi^{*}$. Analogous to Achaz (2008, equation (1)) for the standard experimental setup, $\hat\theta_\pi^{*} - 2 \frac{n}{n-1} \hat\mu_{err}$ will be unbiased, if $\hat\mu_{err}$ is an unbiased estimate of the number of reading errors per sequence. Introducing $\hat\mu_{err}$ will obviously add to the variance of the resulting estimator and the overall performance will depend on the accuracy of $\hat\mu_{err}$. Another way to take into account sequencing errors is

to require a minimum minor allele frequency b for including a segregating site

in the analysis, and to ignore sequencing errors subsequently. The idea is that

sequencing errors will be rare if b is suﬃciently large.

Again, we first consider a locus for which the coverage is equal to M. Let $\hat\theta_\pi^{(b)}$ denote the version of Tajima’s π where the minor allele frequency is required to be at least b. Notice that $\hat\theta_\pi^{(b)} = \hat\theta_\pi$ for b = 1. With $K_m$ denoting the number of sites where the derived allele A has frequency m, $\hat\theta_\pi^{(b)}$ may be written as

$$\hat\theta_\pi^{(b)} = \binom{M}{2}^{-1} \sum_{m=b}^{M-b} K_m\, m(M-m)$$

for a locus for which M reads are available (see section 1.4 in Durrett (2008)).

Let $c_n = \sum_{i=1}^{n-1} i^{-1}$, and let furthermore $X_M$ denote the number of A alleles among the reads, and $Y_n$ the number of A alleles in the pool. Then

$$P(X_M = m \mid Y_n = r) = \binom{M}{m} (r/n)^{m} (1 - r/n)^{M-m}$$

and under neutrality $P(Y_n = r) = r^{-1}/c_n$. With $c_n \theta$ being the expected number of segregating sites in the pool,

$$E(\hat\theta_\pi^{(b)}) = \binom{M}{2}^{-1} c_n \theta \sum_{m=b}^{M-b} \sum_{r=1}^{n-1} m(M-m)\, P(X_M = m \mid Y_n = r)\, P(Y_n = r). \tag{12}$$

For b = 1, straightforward calculations reproduce (11), i.e.

$$E(\hat\theta_\pi^{(1)}) = \theta\, \frac{n-1}{n}.$$

For b > 1 the sum does not simplify much, but can be computed and turned into

the bias correction factor

$$\binom{M}{2} \left[\sum_{m=b}^{M-b} \sum_{r=1}^{n-1} m(M-m)\, P(X_M = m \mid Y_n = r)\, r^{-1}\right]^{-1}.$$

However, an accurate approximation for (12) can be obtained by assuming that

n is large compared to M. In this case

$$\sum_{r=1}^{n-1} P(X_M = m \mid Y_n = r)\, P(Y_n = r) \approx c_n^{-1}\, \frac{1}{m}$$

for 1 ≤ m ≤ M − 1 and therefore

$$E(\hat\theta_\pi^{(b)}) \approx \theta\, \frac{M - 2b + 1}{M - 1}. \tag{13}$$

For b > 1, the resulting simple bias correction factor $\frac{M-1}{M-2b+1}$ turns out to provide very good approximations, even if the pool size n is only moderately larger than the number of reads M. Indeed, if singletons are omitted (b = 2), then the relative error is only 0.4% when M = 10 and n = 20. For n = 200 and M = 50, the error drops to 0.02% for b = 2 and $4 \cdot 10^{-5}$% for b = 3. Summarizing, we propose the following bias corrected version of Tajima’s π, where the factor for b > 1 is the reciprocal of the expected value ratio in (13):

$$\hat\theta_\pi^{(b)*} = \begin{cases} \frac{n}{n-1}\, \hat\theta_\pi & \text{for } b = 1, \\ \frac{M-1}{M-2b+1}\, \hat\theta_\pi^{(b)} & \text{for } b > 1. \end{cases} \tag{14}$$
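To check how close the approximation (13) is to the exact expectation (12), the double sum can be evaluated directly. The sketch below uses $P(Y_n = r) = r^{-1}/c_n$, so that $c_n$ cancels; the function names are illustrative.

```python
import math

def expected_pi_ratio(M, n, b):
    """E(theta_pi^(b)) / theta from equation (12) for coverage M and pool size n."""
    total = 0.0
    for r in range(1, n):
        q = r / n  # frequency of allele A in the pool
        for m in range(b, M - b + 1):
            binom = math.comb(M, m) * q**m * (1 - q)**(M - m)
            total += m * (M - m) * binom / r
    return total / math.comb(M, 2)

def pi_ratio_approx(M, b):
    """Approximation (13), valid when n is large compared to M."""
    return (M - 2 * b + 1) / (M - 1)
```

For b = 1 the exact ratio equals (n − 1)/n as in (11), and for M = 10, n = 20, b = 2 the two functions differ by well under one percent, in line with the relative error quoted above.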

To obtain an overall estimate based on L loci with possibly unequal coverage $M_l$ (1 ≤ l ≤ L), simply take the sum over the individually bias corrected estimates

$$\hat\theta_\pi^{(b)*} = \sum_{l=1}^{L} \hat\theta_{\pi,l}^{(b)*}. \tag{15}$$

Dividing $\hat\theta_\pi^{(b)*}$ by the total length of the considered sequence, an estimator for the scaled mutation parameter per base results.
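A minimal sketch of how the per-locus correction and the sum over loci might be computed from read counts follows. It treats every segregating site as its own locus with coverage M and derived-allele read count m, and for b > 1 it uses the factor (M − 1)/(M − 2b + 1) implied by the approximation (13). The input format and function name are assumptions made for illustration.

```python
import math

def corrected_tajima_pi(sites, n, b=1):
    """Bias corrected Tajima's pi summed over sites.

    sites: iterable of (m, M) pairs, where m is the number of reads showing
           the derived allele and M is the coverage at that site.
    n:     pool size (number of pooled chromosomes).
    b:     minimum required minor allele count among the reads.
    """
    total = 0.0
    for m, M in sites:
        # Skip sites with too little coverage or a minor allele count below b.
        if M < 2 * b or not (b <= m <= M - b):
            continue
        pi_site = m * (M - m) / math.comb(M, 2)
        if b == 1:
            factor = n / (n - 1)            # correction from (11)
        else:
            factor = (M - 1) / (M - 2 * b + 1)  # reciprocal of the ratio in (13)
        total += factor * pi_site
    return total
```

Dividing the returned value by the sequence length gives a per-base estimate, as described above.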

We now derive a bias correction for Watterson’s estimator, again ﬁrst focusing

on a locus with coverage M. We consider a version of Watterson’s estimator that

requires a minimum minor allele frequency b. For b = 1 we use all segregating

sites, and versions that protect against sequencing errors can be obtained by

choosing b > 1. Let $S_b$ denote the number of segregating sites found in the M sequence reads from the pool for which the minor allele frequency is at least b. Then

$$\hat\theta_W^{(b)} := \frac{S_b}{\sum_{i=1}^{M-1} 1/i} \tag{16}$$

provides protection against sequencing errors, if b is large enough. Analogous to (12), we obtain that conditional on the number of reads M for the locus

$$E(\hat\theta_W^{(b)} \mid M) = \frac{c_n}{c_M}\, \theta \left[\sum_{m=b}^{M-b} \sum_{r=1}^{n-1} P(X_M = m \mid Y_n = r)\, P(Y_n = r)\right]. \tag{17}$$

Let $F^{(B)}(x, M, p)$ denote the probability P(X ≤ x) for a binomial random variable X with M trials and success probability p. In particular for p = r/n,

$$F^{(B)}(x, M, r/n) = \sum_{i=0}^{x} \binom{M}{i} \left(\frac{r}{n}\right)^{i} \left(1 - \frac{r}{n}\right)^{M-i}.$$

Recall furthermore that $c_M = \sum_{i=1}^{M-1} i^{-1}$. Then a bias corrected version of $\hat\theta_W^{(b)}$ for b ≥ 1 is given as

$$\hat\theta_W^{(b)*} = \hat\theta_W^{(b)}\, \frac{c_M}{\sum_{r=1}^{n-1} \left[F^{(B)}(M-b, M, r/n) - F^{(B)}(b-1, M, r/n)\right] \frac{1}{r}}. \tag{18}$$
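The correction in (18) can be sketched as follows. The helper names are illustrative, and the code requires M ≥ 2b so that the range of admissible read counts b ≤ m ≤ M − b is not empty.

```python
import math

def binom_cdf(x, M, p):
    """F^(B)(x, M, p): P(X <= x) for X ~ Binomial(M, p)."""
    if x < 0:
        return 0.0
    return sum(
        math.comb(M, i) * p**i * (1 - p)**(M - i) for i in range(min(x, M) + 1)
    )

def harmonic(M):
    """c_M = sum_{i=1}^{M-1} 1/i."""
    return sum(1 / i for i in range(1, M))

def watterson_correction(M, n, b=1):
    """Factor multiplying the uncorrected estimator in equation (18)."""
    if M < 2 * b:
        raise ValueError("coverage M must be at least 2b")
    denom = sum(
        (binom_cdf(M - b, M, r / n) - binom_cdf(b - 1, M, r / n)) / r
        for r in range(1, n)
    )
    return harmonic(M) / denom

def corrected_watterson(S_b, M, n, b=1):
    """Bias corrected Watterson estimate for one locus with S_b segregating sites."""
    return S_b / harmonic(M) * watterson_correction(M, n, b)
```

As a sanity check, for a pool of size 2 read twice (M = n = 2, b = 1) a segregating site in the pool is seen as segregating among the reads with probability 1/2, and the factor is accordingly 2. For n much larger than M and b = 1 the factor approaches one.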

As with Tajima’s π, $\hat\theta_W^{(b)*}$ can be easily adapted to work with longer sequences. For this purpose, partition the sequence into L loci such that for each locus a constant number of reads $M_l$ is available, and obtain the bias corrected Watterson estimate $\hat\theta_{W,l}^{(b)*}$ separately for each locus l. Then

$$\hat\theta_W^{(b)*} = \sum_{l=1}^{L} \hat\theta_{W,l}^{(b)*} \tag{19}$$

provides an estimate of the overall scaled mutation parameter. Dividing $\hat\theta_W^{(b)*}$ by the total length of the considered sequence, an estimator for the scaled mutation parameter per base results.

Results

SNP Detection.

For many biological applications SNP genotyping provides a cost eﬀective ap-

proach, and SNP discovery is the ﬁrst step required. We compared the eﬃciency

of SNP discovery using an approach in which each individual is sequenced sepa-

rately with a pooling approach. The panels of Figure 1 show that the comparative

eﬃciency of pooling depends both on the expected coverage and on the minimum

number of reads for allele calling used for error protection. While pooling experiments provide a higher probability of SNP detection in most cases, pooling is expected to be less efficient if both the coverage is small and a high minimum number of reads is required. This is not entirely unexpected, since an increased number

of reads required for the inference of the minor allele reduces the probability of

detecting SNPs in a pooling experiment. The higher the expected coverage, the

more ineﬃcient individual sequencing becomes. As long as not chosen too small,

the size of the pool seems to play a less important role. Figure 2 addresses the

problem of wrongly identifying a sequencing error as a SNP. Irrespective of the

assumed model of sequencing errors (see methods section for further details), a

high probability of sequencing errors makes SNP calling from pools highly unre-

liable. On the other hand, if sequencing error rates are reduced (e.g. by quality

ﬁltering), a suitable lower bound on the minimum allele frequency for detecting

a SNP makes pooling very reliable for the identiﬁcation of SNPs. Interestingly,

in some cases, we found pooling to result in fewer erroneous SNP calls than

individual sequencing.

Allele frequency inference.

In population genetics, the allele frequency spectrum is of central interest. Es-

timating the allele frequency spectrum of a population is subject to sampling

variation. In an individual based sequencing strategy, most of the sampling vari-

ation comes from the selection of individuals used for DNA sequencing. The

advantage of the pooling approach is that this sampling error can be dramati-

cally reduced by including a large number of individuals in the pool. On the

other hand, a second level of sampling error arises in the pooling approach from

the fact that not all chromosomes in the pool are sequenced and some chromo-

somes may be sequenced more than once. We start by discussing the situation

where individuals contribute equal amounts of probe material and refer to the

last paragraph of the section for the case when this assumption is violated.

In the methods section, we obtained expression (10) for the ratio of the variances of the estimated relative allele frequency both for a pooling experiment $R_p$ (pool size n), and a classical experiment with individual sequencing $R_c$. For a large enough expected coverage λ and with k individuals sequenced, this equation can be approximated by the following quick rule of thumb: Pooling will lead to a smaller variance for those experimental setups that satisfy $1/\lambda + k/n < 1$ or equivalently $n/(n-k) < \lambda$. Thus a case where pooling provides a better estimate of the allele frequency is when the pool contains more than twice the number of separately sequenced individuals and the coverage λ per separately sequenced individual is at least two. For larger pools smaller values of λ will be sufficient.

So far we compared the individual based and pooling strategy only for the same

number of sequenced reads. Alternatively, the superiority of the pooling approach

could be expressed by the reduction of sequencing costs. Figure 3 compares the

pooling approach to sequencing of individuals when both methods provide the

same accuracy for allele frequency estimates. Suppose that k individuals are

sequenced separately, each at an expected coverage λ. Then $k^{*}$ indicates the cost in single genome sequencing equivalents that results in the same accuracy as sequencing k genomes individually. If, for instance, k = 20 and $k^{*} = 10$, then

pooling would give the same accuracy with half the sequencing eﬀort, correspond-

ing to an individual sequencing project with 10 instead of 20 individuals. Figure 3

clearly indicates that larger pool sizes increase the advantage of sequencing pools.

A higher sequence coverage (λ) for sequencing of individuals further improves the

cost eﬀectiveness of pooling.

In genome-wide association studies (GWA’s), the association between allele

frequencies and traits (diseases) is investigated. A possible approach is to test

whether alleles have diﬀerent frequencies in two pools that diﬀer with respect to

the trait of interest (see Sham et al. (2002)). Since the ratio of variances (10)

does not depend on the allele frequencies in the sub-populations, the standard

deviation entering the test statistic will diﬀer by the square root of (10) between

a pooling and a classical experimental setup. If the square root of (10) is 1/2

(say), the shift of the expected value of the test statistic under the alternative

will be twice as large in a pooling experiment. Overall, pooling will be the more powerful approach whenever the variance ratio is smaller than one (see (10)). It

should be noted however that the variance of the pooling experiment will become

larger if individuals contribute unequal amounts of probe material. This issue

will be addressed in the last paragraph of this section.

Estimating population genetic parameters.

We now compare the estimation of the scaled mutation parameter using Watter-

son’s θ and Tajima’s π under our two experimental setups. For this purpose, we

simulated 100 samples under neutrality with mutation parameter θ = 10, using

the ms software (Hudson, 2002).

(./ms 500 100 -t 10 > ms.out)

where 500 is the number of sequences generated. For separate sequencing, we took

random sub-samples of size k = 10 from each sample, thus simulating separate

sequencing of 10 individuals each with an expected number λ of reads. With pooling, we took samples of size n out of the 500 simulated sequences. From this pool, reads were taken independently for each locus l by making a random number of draws $M_l$ with replacement. The quantities $M_l$ were chosen according to a Poisson distribution with expected value kλ.
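The read sampling step for one locus can be sketched as below. This is not the authors' simulation code: the representation of a locus as a list of 0/1 alleles, the function names, and the use of Knuth's multiplication method for the Poisson draw are assumptions made for illustration.

```python
import math
import random

def sample_poisson(mu, rng):
    """Draw a Poisson(mu) variate with Knuth's multiplication method.
    Suitable for moderate mu, as long as exp(-mu) does not underflow."""
    limit = math.exp(-mu)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def sample_pool_reads(pool_column, k, lam, rng):
    """Simulate reads at one locus of a pool.

    pool_column: list of 0/1 alleles, one entry per pooled sequence.
    Returns (number of reads showing allele 1, total number of reads M_l),
    with M_l ~ Poisson(k * lam) and reads drawn with replacement.
    """
    M = sample_poisson(k * lam, rng)
    reads = [rng.choice(pool_column) for _ in range(M)]
    return sum(reads), M
```

A call such as `sample_pool_reads([0, 1, 1, 0], 10, 3.0, random.Random(1))` returns the allele count and coverage for a pool of four sequences at a total expected coverage of 30.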

For Tajima’s π, we used the bias correction (14) for individual loci and added the estimates across loci using (15). For Watterson’s θ, the bias has been corrected using formula (18) for each locus.

Neglecting sequencing errors for the moment, it turns out that the pooling

approach with bias correction leads to more accurate estimates of θ and π, pro-

vided that the size of the pool is large enough. For small pools, multiple reads of

the same chromosome become more common, which aﬀects the accuracy of the

estimates negatively. (Figure 5.)

We now investigate the pooling approach when including a protection against sequencing errors by removing all segregating sites where the minor allele frequency x satisfies x = 1 or, alternatively, x ≤ 2. Again, the normalizing constants have been adapted in order to avoid bias. Let b denote the minimum required minor allele frequency.

Figure 5 shows the relative advantage of pooling conditional on diﬀerent min-

imum minor allele frequencies. Pooling still leads to a decreased variance under

neutrality as long as the pool size is large enough. Not unexpectedly, the reduction in variance is now somewhat smaller for Watterson’s $\hat\theta_W^{(b)*}$. The increase in the variance of Tajima’s $\hat\theta_\pi^{(b)*}$ is much smaller, since frequency one minor alleles receive a low weight in the calculation of π.

Unequal amounts of probe material. One obvious source of error in the

pooling approach is the heterogeneity in DNA amounts due to measurement

errors. In experiments that rely on PCR ampliﬁcation, the heterogeneity can

be expected to be particularly strong.

Individuals for which a larger DNA amount has been included in the DNA

pool, will be over-represented, which potentially causes a change in allele fre-

quency estimates. This aﬀects the bias and the variance also for our considered

population genetic summary statistics.

To investigate the sensitivity of population genetic estimates based on pool-

ing experiments, we simulated a scenario involving unequal amounts of probe

material. We set the expected amount of probe material to one, and allowed for

log-normally distributed multiplicative deviations from this expected value. More

specifically, the deviation factors have been chosen independently for each individual contributing to the pool according to $\exp(X_i)$, where $X_i$ (1 ≤ i ≤ n) are normal N(0, log(scale)) random variables. Thus the median amount of probe material is always equal to one. If the deviation factor has a value of $\exp(X_i) = 1.5$, this means that the respective individual will have a 50% higher chance of being sequenced, than another with a factor of $\exp(X_i) = 1$. Similarly, a value of 0.8 means a 20% decreased chance of being read.
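The heterogeneity model can be sketched as follows. The code assumes that log(scale) is the standard deviation (not the variance) of the normal variables; the function names and the 0/1 allele representation are illustrative and not taken from the paper.

```python
import math
import random

def probe_weights(n, scale, rng):
    """Log-normal deviation factors exp(X_i) with X_i ~ N(0, sd = log(scale)).
    Assumption: log(scale) is used as the standard deviation."""
    sigma = math.log(scale)
    return [math.exp(rng.gauss(0.0, sigma)) for _ in range(n)]

def sample_reads_weighted(pool_column, weights, M, rng):
    """Draw M reads with replacement; each pooled sequence is read with
    probability proportional to its deviation factor.
    Returns the number of reads showing allele 1."""
    reads = rng.choices(pool_column, weights=weights, k=M)
    return sum(reads)
```

Under this reading of the model and scale = 2, roughly 32% of the factors lie outside the interval [1/2, 2], i.e. differ at least twofold from the median.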

In our first scenario (scale = 2), slightly more than 30% of all individuals

diﬀered at least twofold from the median. In other words, for a pool of size

n = 100, the most abundant individual contributed about sixteen times the probe

material of the least abundant individual. We also simulated a more extreme

scenario (scale = 8) where about 30% of the individuals diﬀered at least eightfold

from the median. As further parameters we chose λ = 30, k = 10, n ∈ [5, 200].

As the amount of heterogeneity in the sample will usually be unknown, we

applied the same bias correction as for equal amounts of probe material. We

measured the deviation from the true θ by the mean squared error, as this ac-

counts for bias and variance.

Figure 6 displays the effect of heterogeneity in probe material on the bias and
the variance of Tajima's π and Watterson's θ. Although both bias and variance
change noticeably for higher levels of heterogeneity, these effects cancel out
to a large extent. Thus the overall performance measured in terms of the mean
squared error

(20) $\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Var}$

changes only marginally even for a high level of heterogeneity (scale = 8), see
Figure 7. This effect can be explained by shrinkage, which leads to improved
estimates of the mutation parameter θ by permitting some bias (Futschik and
Gach, 2008).
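The decomposition in equation (20) can be checked numerically on any set of replicate estimates. The following is a minimal sketch; the replicate values are made up for illustration and the function name is hypothetical.

```python
import statistics


def mse_decomposition(estimates, true_value):
    """Return (bias, variance, mse) for a list of replicate estimates."""
    mean_estimate = statistics.fmean(estimates)
    bias = mean_estimate - true_value
    # Population variance around the mean of the replicates.
    variance = statistics.pvariance(estimates, mu=mean_estimate)
    mse = statistics.fmean([(e - true_value) ** 2 for e in estimates])
    return bias, variance, mse


# Hypothetical replicate estimates of theta; the true value is set to 10.
replicates = [9.0, 11.0, 12.0, 8.0, 10.5]
bias, variance, mse = mse_decomposition(replicates, 10.0)
print(bias, variance, mse)
```

The squared bias and the variance add up to the mean squared error, so an increase in one term can be offset by a decrease in the other.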

Heterogeneity in probe material also aﬀects the accuracy of the estimated allele

frequencies, as the variance of the estimator based on a pooled sample becomes

larger. However, this effect can be kept small by choosing a sufficiently large
pool. This is illustrated in Figure 8, which shows that, for large enough pool
sizes, pooling eventually leads to smaller variances even for a high level of
heterogeneity in probe material (scale = 8).

Discussion

Over the past decades we have been witnessing a continuous turnover of
molecular markers used in genetic research. To a large extent this turnover has
been driven by advances in molecular biology and technology. With the arrival
of the second generation sequencing technologies this race is about to come to
an end: rather than relying on a more or less representative fraction of the
genome, it has come within reach to have full genomic sequences available for
multiple individuals.

With further technological advances, it is anticipated that it will become pos-

sible to sequence individual genomes at a cost that allows even small laboratories

to perform population analyses on a genome scale. Currently, this is not pos-

sible as the costs are still too high. In this study, we showed that sequencing

pools of individuals provides an excellent alternative that permits genome wide

polymorphism surveys at very moderate costs.

This is the ﬁrst report systematically exploring the parameter range for which

DNA pooling provides an advantage compared to individual genome sequencing.

Our result that NGS of DNA pools often provides a reliable and cost-effective
means for genome-wide allele frequency estimates is supported by some recent
studies using NGS to analyze DNA pools of selected genomic regions. Van Tassell
et al. (2008) sequenced a complexity-reduced DNA pool using the Illumina
Genome Analyzer. For a subset of the identified SNPs, they compared the allele
frequency estimates from the Illumina sequencing to those obtained by genotyping
the same individuals. Although the SNP frequency estimates were undoubtedly
affected by a substantial assignment error (Palmieri and Schlötterer, 2009) due
to the short reads and the complexity-reducing procedure, Van Tassell et al.
(2008) observed a correlation of 0.67 between the two methods. Hence, there is
very little doubt that NGS is an effective tool to provide accurate genome-wide
allele frequency estimates from DNA pools.

We anticipate that the analysis of DNA pools will have a wide range of
applications. In population genetics, it will be possible to compare patterns of
differentiation on a genomic scale. Thus, patterns of local adaptation and
heterogeneity in gene flow among different genomic regions can be identified.
DNA pools are also very powerful for association mapping (Sham et al., 2002).
In contrast to SNP arrays, however, re-sequencing of DNA pools will always
include the causative SNP and thus provide a higher statistical power. Our
study provides the basis for an adequate experimental design of future pooling
experiments.

ACKNOWLEDGEMENTS

This work has been supported by a WWTF grant to AF and CS as well as FWF

Grants (P 19832-B11, L403-B11) awarded to CS. Special thanks to C. Kosiol, N.

de Maio, and R. Koﬂer for helpful comments on earlier versions of the manuscript.

We are grateful to A. Vasemägi and J. Wolf for general discussions about pooling
for NGS, and also thank the reviewers for helpful comments.

References

Achaz, G., 2008. Testing for neutrality in samples with sequencing errors. Genetics 179, 1409–1424.

Durrett, R., 2008. Probability Models for DNA Sequence Evolution. Springer, New York.

Eberle, M., Kruglyak, L., 2000. An analysis of strategies for discovery of single-nucleotide polymorphisms. Genet. Epidemiol. 19, S29–S35.

Erlich, Y., Chang, K., Gordon, A., Ronen, R., Navon, O., Rooks, M., Hannon, G. J., 2009. DNA Sudoku: harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Research 19, 1243–1253.

Futschik, A., Gach, F., 2008. On the inadmissibility of Watterson's estimate. Theoretical Population Biology 73, 212–221.

Holt, K., Teo, Y., Li, H., Nair, S., Dougan, G., Wain, J., Parkhill, J., 2009. Detecting SNPs and estimating allele frequencies in clonal bacterial populations by sequencing pooled DNA. Bioinformatics 25, 2074–2075.

Hudson, R. R., 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18, 337–338.

Jiang, R., Tavaré, S., Marjoram, P., 2009. Population genetic inference from resequencing data. Genetics 181, 187–197.

Knudsen, B., Miyamoto, M. M., 2009. Accurate and fast methods to estimate the population mutation rate from error prone sequences. BMC Bioinformatics 10:247, doi:10.1186/1471-2105-10-247.

Lynch, M., 2008. Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects. Mol. Biol. Evol. 25, 2421–2431.

Lynch, M., 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182, 295–301.

Palmieri, N., Schlötterer, C., 2009. Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling. PLoS ONE 4, e6323.

Sham, P., Bader, J. S., Craig, I., O'Donovan, M., Owen, M., 2002. DNA pooling: a tool for large-scale association studies. Nature Rev. Genet. 3, 862–871.

Van Tassell, C. P., Smith, T. P. L., Matukumalli, L. K., Taylor, J. F., Schnabel, R. D., Lawley, C. T., Haudenschild, C. D., Moore, S. S., Warren, W. C., Sonstegard, T. S., 2008. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat. Methods.

Table 1. Description of our Notation

Symbol or Notation          Description

k                           number of haploid individuals used for separate sequencing

λ                           expected number of times a locus is read for an individual using separate sequencing

n                           size of the pool in a pooling experiment

J                           random number of individuals for which reads are actually available at a particular locus with individual sequencing (J ≤ k)

M                           random number of reads for a particular locus in a pooling experiment (E(M) = kλ)

p                           relative frequency of the allele of interest in the population

$F^{(P)}(b, \gamma)$        Poisson cumulative distribution function, $F^{(P)}(b, \gamma) = \sum_{i=0}^{b} \frac{\gamma^{i}}{i!} \exp(-\gamma)$

$F^{(B)}(x, M, p)$          binomial cumulative distribution function, $F^{(B)}(x, M, p) = \sum_{i=0}^{x} \binom{M}{i} p^{i} (1-p)^{M-i}$

$\hat{\theta}_{\pi}^{(b)*}$ bias corrected version of Tajima's π for a pooling experiment when the minor allele frequency is required to be at least b. For b = 1, $\hat{\theta}_{\pi}^{(b)*} = \hat{\theta}_{\pi}^{*}$.

$\hat{\theta}_{W}^{(b)*}$   bias corrected version of Watterson's θ for a pooling experiment when the minor allele frequency is required to be at least b ≥ 1.
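The two cumulative distribution functions in Table 1 can be evaluated directly from their definitions. The sketch below uses only the Python standard library; the function names are hypothetical and not part of the paper.

```python
import math


def poisson_cdf(b, gamma):
    """F^(P)(b, gamma): probability that a Poisson(gamma) variable is at most b."""
    if b < 0:
        return 0.0
    term = math.exp(-gamma)  # probability of zero events
    total = term
    for i in range(1, b + 1):
        term *= gamma / i  # recursive update avoids huge factorials
        total += term
    return total


def binomial_cdf(x, m, p):
    """F^(B)(x, M, p): probability that a Binomial(M, p) variable is at most x."""
    if x < 0:
        return 0.0
    upper = min(x, m)
    return sum(math.comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(upper + 1))


print(poisson_cdf(0, 2.0), binomial_cdf(1, 2, 0.5))
```

Computing the Poisson terms recursively keeps the evaluation stable for large b, where explicit factorials would exceed the floating point range.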

[Figure 1: six panels. Rows A: λ=5, B: λ=10, C: λ=20; left column n=50, right column n=200. x-axis: freq in pop. (0.0 to 0.5); y-axis: prob of SNP detection (0.0 to 1.0).]

Figure 1. Probability of detecting a SNP with relative minor al-

lele frequency p in the population when a certain minimum number

of reads is required as a detection threshold. The colored lines in-

dicate the probabilities for sequencing experiments using a pooled

sample (purple-dashed: no error correction, red-dotted: minor allele
frequency (m.a.f.) at least 2, blue-dash-dot: m.a.f. ≥ 4, green
long dashed: m.a.f. ≥ 6). Solid black line: Experiment where

k = 10 haploid individuals are sequenced separately. Expected

coverage λ = 5(A), 10(B), 20(C) per individual. For pooling ex-

periments, the expected total coverage is kλ. Pool sizes are either

50 (left column) or 200 (right column).
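Curves of the kind shown in Figure 1 can be recomputed from the detection probabilities derived in the Methods section: the sum over the number l of A alleles for separate sequencing, and its pooled counterpart with a minimum of b reads per allele. The sketch below is an illustration under those formulas, not the authors' code; the function names and the example parameter values are assumptions.

```python
import math


def poisson_cdf(b, gamma):
    """Probability that a Poisson(gamma) variable is at most b."""
    if b < 0:
        return 0.0
    term = math.exp(-gamma)
    total = term
    for i in range(1, b + 1):
        term *= gamma / i
        total += term
    return total


def detect_individual(p, k, lam):
    """Detection probability when k haploid individuals are sequenced separately."""
    total = 0.0
    for l in range(1, k):
        # At least one covered individual with allele A and one with allele a.
        q_c = (1 - math.exp(-lam * l)) * (1 - math.exp(-lam * (k - l)))
        total += q_c * math.comb(k, l) * p**l * (1 - p) ** (k - l)
    return total


def detect_pooled(p, n, k, lam, b=1):
    """Detection probability for a pool of size n with at least b reads per allele."""
    total = 0.0
    for l in range(1, n):
        reads_A = 1 - poisson_cdf(b - 1, l * k * lam / n)
        reads_a = 1 - poisson_cdf(b - 1, (n - l) * k * lam / n)
        total += reads_A * reads_a * math.comb(n, l) * p**l * (1 - p) ** (n - l)
    return total


p, k, lam = 0.05, 10, 5  # example values, chosen for illustration
print(detect_individual(p, k, lam))
print(detect_pooled(p, 200, k, lam, b=1), detect_pooled(p, 200, k, lam, b=2))
```

Raising the threshold b lowers the detection probability for rare alleles, which is the price paid for protection against sequencing errors.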

[Figure 2: six panels. Rows A: λ=5, B: λ=10, C: λ=20; left column: errors dependent, right column: errors i.i.d. x-axis: log_10 sequencing error prob. (−4.0 to −1.0); y-axis: log_10 prob. of false SNP calling (−8 to 0).]

Figure 2. Log-probability of falsely detecting a SNP at a non-segregating
site as a function of the logarithm of the sequencing error probability.
The colored lines indicate the probabilities for sequencing experiments
using a pooled sample (purple-dashed: no error correction, red-dotted:
minor allele frequency (m.a.f.) at least 2, blue-dash-dot: m.a.f. ≥ 4,
green long dashed: m.a.f. ≥ 6).

Solid black line: Experiment where k = 10 haploid individuals are

sequenced separately and the most frequently read base at a po-

sition is chosen for the sequenced individual. Expected coverage

λ = 5(A), 10(B), 20(C) per individual. For pooling experiments,

the expected total coverage is kλ. Since the pool size is not rele-

vant in this context, we plot results for completely dependent (left

column) and independent sequencing errors (right) instead. See

the methods section for a more detailed description of these sce-

narios.

[Figure 3: two panels, (a) lambda=5 and (b) lambda=10. x-axis: k (5 to 40); y-axis: corresponding k with pooling (0 to 40).]

Figure 3. Sequencing effort $k^{*}$ of a pooling experiment required in order
to get allele frequency estimates with the same accuracy as in

a standard experimental setup where k individuals are sequenced

separately. (“o”: pool size n = 50, “+”: n = 100, “x”: n = 500.)
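A rough version of the relationship plotted in Figure 3 follows from the variance ratio in equation (10) together with the large-coverage approximations E(1/J) ≈ 1/k and E(1/M) ≈ 1/(kλ) given in the Methods section. The sketch below relies on these approximations only, so its values need not match the figure exactly; the function names are hypothetical.

```python
def variance_ratio(n, k, lam):
    """Approximate Var(R_p) / Var(R_c) for equal sequencing effort.

    Uses E(1/J) ~ 1/k and E(1/M) ~ 1/(k * lam), valid for large coverage.
    """
    return k * ((n - 1) / (n * k * lam) + 1.0 / n)


def matched_pooling_effort(n, k, lam):
    """Effort k* solving (n - 1) / (n * k* * lam) + 1 / n = 1 / k.

    Same approximations as above; only defined for pool sizes n > k.
    """
    if n <= k:
        raise ValueError("the pool size n must exceed k to match the accuracy")
    return k * (n - 1) / (lam * (n - k))


n, k, lam = 50, 10, 5  # example values, chosen for illustration
ratio = variance_ratio(n, k, lam)
k_star = matched_pooling_effort(n, k, lam)
print(ratio, k_star)
```

A ratio below one means that pooling gives the more accurate allele frequency estimate at equal effort, and k* below k means that the same accuracy is reached with less sequencing.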

[Figure 4: two panels, Watterson's theta and Tajima's pi. x-axis: n (0 to 200); y-axis: mean estimate (0 to 10).]

Figure 4. Expected value of the estimates obtained from pooled

samples depending on the pool size n: Watterson’s θ and Tajima’s

π. True value θ = 10 (green line). There is a considerable bias,

if n is small compared to kλ, illustrating the need to use a bias

correction with the estimates. Solid black line: λ = 30; red dashed

line: λ = 5. (For Tajima’s π, the bias does not depend on λ.)

[Figure 5: six panels, Watterson's theta (left) and Tajima's pi (right) for m.a.f. ≥ 1, m.a.f. ≥ 2 and m.a.f. ≥ 3. x-axis: n (0 to 200); y-axis: variance ratio (0.4 to 1.6).]

Figure 5. Variance ratio ($\mathrm{Var}_{\mathrm{pooled}}/\mathrm{Var}_{\mathrm{standard}}$) of the bias
corrected version of Watterson's θ and Tajima's π depending on the

pool size n. We consider pooling both without (minor allele fre-

quency (m.a.f.) ≥ 1), and with a protection (m.a.f. ≥ 2, m.a.f. ≥ 3)

against sequencing errors. (Only segregating sites with minor allele

frequency m.a.f. above the stated threshold are included.) The

horizontal green line denotes the break even ratio of one, where

both the pooled and the classical experiment lead to estimates

with equal variances. Pooling always performs better, as soon as

the size of the pool exceeds the number of separately sequenced

individuals. Solid black line: λ = 30; red dashed line: λ = 5.

Standard setup with k = 10 individuals sequenced separately.

[Figure 6: four panels, Watterson's theta (left) and Tajima's pi (right). x-axis: n (0 to 200). Top row y-axis: bias (solid), sd (dashed) (0 to 10); bottom row y-axis: squared bias (solid), var (dashed) (0 to 50).]

Figure 6. Bias (solid lines) and variance (dashed lines) of Watter-

son’s θ and Tajima’s π depending on the extent of heterogeneity in

probe material. Black lines: moderate heterogeneity (scale =2); red

lines: high heterogeneity (scale =8). In the ﬁrst row bias and stan-

dard deviations are plotted for the population genetic estimates.

The second row contains the squared bias and the variance, that

add up to the mean squared error. (Further parameters: λ = 30,

k = 10; log-normal parameters: µ = 0, σ = log(scale).)

[Figure 7: four panels, Watterson's theta and Tajima's pi for scale = 2 (top) and scale = 8 (bottom). x-axis: n (0 to 200); y-axis: MSE ratio (0.0 to 1.5).]

Figure 7. Mean squared error ratio ($\mathrm{MSE}_{\mathrm{pooled}}/\mathrm{MSE}_{\mathrm{standard}}$) of
Watterson's θ and Tajima's π depending on the pool size n and

for λ = 30. Solid black line: The same amount of probe material

is available for all individuals. Red dashed line: the amount of

probe material diﬀers from individual to individual according to

log-normal factors. For the top two panels, both curves are nearly

identical. The median factor is always one, and with a scale of two,

about 32% of all probes deviate by a factor of more than the value

given by scale. For a scale value of two (for instance), 16% of probes

involve more than double the median probe amount, and another

16% contain less than one half the median amount. (Log-normal

parameters: µ = 0, σ = log(scale), scale ∈ {2, 8}.)

[Figure 8: one panel. x-axis: n (100 to 500); y-axis: variance ratios (0.0 to 3.0).]

Figure 8. Variance ratios ($\mathrm{Var}_{\mathrm{pooled}}/\mathrm{Var}_{\mathrm{standard}}$) when estimating
allele frequencies in the case where the amount of probe material
differs from individual to individual according to log-normal

factors. The median factor is always one, and with a scale of s,

about 32% of all probes deviate by a factor of more than s. For

the scale value s = 2 (for instance) 16% of probes involve more

than double the median probe amount, and another 16% contain

less than one half the median amount. Ratios smaller than one

indicate that pooling leads to estimates with a smaller variance.

Individual sequencing is carried out for ten individuals with an ex-

pected coverage of λ = 10. Scales: s = 2 (red dashed line), s = 4

(green dotted line), s = 8 (blue dash-dotted line) (Log-normal pa-

rameters: µ = 0, σ = log(scale), scale ∈ {2, 4, 8}.)


and Yn the number of A alleles in the pool. and let furthermore XM denote the number of A alleles i=1 among the reads. θπ may be written as ˆ(b) θπ = M 2 −1 M −b Km m(M − m) m=b for a locus for which M reads are available (see section 1. Let cn = 1/ n−1 i−1 . E(θπ ) = θ n For b > 1 the sum does not simplify much. With cn θ being the expected number of segregating sites in the pool. the error drops to 0. Then M P (XM = m|Yn = r) = (r/n)m (1 − r/n)M −m m and under neutrality P (Yn = r) = r −1 /cn . r=1 However. Summarizing.e. In this case n−1 P (XM = m|Yn = r)P (Yn = r) ≈ c−1 n r=1 1 m for 1 ≤ m ≤ M − 1 and therefore (13) M − 2b + 1 .02% for b = 2 and 4 · 10−5 % for b = 3. but can be computed and turned into the bias correction factor M [ 2 m=b M −b n−1 m(M − m)P (XM = m|Yn = r)r −1 ]−1 . n−1 ˆ(1) . Notice that θπ = θπ for b = 1. an accurate approximation for (12) can be obtained by assuming that n is large compared to M. then the relative error is only 0. For n = 200 and M = 50. i. (12) ˆ(b) E(θπ ) = M 2 −1 M −b n−1 cn θ m=b r=1 m(M − m)P (XM = m|Yn = r)P (Yn = r) For b = 1. even if the pool size n is only moderately larger than the number of reads M. the resulting simple bias correction factor MM −1 turns out to provide −2b+1 very good approximations.NGS EXPERIMENTS WITH POOLED SAMPLES 9 ˆ(b) ˆ be at least b.4 in Durrett (2008)).4% when M = 10 and n = 20. if singletons are omitted (b = 2). we propose the following bias corrected version of Tajima’s π : n ˆ θ for b = 1. ˆ(b) E(θπ ) ≈ θ M −1 . n−1 π ˆ(b)∗ = (14) θπ M −2b+1 ˆ(b) θπ for b > 1. M −1 For b > 1. straightforward calculations reproduce (11). With Km denoting the number of ˆ(b) sites where the derived allele A has frequency m. Indeed.
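The relation between the exact expectation (12) and the simple factor in (14) can be checked numerically. The following is a minimal sketch that evaluates (12) by direct summation; the function names are illustrative and not part of the original paper, and sequencing errors are ignored as in the derivation.

```python
from math import comb


def tajima_expectation_factor(M: int, n: int, b: int) -> float:
    """Return E(theta_pi^(b)) / theta as given by equation (12).

    M: number of reads at the locus, n: pool size,
    b: minimum required minor allele count among the reads.
    """
    total = 0.0
    for m in range(b, M - b + 1):
        for r in range(1, n):
            q = r / n
            # P(X_M = m | Y_n = r): reads are drawn with replacement from the pool
            p_reads = comb(M, m) * q**m * (1 - q) ** (M - m)
            # c_n * P(Y_n = r) = 1 / r under neutrality
            total += m * (M - m) * p_reads / r
    return total / comb(M, 2)


def tajima_correction_factor(M: int, n: int, b: int) -> float:
    """Simple bias correction factor proposed in equation (14)."""
    if b == 1:
        return n / (n - 1)
    return (M - 1) / (M - 2 * b + 1)


if __name__ == "__main__":
    for M, n, b in [(10, 20, 1), (10, 20, 2), (50, 200, 2)]:
        exact = 1.0 / tajima_expectation_factor(M, n, b)
        simple = tajima_correction_factor(M, n, b)
        print(f"M={M} n={n} b={b}: exact={exact:.6f} simple={simple:.6f}")
```

For $b = 1$ the two factors coincide, while for $b > 1$ the printed values show how closely the simple factor tracks the exact one for a given $M$ and $n$.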

To obtain an overall estimate based on $L$ loci with possibly unequal coverage $M_l$ ($1 \le l \le L$), simply take the sum over the individually bias corrected estimates
\[ \hat\theta_\pi^{(b)*} = \sum_{l=1}^{L} \hat\theta_{\pi,l}^{(b)*}. \tag{15} \]
Dividing $\hat\theta_\pi^{(b)*}$ by the total length of the considered sequence, an estimator for the scaled mutation parameter per base results.

We now derive a bias correction for Watterson’s estimator, again first focusing on a locus with coverage $M$. We consider a version of Watterson’s estimator that requires a minimum minor allele frequency $b$. For $b = 1$ we use all segregating sites, and versions that protect against sequencing errors can be obtained by choosing $b > 1$. Let $S_b$ denote the number of segregating sites found in the $M$ sequence reads from the pool for which the minor allele frequency is at least $b$. Then
\[ \hat\theta_W^{(b)} := \frac{S_b}{\sum_{i=1}^{M-1} 1/i} \tag{16} \]
provides protection against sequencing errors, if $b$ is large enough. Analogous to (12), we obtain that conditional on the number of reads $M$ for the locus
\[ E(\hat\theta_W^{(b)} \mid M) = \theta\, \frac{c_n}{c_M} \left[ \sum_{m=b}^{M-b} \sum_{r=1}^{n-1} P(X_M = m \mid Y_n = r)\, P(Y_n = r) \right]. \tag{17} \]
Let $F_{(B)}(x, M, p)$ denote the probability $P(X \le x)$ for a binomial random variable $X$ with $M$ trials and success probability $p$. In particular for $p = r/n$,
\[ F_{(B)}(x, M, r/n) = \sum_{i=0}^{x} \binom{M}{i} \left( \frac{r}{n} \right)^{i} \left( 1 - \frac{r}{n} \right)^{M-i}. \]
Recall furthermore that $c_M = \sum_{i=1}^{M-1} i^{-1}$. Then a bias corrected version of $\hat\theta_W^{(b)}$ for $b \ge 1$ is given as
\[ \hat\theta_W^{(b)*} = \frac{\hat\theta_W^{(b)}\, c_M}{\sum_{r=1}^{n-1} \frac{1}{r} \left[ F_{(B)}(M - b, M, r/n) - F_{(B)}(b - 1, M, r/n) \right]}. \tag{18} \]
As with Tajima’s π, $\hat\theta_W^{(b)*}$ can be easily adapted to work with longer sequences. For this purpose, partition the sequence into $L$ loci such that for each locus a constant number of reads $M_l$ is available, and obtain the bias corrected Watterson estimate $\hat\theta_{W,l}^{(b)*}$ separately for each locus $l$. Then
\[ \hat\theta_W^{(b)*} = \sum_{l=1}^{L} \hat\theta_{W,l}^{(b)*} \tag{19} \]
provides an estimate of the overall scaled mutation parameter. Dividing $\hat\theta_W^{(b)*}$ by the total length of the considered sequence, an estimator for the scaled mutation parameter per base results.

Results

SNP Detection. For many biological applications SNP genotyping provides a cost effective approach, and SNP discovery is the first step required. We compared the efficiency of SNP discovery using an approach in which each individual is sequenced separately with a pooling approach. The panels of Figure 1 show that the comparative efficiency of pooling depends both on the expected coverage and on the minimum number of reads for allele calling used for error protection. As long as it is not chosen too small, the size of the pool seems to play a less important role. While pooling experiments provide a higher probability of SNP detection in most cases, pooling is expected to be less efficient if both the coverage is small and a high minimum number of reads is required. This is not entirely unexpected, since an increased number of reads required for the inference of the minor allele reduces the probability of detecting SNPs in a pooling experiment. The higher the expected coverage, the more inefficient individual sequencing becomes.

Figure 2 addresses the problem of wrongly identifying a sequencing error as a SNP. Irrespective of the assumed model of sequencing errors (see methods section for further details), a high probability of sequencing errors makes SNP calling from pools highly unreliable. On the other hand, if sequencing error rates are reduced (e.g. by quality filtering), a suitable lower bound on the minimum allele frequency for detecting a SNP makes pooling very reliable for the identification of SNPs. Interestingly, in some cases, we found pooling to result in fewer erroneous SNP calls than individual sequencing.

Allele frequency inference. In population genetics, the allele frequency spectrum is of central interest. Estimating the allele frequency spectrum of a population is subject to sampling variation. In an individual based sequencing strategy, most of the sampling variation comes from the selection of individuals used for DNA sequencing. The advantage of the pooling approach is that this sampling error can be dramatically reduced by including a large number of individuals in the pool. On the other hand, a second level of sampling error arises in the pooling approach from the fact that not all chromosomes in the pool are sequenced and some chromosomes may be sequenced more than once.

We start by discussing the situation where individuals contribute equal amounts of probe material and refer to the last paragraph of the section for the case when this assumption is violated. In the methods section, we obtained expression (10) for the ratio of the variances of the estimated relative allele frequency both for a pooling experiment $R_p$ (pool size $n$), and a classical experiment with individual sequencing $R_c$. For a large enough expected coverage $\lambda$ and with $k$ individuals sequenced, this equation can be approximated by the following quick rule of thumb: Pooling will lead to a smaller variance for those experimental setups that satisfy $1/\lambda + k/n < 1$ or equivalently $n/(n - k) < \lambda$. Thus a case where pooling provides a better estimate of the allele frequency is when the pool contains more than twice the number of separately sequenced individuals and the coverage $\lambda$ per separately sequenced individual is at least two. For larger pools smaller values of $\lambda$ will be sufficient. It should be noted however that the variance of the pooling experiment will become larger if individuals contribute unequal amounts of probe material. This issue will be addressed in the last paragraph of this section.

So far we compared the individual based and pooling strategy only for the same number of sequenced reads. Alternatively, the superiority of the pooling approach could be expressed by the reduction of sequencing costs. Figure 3 compares the pooling approach to sequencing of individuals when both methods provide the same accuracy for allele frequency estimates. Suppose that $k$ individuals are sequenced separately, each at an expected coverage $\lambda$. Then $k^*$ indicates the cost in single genome sequencing equivalents that results in the same accuracy as sequencing $k$ genomes individually. If, for instance, $k = 20$ and $k^* = 10$, then pooling would give the same accuracy with half the sequencing effort. Figure 3 clearly indicates that larger pool sizes increase the advantage of sequencing pools. A higher sequence coverage ($\lambda$) for sequencing of individuals further improves the cost effectiveness of pooling.

In genome-wide association studies (GWAS), the association between allele frequencies and traits (diseases) is investigated. A possible approach is to test whether alleles have different frequencies in two pools that differ with respect to the trait of interest (see Sham et al. (2002)). Since the ratio of variances (10) does not depend on the allele frequencies in the sub-populations, the standard deviation entering the test statistic will differ by the square root of (10) between a pooling and a classical experimental setup. If the square root of (10) is 1/2 (say), the shift of the expected value of the test statistic under the alternative will be twice as large in a pooling experiment. Overall pooling will be the more powerful approach, whenever the variance ratio is smaller than one (see (10)).

Estimating population genetic parameters. We now compare the estimation of the scaled mutation parameter using Watterson’s θ and Tajima’s π under our two experimental setups. For this purpose, we simulated 100 samples under neutrality with mutation parameter $\theta = 10$, using the ms software (Hudson, 2002), (./ms 500 100 -t 10 > ms.out) where 500 is the number of sequences generated. For separate sequencing, we took random sub-samples of size $k = 10$ from each sample, thus simulating separate sequencing of 10 individuals each with an expected number $\lambda$ of reads. With pooling, we took samples of size $n$ out of the 500 simulated sequences. From this pool, reads were taken independently for each locus $l$ by making a random number of draws $M_l$ with replacement. The quantities $M_l$ were chosen according to a Poisson distribution with expected value $k\lambda$.

Neglecting sequencing errors for the moment, it turns out that the pooling approach with bias correction leads to more accurate estimates of θ and π, provided that the size of the pool is large enough. For small pools, multiple reads of the same chromosome become more common, which affects the accuracy of the estimates negatively. (Figure 5.)

We now investigate the pooling approach when including a protection against sequencing errors by removing all segregating sites where the minor allele frequency $x$ satisfies $x = 1$ or alternatively $x \le 2$. Again, the normalizing constants have been adapted in order to avoid bias. Let $b$ denote the minimum required minor allele frequency. For Tajima’s π, we used the bias correction (14) for individual loci and added the estimates across loci using (15). For Watterson’s θ, the bias has been corrected using formula (18) for each locus. Figure 5 shows the relative advantage of pooling conditional on different minimum minor allele frequencies. Pooling still leads to a decreased variance under neutrality as long as the pool size is large enough. Not unexpectedly, the reduction in variance is now somewhat smaller for Watterson’s $\hat\theta_W^{(b)*}$. The increase in the variance of Tajima’s $\hat\theta_\pi^{(b)*}$ is much smaller, since frequency one minor alleles receive a low weight in the calculation of π.

Unequal amounts of probe material. One obvious source of error in the pooling approach is the heterogeneity in DNA amounts due to measurement errors. Individuals for which a larger DNA amount has been included in the DNA pool will be over-represented, which potentially causes a change in allele frequency estimates. This also affects the bias and the variance of the population genetic summary statistics considered here. In experiments that rely on PCR amplification, the heterogeneity can be expected to be particularly strong.

To investigate the sensitivity of population genetic estimates based on pooling experiments, we simulated a scenario involving unequal amounts of probe material. We set the expected amount of probe material to one, and allowed for log-normally distributed multiplicative deviations from this expected value. More specifically, the deviation factors have been chosen independently for each individual contributing to the pool according to $\exp(X_i)$, where $X_i$ ($1 \le i \le n$) are normal $N(0, \log(\mathrm{scale}))$ random variables. Thus the median amount of probe material is always equal to one. If the deviation factor has a value of $\exp(X_i) = 1.5$, this means that the respective individual will have a 50% higher chance of being sequenced than another with a factor of $\exp(X_i) = 1$. Similarly, a value of 0.8 means a 20% decreased chance of being read.
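The heterogeneity model just described can be sketched as follows. This is an illustration only, not the authors' simulation code: the function names are hypothetical, and it assumes that log(scale) is the standard deviation (σ) of the normal variables $X_i$, as stated in the figure captions.

```python
import math
import random


def probe_factors(n, scale, rng):
    """Draw one multiplicative deviation factor exp(X_i) per pooled individual.

    Assumption: X_i is normal with mean 0 and standard deviation log(scale),
    so the median factor equals one. Requires scale > 1.
    """
    sigma = math.log(scale)
    return [rng.lognormvariate(0.0, sigma) for _ in range(n)]


def draw_reads(n_reads, factors, rng):
    """Sample the individual of origin for each read, with replacement.

    Each individual is chosen with probability proportional to its factor,
    so a factor of 1.5 gives a 50% higher chance than a factor of 1.
    """
    return rng.choices(range(len(factors)), weights=factors, k=n_reads)


if __name__ == "__main__":
    rng = random.Random(2010)
    factors = probe_factors(100, 2.0, rng)
    reads = draw_reads(300, factors, rng)
    twofold = sum(1 for f in factors if f > 2 or f < 0.5) / len(factors)
    print("share deviating at least twofold from the median:", twofold)
    print("distinct individuals among 300 reads:", len(set(reads)))
```

Under this assumption, the share of individuals deviating by more than the factor `scale` from the median is $P(|Z| > 1) \approx 0.32$ for a standard normal $Z$, whatever the value of `scale`.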

As our first scenario (scale = 2), slightly more than 30% of all individuals differed at least twofold from the median. In other words, for a pool of size $n = 100$, the most abundant individual contributed about sixteen times the probe material of the least abundant individual. We also simulated a more extreme scenario (scale = 8) where about 30% of the individuals differed at least eightfold from the median. As further parameters we chose $\lambda = 30$, $k = 10$, $n \in [5, 200]$. As the amount of heterogeneity in the sample will usually be unknown, we applied the same bias correction as for equal amounts of probe material. We measured the deviation from the true θ by the mean squared error, as this accounts for bias and variance.

Figure 6 displays the effect of heterogeneity in probe material on the bias and the variance of Tajima’s π and Watterson’s θ. Although both bias and variance change noticeably for higher levels of heterogeneity, these effects cancel out to a large extent. Thus the overall performance measured in terms of the mean squared error
\[ \mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Var} \tag{20} \]
changes only marginally even for a large level of heterogeneity (scale = 8), see Figure 7. This effect can be explained by shrinkage that leads to improved estimates of the mutation parameter θ by permitting some bias (Futschik and Gach, 2008).

Heterogeneity in probe material also affects the accuracy of the estimated allele frequencies, as the variance of the estimator based on a pooled sample becomes larger. However, by choosing a pool of a large enough size, this effect can be kept small. This is illustrated in Figure 8, where it can be seen that, for large enough pool sizes, pooling eventually leads to smaller variances even for a high level of heterogeneity in probe material (scale = 8).

Discussion

Over the past decades we have been witnessing a continuous turnover of molecular markers used in genetic research. To a large extent this turnover has been driven by the advances in molecular biology and technology. With the arrival of the second generation sequencing technologies this race is about to come to an end: rather than relying on a more or less representative fraction of the genome, it has come into reach to have full genomic sequences available for multiple individuals. With further technological advances, it is anticipated that it will become possible to sequence individual genomes at a cost that allows even small laboratories to perform population analyses on a genome scale. Currently, this is not possible as the costs are still too high. In this study, we showed that sequencing pools of individuals provides an excellent alternative that permits genome wide polymorphism surveys at very moderate costs.

Our result that NGS of DNA pools often provides a reliable and cost effective means for genome-wide allele frequency estimates is supported by some recent studies using NGS to analyze DNA pools of selected genomic regions. Van Tassell et al. (2008) sequenced a complexity reduced DNA pool using the Illumina Genome Analyzer. For a subset of the identified SNPs, they compared the allele frequency estimates from the Illumina sequencing to those obtained by genotyping the same individuals. Although the SNP frequency estimates were undoubtedly affected by a substantial assignment error (Palmieri and Schlötterer, 2009) due to the short reads and the complexity reducing procedure, Van Tassell et al. (2008) observed a correlation of 0.67 between the two methods. Hence, there is very little doubt that NGS is an effective tool to provide accurate genome-wide allele frequency estimates from DNA pools.

This is the first report systematically exploring the parameter range for which DNA pooling provides an advantage compared to individual genome sequencing. Our study provides the basis for an adequate experimental design of future pooling experiments. We anticipate that the analysis of DNA pools provides a wide range of applications. In population genetics, it will be possible to compare patterns of differentiation on a genomic scale. Thus, patterns of local adaptation and heterogeneity in gene flow among different genomic regions can be identified. DNA pools are also very powerful for association mapping (Sham et al., 2002). In contrast to SNP arrays, however, re-sequencing of DNA pools will always include the causative SNP and thus provide a higher statistical power.

ACKNOWLEDGEMENTS

This work has been supported by a WWTF grant to AF and CS as well as FWF Grants (P 19832-B11, L403-B11) awarded to CS. We are grateful to A. Vasemägi and J. Wolf for general discussions about pooling for NGS. Special thanks to C. Kosiol, N. de Maio, and R. Kofler for helpful comments on earlier versions of the manuscript. We also thank the reviewers for helpful comments.

References

Achaz, G., 2008. Testing for neutrality in samples with sequencing errors. Genetics 179, 1409–1424.

Durrett, R., 2008. Probability Models for DNA Sequence Evolution. Springer, New York.

Eberle, M. A., Kruglyak, L., 2000. An analysis of strategies for discovery of single-nucleotide polymorphisms. Genet. Epidemiol. 19, S29–S35.

Erlich, Y., Chang, K., Gordon, A., Ronen, R., Navon, O., Rooks, M., Hannon, G. J., 2009. DNA Sudoku-harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Research 19, 1243–1253.

Futschik, A., Gach, F., 2008. On the inadmissibility of Watterson’s estimate. Theoretical Population Biology 73, 212–221.

Holt, K. E., Teo, Y. Y., Li, H., Nair, S., Dougan, G., Wain, J., Parkhill, J., 2009. Detecting SNPs and estimating allele frequencies in clonal bacterial populations by sequencing pooled DNA. Bioinformatics 25, 2074–2075.

Hudson, R., 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18, 337–338.

Jiang, R., Tavaré, S., Marjoram, P., 2009. Population genetic inference from resequencing data. Genetics 181, 187–197.

Knudsen, B., Miyamoto, M. M., 2009. Accurate and fast methods to estimate the population mutation rate from error prone sequences. BMC Bioinformatics 10:247, doi:10.1186/1471-2105-10-247.

Lynch, M., 2008. Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects. Mol. Biol. Evol. 25, 2421–2431.

Lynch, M., 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182, 295–301.

Palmieri, N., Schlötterer, C., 2009. Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling. PloS one 4, e6323+.

Sham, P., Bader, J. S., Craig, I., O’Donovan, M., Owen, M., 2002. DNA pooling: A tool for large-scale association studies. Nature Rev. Genet. 3, 862–871.

Van Tassell, C. P., Smith, T. P., Matukumalli, L. K., Taylor, J. F., Schnabel, R. D., Lawley, C. T., Haudenschild, C. D., Moore, S. S., Warren, W. C., Sonstegard, T. S., 2008. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat. Methods.

Table 1. Description of our Notation

| Symbol or Notation | Description |
| --- | --- |
| $k$ | number of haploid individuals used for separate sequencing |
| $\lambda$ | expected number of times a locus is read for an individual using separate sequencing |
| $n$ | size of the pool in a pooling experiment |
| $J$ | random number of individuals for which reads are actually available at a particular locus with individual sequencing ($J \le k$) |
| $M$ | random number of reads for a particular locus in a pooling experiment ($E(M) = k\lambda$) |
| $p$ | relative frequency of the allele of interest in the population |
| $F_{(P)}(b, \gamma)$ | Poisson cumulative distribution function ($F_{(P)}(b, \gamma) = \sum_{i=0}^{b} \frac{\gamma^i}{i!} \exp(-\gamma)$) |
| $F_{(B)}(x, M, p)$ | binomial cumulative distribution function ($F_{(B)}(x, M, p) = \sum_{i=0}^{x} \binom{M}{i} p^i (1-p)^{M-i}$) |
| $\hat\theta_\pi^{(b)*}$ | bias corrected version of Tajima’s π for a pooling experiment when the minor allele frequency is required to be at least $b$. For $b = 1$, $\hat\theta_\pi^{(b)*} = \hat\theta_\pi^{*}$. |
| $\hat\theta_W^{(b)*}$ | bias corrected version of Watterson’s θ for a pooling experiment when the minor allele frequency is required to be at least $b \ge 1$. |

[Figure 1: six panels, A: λ=5, B: λ=10, C: λ=20, each for n=50 (left) and n=200 (right). x-axis: frequency in population; y-axis: probability of SNP detection.]

Figure 1. Probability of detecting a SNP with relative minor allele frequency p in the population when a certain minimum number of reads is required as a detection threshold. Solid black line: Experiment where k = 10 haploid individuals are sequenced separately. The colored lines indicate the probabilities for sequencing experiments using a pooled sample (purple-dashed: no error correction, red-dotted: minor allele frequency (m.a.f.) at least 2, blue-dash-dot: m.a.f. ≥ 4, green long dashed: m.a.f. ≥ 6). Pool sizes are either 50 (left column) or 200 (right column). Expected coverage λ = 5 (A), 10 (B), 20 (C) per individual. For pooling experiments, the expected total coverage is kλ.

[Figure 2: six panels, A: λ=5, B: λ=10, C: λ=20, each for dependent errors (left) and i.i.d. errors (right). x-axis: log_10 sequencing error probability; y-axis: log_10 probability of false SNP calling.]

Figure 2. Log-probability of falsely detecting a SNP at a nonsegregating site, in dependence on the logarithm of the sequencing error probability. Solid black line: Experiment where k = 10 haploid individuals are sequenced separately and the most frequently read base at a position is chosen for the sequenced individual. The colored lines indicate the probabilities for sequencing experiments using a pooled sample (purple-dashed: no error correction, red-dotted: minor allele frequency (m.a.f.) at least 2, blue-dash-dot: m.a.f. ≥ 4, green long dashed: m.a.f. ≥ 6). Since the pool size is not relevant in this context, we plot results for completely dependent (left column) and independent sequencing errors (right) instead. Expected coverage λ = 5 (A), 10 (B), 20 (C) per individual. For pooling experiments, the expected total coverage is kλ. See the methods section for a more detailed description of these scenarios.

[Figure 3: two panels, (a) λ=5 and (b) λ=10. x-axis: k; y-axis: corresponding k with pooling.]

Figure 3. Sequencing effort k* of a pooling experiment in order to get allele frequency estimates with the same accuracy as in a standard experimental setup where k individuals are sequenced separately. (“o”: pool size n = 50, “+”: n = 100, “x”: n = 500.)
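The comparison behind Figure 3 can be approximated from the variance ratio (10). The sketch below uses the large-coverage approximations E(1/J) ≈ 1/k and E(1/M) ≈ 1/(kλ) quoted in the methods; it is an assumption that the pooling experiment has an expected total coverage of k*λ, and the exact expectations used for the published figure may give somewhat different values. The function names are illustrative.

```python
def variance_ratio(n, k, lam, k_pool=None):
    """Approximate Var(R_p) / Var(R_c) from equation (10).

    n: pool size, k: number of individuals sequenced separately,
    lam: expected coverage per separately sequenced individual,
    k_pool: sequencing effort of the pooling experiment in single genome
            equivalents (defaults to k, i.e. equal sequencing effort).
    Uses E(1/J) ~ 1/k and E(1/M) ~ 1/(k_pool * lam).
    """
    if k_pool is None:
        k_pool = k
    e_inv_m = 1.0 / (k_pool * lam)
    e_inv_j = 1.0 / k
    return (e_inv_m * (n - 1) / n + 1.0 / n) / e_inv_j


def equivalent_effort(n, k, lam):
    """Effort k* at which the approximate variance ratio equals one.

    Solving variance_ratio(n, k, lam, k_pool) = 1 for k_pool gives
    k* = k (n - 1) / (lam (n - k)), which requires n > k.
    """
    if n <= k:
        raise ValueError("the pool must contain more than k individuals")
    return k * (n - 1) / (lam * (n - k))


if __name__ == "__main__":
    for n in (50, 100, 500):
        print(n, variance_ratio(n, 10, 5.0), equivalent_effort(n, 10, 5.0))
```

With equal sequencing effort, a ratio below one corresponds to the rule of thumb 1/λ + k/n < 1, up to the factor (n − 1)/n that the rule of thumb drops.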

[Figure 4: two panels, Watterson’s theta and Tajima’s pi. x-axis: n; y-axis: mean estimate.]

Figure 4. Expected value of the estimates obtained from pooled samples depending on the pool size n: Watterson’s θ and Tajima’s π. True value θ = 10 (green line). Solid black line: λ = 30, red dashed line: λ = 5. (For Tajima’s π, the bias does not depend on λ.) There is a considerable bias, if n is small compared to kλ, illustrating the need to use a bias correction with the estimates.
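The bias of the uncorrected Watterson estimate can be evaluated directly from (17), and the correction (18) divides by the same factor. The following is a minimal sketch for a fixed number of reads M; it does not average over a Poisson distributed M as the simulations behind Figure 4 do, and the function names are illustrative rather than taken from the paper.

```python
from math import comb


def binom_cdf(x, M, p):
    """F_(B)(x, M, p) = P(X <= x) for a binomial X with M trials."""
    return sum(comb(M, i) * p**i * (1 - p) ** (M - i) for i in range(0, x + 1))


def harmonic(M):
    """Normalizing constant c_M = sum_{i=1}^{M-1} 1/i."""
    return sum(1.0 / i for i in range(1, M))


def watterson_expectation_factor(M, n, b=1):
    """E(theta_W^(b) | M) / theta following equation (17).

    M: number of reads (M >= 2), n: pool size, b: minimum minor allele count.
    """
    s = sum(
        (binom_cdf(M - b, M, r / n) - binom_cdf(b - 1, M, r / n)) / r
        for r in range(1, n)
    )
    return s / harmonic(M)


def watterson_corrected(theta_w, M, n, b=1):
    """Bias corrected estimate following equation (18)."""
    return theta_w / watterson_expectation_factor(M, n, b)


if __name__ == "__main__":
    for n in (5, 20, 50, 200):
        print(n, watterson_expectation_factor(30, n))
```

For b = 1 the factor stays below one and approaches one as the pool grows relative to the number of reads, which is the pattern shown for Watterson’s θ in Figure 4.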

[Figure 5: six panels, Watterson’s theta (left) and Tajima’s pi (right), for m.a.f. ≥ 1, m.a.f. ≥ 2 and m.a.f. ≥ 3. x-axis: n; y-axis: variance ratio.]

Figure 5. Variance ratio (Var_pooled / Var_standard) of the bias corrected version of Watterson’s θ and Tajima’s π depending on the pool size n. Standard setup with k = 10 individuals sequenced separately. We consider pooling both without (minor allele frequency (m.a.f.) ≥ 1), and with a protection (m.a.f. ≥ 2, m.a.f. ≥ 3) against sequencing errors. (Only segregating sites with minor allele frequency m.a.f. above the stated threshold are included.) Solid black line: λ = 30, red dashed line: λ = 5. The horizontal green line denotes the break even ratio of one, where both the pooled and the classical experiment lead to estimates with equal variances. Pooling always performs better, as soon as the size of the pool exceeds the number of separately sequenced individuals.

[Figure 6: four panels, Watterson’s theta and Tajima’s pi. First row: bias (solid), sd (dashed); second row: squared bias (solid), var (dashed). x-axis: n.]

Figure 6. Bias (solid lines) and variance (dashed lines) of Watterson’s θ and Tajima’s π depending on the extent of heterogeneity in probe material. In the first row bias and standard deviations are plotted for the population genetic estimates. The second row contains the squared bias and the variance, which add up to the mean squared error. Black lines: moderate heterogeneity (scale = 2), red lines: high heterogeneity (scale = 8). (Further parameters: λ = 30, k = 10, log-normal parameters: µ = 0, σ = log(scale).)

[Figure 7: four panels, Watterson’s theta and Tajima’s pi, for scale = 2 (top) and scale = 8 (bottom). x-axis: n; y-axis: MSE ratio.]

Figure 7. Mean squared error ratio (MSE_pooled / MSE_standard) of Watterson’s θ and Tajima’s π depending on the pool size n and for λ = 30. Solid black line: The same amount of probe material is available for all individuals. Red dashed line: the amount of probe material differs from individual to individual according to log-normal factors. The median factor is always one, and about 32% of all probes deviate by a factor of more than the value given by scale. For a scale value of two (for instance), 16% of probes involve more than double the median probe amount, and another 16% contain less than one half the median amount. For the top two panels, both curves are nearly identical. (Log-normal parameters: µ = 0, σ = log(scale), scale ∈ {2, 8}.)

[Figure 8: one panel. x-axis: n; y-axis: variance ratios.]

Figure 8. Variance ratios (Var_pooled / Var_standard) when estimating allele frequencies in the case where the amount of probe material differs from individual to individual according to log-normal factors. Individual sequencing is carried out for ten individuals with an expected coverage of λ = 10. Ratios smaller than one indicate that pooling leads to estimates with a smaller variance. The median factor is always one, and with a scale of s, about 32% of all probes deviate by a factor of more than s. For the scale value s = 2 (for instance) 16% of probes involve more than double the median probe amount, and another 16% contain less than one half the median amount. Scales: s = 2 (red dashed line), s = 4 (green dotted line), s = 8 (blue dash-dotted line). (Log-normal parameters: µ = 0, σ = log(scale), scale ∈ {2, 4, 8}.)
