MASSIVELY PARALLEL SEQUENCING OF POOLED DNA SAMPLES - THE NEXT GENERATION OF MOLECULAR MARKERS
Authors and affiliations
Andreas Futschik (1,∗) and Christian Schlötterer (2)
(1) : Department of Statistics, University of Vienna, Vienna, Austria
(2) : Institut für Populationsgenetik, Veterinärmedizinische Universität Wien,
Wien, Austria.
(∗) : corresponding author.
Genetics: Published Articles Ahead of Print, published on May 10, 2010 as 10.1534/genetics.110.114397
2 NGS EXPERIMENTS WITH POOLED SAMPLES
Abstract
Next generation sequencing (NGS) is about to revolutionize genetic analysis.
Currently NGS techniques are mainly used to sequence individual genomes. Due
to the high sequence coverage required, the costs for population scale analyses
are still too high to allow an extension to non-model organisms. Here, we show
that NGS of pools of individuals is often more effective in SNP discovery and
provides more accurate allele frequency estimates, even when taking sequencing
errors into account. We modify the population genetic estimators Tajima’s π and
Watterson’s θ to obtain unbiased estimates from NGS pooling data. Given the
same sequencing effort, the resulting estimators often show a better performance
than those obtained from individual sequencing. Although our analysis also shows
that NGS of pools of individuals will not be preferable under all circumstances, it
provides a cost effective approach to estimate allele frequencies on a genome-wide
scale.
Next generation sequencing (NGS) is about to revolutionize biology. Through
a massive parallelization, NGS provides an enormous number of reads, which per-
mits sequencing of entire genomes at a fraction of the costs for Sanger sequencing.
Hence, for the first time it has become feasible to obtain the complete genomic
sequence for a large number of individuals. For several organisms, including hu-
man, D. melanogaster and A. thaliana, large re-sequencing projects are well on
their way. Nevertheless, despite the enormous cost reduction, genome sequencing
on a population scale is still out of reach for the budget of most laboratories.
Extracting as much statistical information as possible at the lowest possible
cost has therefore already attracted considerable interest. See for instance
Jiang et al. (2009) for the modeling of sequencing errors and Erlich et al. (2009)
for the efficient tagging of sequences.
Current genome-wide re-sequencing projects collect the sequences individual
by individual. In order to obtain full coverage of the entire genome and to have
high confidence that all heterozygous sites were discovered, it is required that
genomes are sequenced at a sufficiently high coverage. As many of the reads
only provide redundant information, cost could be reduced by a more effective
sampling strategy.
In this report, we explore the potential of DNA pooling to provide a more
cost-effective approach for SNP discovery and genome wide population genetics.
Sequencing a large pool of individuals simultaneously keeps the number of redundant
DNA reads low, and thus provides an economical alternative to the sequencing
of individual genomes. On the other hand, more care has to be taken to establish
an appropriate control of sequencing errors. Obviously haplotype information is
not available from pooling experiments, but this will often be outweighed by the
increased accuracy in population genetic inference.
Focusing on biallelic loci, our analysis shows that with sufficiently large pool
sizes, pooling usually outperforms the separate sequencing of individuals, both
for estimating allele frequencies and inference of population genetic parameters.
When sequencing errors are not too common, pooling also seems to be a good
choice for SNP detection experiments. To avoid the additional challenges encountered
with individual sequencing of diploid individuals, we compare pooling
with individual sequencing of haploid individuals. See Lynch (2008, 2009) for
a discussion of next generation sequencing of diploid individuals. Our results
for the pooling experiments should also be applicable to a diploid setting, as we
are just merging pools of size 2 into a larger pool in this case, leading to a pool
size of $n = 2n_d$ for $n_d$ diploid individuals. In the methods section, we derive
several mathematical expressions that permit us to compare pooling with sep-
arate sequencing of individuals. These formulas are then applied in the results
section in order to illustrate the differences in accuracy between the approaches.
A reader who is only interested in the actual differences under several scenarios
might therefore want to move directly to the results section.
Methods
Throughout, we will consider an individual sequencing project where k indi-
viduals are sequenced each with an expected coverage λ, by which we mean that
any given locus is sequenced λ times on average. For a comparable pooling exper-
iment that involves the same amount of sequencing effort, the expected coverage
will then be kλ, i.e. any particular locus will be read kλ times on average from
the pool consisting of n individuals. In practice, one might for instance sequence
each of the k individuals on a separate Illumina lane with coverage λ. With the
same sequencing effort, the pool could be sequenced on k lanes simultaneously,
leading to a total coverage of kλ.
For the convenience of the reader, we summarize our notation in Table 1.
SNP detection. A SNP is detected at a site if the site is polymorphic, i.e. if
at least two alleles A and a are found in the sequenced sample. We will consider
SNP detection both in the context of pooling experiments and for individual
sequencing. To assess the performance of these two competing scenarios, we will
look both at the power and at the probability of falsely calling a SNP due to
sequencing errors.
Generally speaking, an experimental design that provides high power while
keeping the probability of incorrectly detecting a SNP small, will be preferable.
When individuals are sequenced separately, the probability of sequencing errors
being interpreted as true SNPs can be reduced by a sufficiently high expected
coverage if the genotype of an individual is inferred by the majority of reads.
Note that in the case of diploid individuals, the distinction between sequencing
errors and true SNPs is significantly more complicated. In pooling experiments, a
simple way to control the probability of falsely detecting SNPs both in the haploid
and in the diploid case is to require a certain minimum number of reads for the
minor allele in order to call a SNP. We extend work by Eberle and Kruglyak
(2000) on SNP detection, and derive both the power and error rates for pooling
experiments and for separate sequencing.
Separate Sequencing of Individuals. Let $M_A$ ($M_a$) denote the number of
times allele A (a) is sequenced. Given that exactly $L_A = l$ out of k individuals in
the sample have an allele of type A, the probability of detecting polymorphism is
equal to the probability of reading at least one of the A and one of the remaining
a alleles in the sample. Assuming that for each individual the number of reads at
a particular locus is Poisson distributed with parameter λ, the probability of not
covering the SNP locus for an individual is exp(−λ). This leads to the probability
$$q_c(l, k, \lambda) := \left(1 - [\exp(-\lambda)]^{l}\right)\left(1 - [\exp(-\lambda)]^{k-l}\right)$$
for getting at least one “A” and one “a” read. Notice that for larger values of
λ, this probability is nearly one, except for l = 0 or l = k, where $q_c(l, k, \lambda) = 0$.
Suppose now that our population size N is fairly large and that the relative
frequency of allele A is p in the population. Then, by conditioning on the number
l of A alleles in the sample, the probability of detecting a SNP is approximately
$$q(p, k, \lambda) = \sum_{l=1}^{k-1} q_c(l, k, \lambda) \binom{k}{l} p^{l} (1-p)^{k-l}. \tag{1}$$
For large values of λ, we obtain that
$$q(p, k, \lambda) \approx 1 - p^{k} - (1-p)^{k}. \tag{2}$$
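As an illustration, equations (1) and (2) can be evaluated numerically. The following is a minimal sketch, not code from the study; the function names are hypothetical and the parameter values in the usage loop are chosen only for demonstration.

```python
from math import comb, exp


def q_c(l: int, k: int, lam: float) -> float:
    """P(at least one A read and one a read | l of the k individuals carry A)."""
    miss = exp(-lam)  # chance that a single individual has no read at the site
    return (1.0 - miss ** l) * (1.0 - miss ** (k - l))


def snp_detection_individual(p: float, k: int, lam: float) -> float:
    """Equation (1): average q_c over the binomial number l of A carriers."""
    return sum(
        q_c(l, k, lam) * comb(k, l) * p ** l * (1.0 - p) ** (k - l)
        for l in range(1, k)
    )


def snp_detection_high_coverage(p: float, k: int) -> float:
    """Equation (2): limit of equation (1) for large coverage."""
    return 1.0 - p ** k - (1.0 - p) ** k


if __name__ == "__main__":
    # Illustrative values only: allele frequency 0.1, k = 10 individuals.
    for lam in (0.5, 2.0, 10.0):
        print(lam, snp_detection_individual(0.1, 10, lam))
    print("limit", snp_detection_high_coverage(0.1, 10))
```

For low coverage the detection probability stays below the limit in (2), because individuals without any read at the site cannot contribute an allele.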
We will now derive the probability of wrongly detecting a SNP due to sequencing
errors. A natural way to proceed for individual sequencing is to assume that the
most frequently read base for an individual is the true one. The probability that
this leads to the wrong decision depends on the number of reads available for
the locus under investigation, as well as the probability ε that a single read for
a given base is incorrect and furthermore on whether the errors are independent.
Concerning the dependence of the reading errors, we consider two extreme sce-
narios. The first, more pessimistic scenario, assumes complete dependence such
that sequencing errors at a given position always lead to the same incorrect base.
The second assumes independent errors such that each sequencing error leads to
an independently chosen wrong base. In this situation, we assume furthermore
that the three possible wrong bases are chosen with the same probability. We
expect the actual error probabilities to lie somewhere between these two scenarios.
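For the dependent case, we obtain by conditioning on the (Poisson) number of
reads for an individual at a locus
$$q_e^{(d)}(k, \lambda, \epsilon) = 1 - \left[ 1 - \sum_{r \ge 1} \left( \sum_{i > r/2} \binom{r}{i} \epsilon^{i} (1-\epsilon)^{r-i} \right) \frac{\lambda^{r}}{r!} \exp(-\lambda) \right]^{k}. \tag{3}$$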
In the independent case, an error is made by choosing one of the three incorrect
bases at random, each with probability ε/3. The probability of falsely detecting
a SNP is
$$q_e^{(i)}(k, \lambda, \epsilon) = 1 - \left[ 1 - \sum_{r \ge 1} \left( 3 \sum_{i > r/2} \binom{r}{i} (\epsilon/3)^{i} (1-(\epsilon/3))^{r-i} \right) \frac{\lambda^{r}}{r!} \exp(-\lambda) \right]^{k}. \tag{4}$$
The resulting error probabilities can be made very small by ensuring a coverage
λ that is large enough. Obviously a more sophisticated rule will be needed when
sequencing diploid individuals.
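The two error probabilities (3) and (4) can be approximated by truncating the sum over the Poisson number of reads r. The sketch below is only an illustration: the function names and the truncation point `r_max` are assumptions made here, not part of the derivation above.

```python
from math import comb, exp, sqrt


def majority_wrong(r: int, err: float) -> float:
    """P(more than half of r reads show one particular wrong base)."""
    return sum(
        comb(r, i) * err ** i * (1.0 - err) ** (r - i)
        for i in range(r // 2 + 1, r + 1)
    )


def false_snp_individual(k: int, lam: float, eps: float,
                         independent: bool = False) -> float:
    """Equation (3) (dependent errors) or equation (4) (independent errors)."""
    per_base = eps / 3.0 if independent else eps
    n_wrong_bases = 3 if independent else 1
    # Assumed truncation: Poisson mass beyond this point is negligible.
    r_max = int(lam + 10.0 * sqrt(lam) + 30)
    pmf = exp(-lam)  # Poisson probability of r = 0 reads
    wrong_call = 0.0
    for r in range(1, r_max + 1):
        pmf *= lam / r  # Poisson probability of exactly r reads
        wrong_call += n_wrong_bases * majority_wrong(r, per_base) * pmf
    # A false SNP arises if at least one of the k individuals is miscalled.
    return 1.0 - (1.0 - wrong_call) ** k


if __name__ == "__main__":
    for lam in (2.0, 5.0, 20.0):
        print(lam,
              false_snp_individual(10, lam, 0.01),
              false_snp_individual(10, lam, 0.01, independent=True))
```

With a majority rule, most of the remaining error probability at moderate coverage comes from individuals that happen to be covered by a single read.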
Pooling Experiment. We now assume that a pooled sample of size n is sequenced
with the same expected total number kλ of reads per locus as for separate
sequencing. Let $F^{(P)}(b, \gamma) = \sum_{i=0}^{b} \frac{\gamma^{i}}{i!} \exp(-\gamma)$ denote the probability that
a Poisson random variable with parameter γ is at most b. Given a frequency
$L_A = l$ of A alleles in the pool, we obtain the probability of reading at least one
A and one a allele as
$$\left[1 - F^{(P)}\left(0, \frac{lk\lambda}{n}\right)\right]\left[1 - F^{(P)}\left(0, \frac{(n-l)k\lambda}{n}\right)\right]. \tag{5}$$
Now this leads to the probability of detecting a SNP
$$\sum_{l=1}^{n-1} \left[1 - F^{(P)}\left(0, \frac{lk\lambda}{n}\right)\right]\left[1 - F^{(P)}\left(0, \frac{(n-l)k\lambda}{n}\right)\right] \binom{n}{l} p^{l} (1-p)^{n-l} \tag{6}$$
which occurs with a proportion p in the population.
As sequencing errors are common in NGS, they are easily confounded with low
frequency alleles. A common strategy to reduce the high probability of sequencing
errors is to consider only SNPs that are detected in at least b reads. Requiring a
minimum number b of reads in our context, the probability of detecting a SNP
changes to
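$$\sum_{l=1}^{n-1} \left[1 - F^{(P)}\left(b-1, \frac{lk\lambda}{n}\right)\right]\left[1 - F^{(P)}\left(b-1, \frac{(n-l)k\lambda}{n}\right)\right] \binom{n}{l} p^{l} (1-p)^{n-l}. \tag{7}$$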
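A short numerical sketch of equation (7) is given here for illustration; with b = 1 it reduces to equation (6). The function names are hypothetical, and the example values are arbitrary. For very large pools the binomial coefficient may exceed floating point range, so this direct evaluation is meant for moderate n only.

```python
from math import comb, exp


def poisson_cdf(b: int, gamma: float) -> float:
    """F^(P)(b, gamma) = P(Poisson(gamma) <= b); returns 0 for b < 0."""
    term = exp(-gamma)
    total = 0.0
    for i in range(b + 1):
        if i > 0:
            term *= gamma / i
        total += term
    return total


def snp_detection_pool(p: float, n: int, k: int, lam: float, b: int = 1) -> float:
    """Equation (7): both alleles must be seen in at least b reads."""
    total = 0.0
    for l in range(1, n):
        see_A = 1.0 - poisson_cdf(b - 1, l * k * lam / n)
        see_a = 1.0 - poisson_cdf(b - 1, (n - l) * k * lam / n)
        total += see_A * see_a * comb(n, l) * p ** l * (1.0 - p) ** (n - l)
    return total


if __name__ == "__main__":
    # Illustrative values: pool of n = 100, effort equal to k = 10, lambda = 5.
    for b in (1, 2, 3):
        print(b, snp_detection_pool(0.05, 100, 10, 5.0, b))
```

Raising b lowers the detection probability, which is the price paid for protection against sequencing errors.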
As with individual sequencing, we again derive the probability of wrongly detect-
ing a SNP under two scenarios for the sequencing errors.
In the dependent scenario, the probability of wrong SNP detection equals the
probability
$$p_e^{(d)}(k, \lambda, \epsilon, b) = \left(1 - F^{(P)}(b-1, \lambda k \epsilon)\right)\left[1 - F^{(P)}(0, \lambda k (1-\epsilon))\right] \tag{8}$$
of making at least b sequencing errors and getting at least one correct read. If
the expected number of reads λk is fairly large, the term $1 - F^{(P)}(0, \lambda k (1-\epsilon))$
is very close to one and can be omitted without changing the results much. With
independent sequencing errors, an upper bound for the probability of falsely
detecting a SNP is given by
$$p_e^{(i)}(k, \lambda, \epsilon, b) = 3\left[1 - F^{(P)}(b-1, \lambda k \epsilon/3)\right]. \tag{9}$$
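Equations (8) and (9) are simple enough to evaluate directly. The following sketch uses hypothetical function names and arbitrary example values; note that (9) is an upper bound and is therefore not clipped to one here.

```python
from math import exp


def poisson_cdf(b: int, gamma: float) -> float:
    """F^(P)(b, gamma) = P(Poisson(gamma) <= b); returns 0 for b < 0."""
    term = exp(-gamma)
    total = 0.0
    for i in range(b + 1):
        if i > 0:
            term *= gamma / i
        total += term
    return total


def false_snp_pool_dependent(k: int, lam: float, eps: float, b: int) -> float:
    """Equation (8): at least b identical errors and at least one correct read."""
    enough_errors = 1.0 - poisson_cdf(b - 1, lam * k * eps)
    some_correct = 1.0 - poisson_cdf(0, lam * k * (1.0 - eps))
    return enough_errors * some_correct


def false_snp_pool_independent_bound(k: int, lam: float, eps: float, b: int) -> float:
    """Equation (9): upper bound for independent errors (may exceed one)."""
    return 3.0 * (1.0 - poisson_cdf(b - 1, lam * k * eps / 3.0))


if __name__ == "__main__":
    # Illustrative values: k = 10, lambda = 5, error rate 1% per read.
    for b in (1, 2, 3, 4):
        print(b,
              false_snp_pool_dependent(10, 5.0, 0.01, b),
              false_snp_pool_independent_bound(10, 5.0, 0.01, b))
```

In this example the expected number of erroneous reads per site is λkε = 0.5, so requiring only one minor-allele read gives a false call at a large share of sites, while b = 3 reduces this considerably.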
Allele frequency inference. We consider a locus with expected relative frequency
p in the population. Suppose first that the individuals are sequenced
separately with an expected coverage of λ. Then the probability that a specific
locus is read for J = j of the k individuals is
$$r_{j,k} := \binom{k}{j} (1 - e^{-\lambda})^{j} e^{-(k-j)\lambda}.$$
Given that reads are available for J = j out of the k individuals, the relative
frequency of A alleles is $R_c := M_A/j$. The variance of $R_c$ can be obtained as
$$\mathrm{Var}(R_c) = \mathrm{Var}\left(\frac{M_A}{J}\right) = E\left(\mathrm{Var}\left[\frac{M_A}{J} \,\Big|\, J\right]\right) + \mathrm{Var}\left(E\left[\frac{M_A}{J} \,\Big|\, J\right]\right).$$
Now given J, $M_A$ is binomial B(J, p) distributed and $\mathrm{Var}\left(\frac{M_A}{J} \,\big|\, J\right) = p(1-p)/J$.
This leads to
$$E\left(\mathrm{Var}\left[\frac{M_A}{J} \,\Big|\, J\right]\right) = E(1/J)\,p(1-p).$$
Furthermore $E\left[\frac{M_A}{J} \,\big|\, J\right] = p$ and therefore $\mathrm{Var}\left(E\left[\frac{M_A}{J} \,\big|\, J\right]\right) = 0$. Together
$$\mathrm{Var}(R_c) = p(1-p)E(1/J) \ge p(1-p)/k.$$
We now turn to the pooling experiment, assuming again a population proportion,
p, of A alleles. With $L_A$ again denoting the number of A alleles in a pooled
sample of size n, we assume $M_A$ ($M_a$) reads of the A (a) allele from this sample.
This leads to $M = M_A + M_a$ reads for the site under investigation.
The relative frequency of the A allele estimated from the sample is then given
as $R_p = M_A/M$. According to our model M is Poisson Pois(kλ), and with
$U = (M, L_A)$, $M_A | U$ is binomial $B(M, L_A/n)$. We again decompose the variance into
$$\mathrm{Var}(R_p) = \mathrm{Var}\left(\frac{M_A}{M}\right) = E\left(\mathrm{Var}\left[\frac{M_A}{M} \,\Big|\, U\right]\right) + \mathrm{Var}\left(E\left[\frac{M_A}{M} \,\Big|\, U\right]\right).$$
Now $\mathrm{Var}\left(\frac{M_A}{M} \,\big|\, U\right) = \frac{1}{M}\,\frac{L_A}{n}\,\frac{n-L_A}{n}$ and $E\left(\frac{M_A}{M} \,\big|\, U\right) = \frac{L_A}{n}$. Together, we obtain
$$\mathrm{Var}(R_p) = E\left(\frac{1}{M}\right) \frac{n-1}{n}\, p(1-p) + \frac{p(1-p)}{n}.$$
In order to see which experimental setup leads to the smaller variance, we
consider the ratio
$$\frac{\mathrm{Var}(R_p)}{\mathrm{Var}(R_c)} = \frac{E\left(\frac{1}{M}\right)\frac{n-1}{n} + \frac{1}{n}}{E(1/J)}. \tag{10}$$
It is convenient that the ratio does not depend on the population proportion p of A
alleles anymore. For a large enough expected coverage λ we get E(1/J) ≈ 1/k and
E(1/M) ≈ 1/(kλ). Notice that the variance for the pooling experiment increases
when individuals contribute unequal amounts of probe material. According to
our simulations shown in the results section however, this variance component
can be kept small by choosing pools of large enough size.
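The ratio (10) can be evaluated numerically as a rough check of the large-coverage approximation. The sketch below makes one assumption that is not stated in the text: since 1/J and 1/M are undefined when a site receives no reads, both expectations are taken conditional on at least one read (J ≥ 1, M ≥ 1). All function names are hypothetical.

```python
from math import comb, exp, lgamma, log, sqrt


def mean_inverse_J(k: int, lam: float) -> float:
    """E(1/J | J >= 1) for J ~ Binomial(k, 1 - exp(-lam)) (conditioning assumed)."""
    hit = 1.0 - exp(-lam)
    probs = [comb(k, j) * hit ** j * (1.0 - hit) ** (k - j) for j in range(1, k + 1)]
    return sum(pr / j for j, pr in zip(range(1, k + 1), probs)) / sum(probs)


def mean_inverse_M(mu: float) -> float:
    """E(1/M | M >= 1) for M ~ Poisson(mu) (conditioning assumed)."""
    m_max = int(mu + 10.0 * sqrt(mu) + 30)  # truncation of the Poisson sum
    total = sum(
        exp(m * log(mu) - mu - lgamma(m + 1)) / m  # Poisson pmf in log space
        for m in range(1, m_max + 1)
    )
    return total / (1.0 - exp(-mu))


def variance_ratio(k: int, lam: float, n: int) -> float:
    """Equation (10): Var(R_p) / Var(R_c); values below one favor pooling."""
    return (mean_inverse_M(k * lam) * (n - 1) / n + 1.0 / n) / mean_inverse_J(k, lam)


def variance_ratio_approx(k: int, lam: float, n: int) -> float:
    """Large-coverage approximation with E(1/J) = 1/k and E(1/M) = 1/(k*lam)."""
    return (n - 1) / (n * lam) + k / n


if __name__ == "__main__":
    for n in (10, 20, 100, 500):
        print(n, variance_ratio(10, 30.0, n), variance_ratio_approx(10, 30.0, n))
```

Neither function depends on p, in line with the remark that the ratio is free of the population allele frequency.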
Allele frequency estimators for pooled samples that also take into account
quality scores of the individual reads have been discussed in Holt et al. (2009).
The computation of variances for these estimators would depend on the specific
assumptions of a probability model for the quality scores.
Estimating population genetic parameters. Two widely used summary sta-
tistics in population genetics are Tajima’s π and Watterson’s θ. We investigate
the influence of the two sequencing strategies on the accuracy of these summary
statistics. According to our simulations, both summary statistics show a sig-
nificantly smaller variance for pooled samples. However, in particular for small
pools, the estimators show some bias. The reason for the bias is that multiple
reads of the same sequence are entering the normalizing constant as indepen-
dently sampled sequences, if the estimators are computed in a standard way for
pooled samples. Sequencing errors also lead to bias, and if a minimum minor
allele frequency is required in order to make sequencing errors rare, this needs to
be taken into account. For individual sequencing, the effect of omitting single-
tons has been studied by Knudsen and Miyamoto (2009) as well as Achaz (2008).
Based on the expected values of Tajima’s π and Watterson’s θ, we introduce
modified normalizing constants that make the resulting estimators unbiased un-
der neutrality. These bias corrected estimators will then be compared with those
obtained from individual sequencing. (See the RESULTS section.)
We first derive a bias correction for Tajima’s π and start by considering a
locus for which M reads are available. We do not consider sequencing errors for
the moment, and focus on the bias that is caused by possibly reading the same
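sequence more than once. Let $\Delta_{ij}$ denote the number of differences between the
sequences i and j at this locus that are selected randomly with replacement from
the pool of n individuals. Now for this locus
$$\begin{aligned}
E\hat{\theta}_\pi &= E \sum_{i \neq j} \Delta_{ij} \Big/ \binom{M}{2} = E\Delta_{IJ} \\
&= E[\Delta_{IJ} \mid I = J]\,P(I = J) + E[\Delta_{IJ} \mid I \neq J]\,P(I \neq J) \\
&= 0 + \theta P(I \neq J) \\
&= \theta\,\frac{n-1}{n}.
\end{aligned} \tag{11}$$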
Therefore $\frac{n}{n-1}\hat{\theta}_\pi$ will be unbiased, if we neglect sequencing errors. Since this
bias correction only depends on the size n of the pool and not on the coverage
by reads, a bias corrected version of Tajima’s π for the entire sequence can be
obtained by adding up individual values of $\hat{\theta}_{\pi,l}$ for all loci and then multiplying
by $\frac{n}{n-1}$, leading to $\hat{\theta}'_\pi = \frac{n}{n-1} \sum_{l} \hat{\theta}_{\pi,l}$.
In order to also correct for sequencing errors, two approaches seem feasible. If
an unbiased estimate for the sequencing errors is available, such an estimate could
be used to correct $\hat{\theta}'_\pi$. Analogous to Achaz (2008, equation (1)) for the standard
experimental setup, $\hat{\theta}'_\pi - 2\frac{n}{n-1}\hat{\mu}_{err}$ will be unbiased, if $\hat{\mu}_{err}$ is an unbiased estimate
of the number of reading errors per sequence. Introducing $\hat{\mu}_{err}$ will obviously add
to the variance of the resulting estimator and the overall performance will depend
on the accuracy of $\hat{\mu}_{err}$. Another way to take into account sequencing errors is
to require a minimum minor allele frequency b for including a segregating site
in the analysis, and to ignore sequencing errors subsequently. The idea is that
sequencing errors will be rare if b is sufficiently large.
Again, we first consider a locus for which the coverage is equal to M. Let $\hat{\theta}^{(b)}_\pi$
denote the version of Tajima’s π where the minor allele frequency is required to
be at least b. Notice that $\hat{\theta}^{(b)}_\pi = \hat{\theta}_\pi$ for b = 1. With $K_m$ denoting the number of
sites where the derived allele A has frequency m, $\hat{\theta}^{(b)}_\pi$ may be written as
$$\hat{\theta}^{(b)}_\pi = \binom{M}{2}^{-1} \sum_{m=b}^{M-b} K_m\, m(M-m)$$
for a locus for which M reads are available (see section 1.4 in Durrett (2008)).
Let $c_n = \sum_{i=1}^{n-1} i^{-1}$, and let furthermore $X_M$ denote the number of A alleles
among the reads, and $Y_n$ the number of A alleles in the pool. Then
$$P(X_M = m \mid Y_n = r) = \binom{M}{m} (r/n)^{m} (1 - r/n)^{M-m}$$
and under neutrality $P(Y_n = r) = r^{-1}/c_n$. With $c_n\theta$ being the expected number
of segregating sites in the pool,
$$E(\hat{\theta}^{(b)}_\pi) = \binom{M}{2}^{-1} c_n \theta \sum_{m=b}^{M-b} \sum_{r=1}^{n-1} m(M-m)\, P(X_M = m \mid Y_n = r)\, P(Y_n = r). \tag{12}$$
For b = 1, straightforward calculations reproduce (11), i.e.
$$E(\hat{\theta}^{(1)}_\pi) = \theta\,\frac{n-1}{n}.$$
For b > 1 the sum does not simplify much, but can be computed and turned into
the bias correction factor
$$\binom{M}{2} \left[ \sum_{m=b}^{M-b} \sum_{r=1}^{n-1} m(M-m)\, P(X_M = m \mid Y_n = r)\, r^{-1} \right]^{-1}.$$
However, an accurate approximation for (12) can be obtained by assuming that
n is large compared to M. In this case
$$\sum_{r=1}^{n-1} P(X_M = m \mid Y_n = r)\, P(Y_n = r) \approx c_n^{-1}\, \frac{1}{m}$$
for 1 ≤ m ≤ M −1 and therefore
$$E(\hat{\theta}^{(b)}_\pi) \approx \theta\, \frac{M - 2b + 1}{M - 1}. \tag{13}$$
For b > 1, the resulting simple bias correction factor $\frac{M-1}{M-2b+1}$ turns out to provide
very good approximations, even if the pool size n is only moderately larger than
the number of reads M. Indeed, if singletons are omitted (b = 2), then the relative
error is only 0.4% when M = 10 and n = 20. For n = 200 and M = 50, the error
drops to 0.02% for b = 2 and $4 \cdot 10^{-5}$% for b = 3. Summarizing, we propose the
following bias corrected version of Tajima’s π:
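$$\hat{\theta}^{(b)*}_\pi = \begin{cases} \dfrac{n}{n-1}\, \hat{\theta}_\pi & \text{for } b = 1, \\[2ex] \dfrac{M-1}{M-2b+1}\, \hat{\theta}^{(b)}_\pi & \text{for } b > 1. \end{cases} \tag{14}$$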
To obtain an overall estimate based on L loci with possibly unequal coverage $M_l$
(1 ≤ l ≤ L), simply take the sum over the individually bias corrected estimates
$$\hat{\theta}^{(b)*}_\pi = \sum_{l=1}^{L} \hat{\theta}^{(b)*}_{\pi,l}. \tag{15}$$
Dividing $\hat{\theta}^{(b)*}_\pi$ by the total length of the considered sequence, an estimator for the
scaled mutation parameter per base results.
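To make the correction concrete, the sketch below computes the uncorrected estimator for one locus from derived-allele read counts, applies the correction factor (n/(n−1) for b = 1, and the large-pool factor (M−1)/(M−2b+1) that follows from (13) for b > 1), and evaluates the exact expectation (12) so that the quality of the approximation (13) can be checked. Function names and the example read counts are hypothetical.

```python
from math import comb


def tajima_pi_pool(derived_counts, M: int, b: int = 1) -> float:
    """Uncorrected estimator: sum of m(M-m) over sites with b <= m <= M-b,
    divided by the number of read pairs C(M, 2)."""
    kept = [m for m in derived_counts if b <= m <= M - b]
    return sum(m * (M - m) for m in kept) / comb(M, 2)


def correction_factor(M: int, n: int, b: int = 1) -> float:
    """n/(n-1) for b = 1; large-pool factor (M-1)/(M-2b+1) for b > 1."""
    if b == 1:
        return n / (n - 1)
    return (M - 1) / (M - 2 * b + 1)


def expected_ratio_exact(M: int, n: int, b: int = 1) -> float:
    """E(uncorrected estimator) / theta from (12), using c_n * P(Y_n = r) = 1/r."""
    total = 0.0
    for r in range(1, n):
        q = r / n
        for m in range(b, M - b + 1):
            total += m * (M - m) * comb(M, m) * q ** m * (1.0 - q) ** (M - m) / r
    return total / comb(M, 2)


if __name__ == "__main__":
    M, n = 10, 20
    counts = [1, 3, 5]  # hypothetical derived-allele read counts at three sites
    for b in (1, 2):
        raw = tajima_pi_pool(counts, M, b)
        print(b, raw, raw * correction_factor(M, n, b))
    approx = (M - 2 * 2 + 1) / (M - 1)
    print("relative error of (13), b = 2:",
          approx / expected_ratio_exact(M, n, 2) - 1.0)
```

Summing the corrected per-locus values over loci gives the overall estimate in (15).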
We now derive a bias correction for Watterson’s estimator, again first focusing
on a locus with coverage M. We consider a version of Watterson’s estimator that
requires a minimum minor allele frequency b. For b = 1 we use all segregating
sites, and versions that protect against sequencing errors can be obtained by
choosing b > 1. Let $S_b$ denote the number of segregating sites found in the M
sequence reads from the pool for which the minor allele frequency is at least b.
Then
$$\hat{\theta}^{(b)}_W := \frac{S_b}{\sum_{i=1}^{M-1} 1/i} \tag{16}$$
provides protection against sequencing errors if b is large enough. Analogous to
(12), we obtain that conditional on the number of reads M for the locus
$$E(\hat{\theta}^{(b)}_W \mid M) = \frac{c_n}{c_M}\, \theta \left[ \sum_{m=b}^{M-b} \sum_{r=1}^{n-1} P(X_M = m \mid Y_n = r)\, P(Y_n = r) \right]. \tag{17}$$
Let $F^{(B)}(x, M, p)$ denote the probability P(X ≤ x) that a binomial random variable X
with M trials and success probability p is at most x. In particular for p = r/n,
$$F^{(B)}(x, M, r/n) = \sum_{i=0}^{x} \binom{M}{i} \left(\frac{r}{n}\right)^{i} \left(1 - \frac{r}{n}\right)^{M-i}.$$
Recall furthermore that $c_M = \sum_{i=1}^{M-1} i^{-1}$. Then a bias corrected version of
$\hat{\theta}^{(b)}_W$ for b ≥ 1 is given as
$$\hat{\theta}^{(b)*}_W = \hat{\theta}^{(b)}_W\, \frac{c_M}{\sum_{r=1}^{n-1} \left[ F^{(B)}(M-b, M, r/n) - F^{(B)}(b-1, M, r/n) \right] \frac{1}{r}}. \tag{18}$$
As with Tajima’s π, $\hat{\theta}^{(b)*}_W$ can be easily adapted to work with longer sequences.
For this purpose, partition the sequence into L loci such that for each locus a
constant number of reads $M_l$ is available, and obtain the bias corrected Watterson
estimate $\hat{\theta}^{(b)*}_{W,l}$ separately for each locus l. Then
$$\hat{\theta}^{(b)*}_W = \sum_{l=1}^{L} \hat{\theta}^{(b)*}_{W,l} \tag{19}$$
provides an estimate of the overall scaled mutation parameter. Dividing $\hat{\theta}^{(b)*}_W$ by
the total length of the considered sequence, an estimator for the scaled mutation
parameter per base results.
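The bias correction (18) for Watterson's estimator can be sketched in the same way. The function names below are hypothetical, the example count of segregating sites is made up, and the code assumes M ≥ 2b so that the range b ≤ m ≤ M − b is not empty.

```python
from math import comb


def binom_cdf(x: int, M: int, q: float) -> float:
    """F^(B)(x, M, q) = P(X <= x) for X ~ Binomial(M, q); returns 0 for x < 0."""
    if x < 0:
        return 0.0
    return sum(
        comb(M, i) * q ** i * (1.0 - q) ** (M - i) for i in range(min(x, M) + 1)
    )


def watterson_correction(M: int, n: int, b: int = 1) -> float:
    """Factor multiplying the uncorrected estimator in (18); assumes M >= 2b."""
    c_M = sum(1.0 / i for i in range(1, M))
    denom = sum(
        (binom_cdf(M - b, M, r / n) - binom_cdf(b - 1, M, r / n)) / r
        for r in range(1, n)
    )
    return c_M / denom


def watterson_pool(S_b: int, M: int, n: int, b: int = 1) -> float:
    """Bias corrected Watterson estimate for one locus with coverage M."""
    c_M = sum(1.0 / i for i in range(1, M))
    return (S_b / c_M) * watterson_correction(M, n, b)


if __name__ == "__main__":
    # Hypothetical locus: 5 segregating sites among M = 10 reads.
    for n in (20, 200, 2000):
        print(n, watterson_correction(10, n, 1), watterson_pool(5, 10, n, 1))
```

For b = 1 the correction factor exceeds one and approaches one as the pool grows, reflecting that repeated reads of the same chromosome become rare in large pools.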
Results
SNP Detection.
For many biological applications SNP genotyping provides a cost effective ap-
proach, and SNP discovery is the first step required. We compared the efficiency
of SNP discovery using an approach in which each individual is sequenced sepa-
rately with a pooling approach. The panels of Figure 1 show that the comparative
efficiency of pooling depends both on the expected coverage and on the minimum
number of reads for allele calling used for error protection. While pooling experiments
provide a higher probability of SNP detection in most cases, they are expected
to be less efficient if the coverage is small and a high minimum number
of reads is required. This is not entirely unexpected, since an increased number
of reads required for the inference of the minor allele reduces the probability of
detecting SNPs in a pooling experiment. The higher the expected coverage, the
more inefficient individual sequencing becomes. As long as not chosen too small,
the size of the pool seems to play a less important role. Figure 2 addresses the
problem of wrongly identifying a sequencing error as a SNP. Irrespective of the
assumed model of sequencing errors (see methods section for further details), a
high probability of sequencing errors makes SNP calling from pools highly unre-
liable. On the other hand, if sequencing error rates are reduced (e.g. by quality
filtering), a suitable lower bound on the minimum allele frequency for detecting
a SNP makes pooling very reliable for the identification of SNPs. Interestingly,
in some cases, we found pooling to result in fewer erroneous SNP calls than
individual sequencing.
Allele frequency inference.
In population genetics, the allele frequency spectrum is of central interest. Es-
timating the allele frequency spectrum of a population is subject to sampling
variation. In an individual based sequencing strategy, most of the sampling vari-
ation comes from the selection of individuals used for DNA sequencing. The
advantage of the pooling approach is that this sampling error can be dramati-
cally reduced by including a large number of individuals in the pool. On the
other hand, a second level of sampling error arises in the pooling approach from
the fact that not all chromosomes in the pool are sequenced and some chromo-
somes may be sequenced more than once. We start by discussing the situation
where individuals contribute equal amounts of probe material and refer to the
last paragraph of the section for the case when this assumption is violated.
In the methods section, we obtained expression (10) for the ratio of the variances
of the estimated relative allele frequency both for a pooling experiment $R_p$
(pool size n), and a classical experiment with individual sequencing $R_c$. For a
large enough expected coverage λ and with k individuals sequenced, this equation
can be approximated by the following quick rule of thumb: Pooling will lead
to a smaller variance for those experimental setups that satisfy $1/\lambda + k/n < 1$ or
equivalently n/(n−k) < λ. Thus a case where pooling provides a better estimate
of the allele frequency is when the pool contains more than twice the number
of separately sequenced individuals and the coverage λ per separately sequenced
individual is at least two. For larger pools smaller values of λ will be sufficient.
So far we compared the individual based and pooling strategy only for the same
number of sequenced reads. Alternatively, the superiority of the pooling approach
could be expressed by the reduction of sequencing costs. Figure 3 compares the
pooling approach to sequencing of individuals when both methods provide the
same accuracy for allele frequency estimates. Suppose that k individuals are
sequenced separately, each at an expected coverage λ. Then $k'$ indicates the
cost in single genome sequencing equivalents that results in the same accuracy
as sequencing k genomes individually. If, for instance, k = 20 and $k' = 10$, then
pooling would give the same accuracy with half the sequencing effort, correspond-
ing to an individual sequencing project with 10 instead of 20 individuals. Figure 3
clearly indicates that larger pool sizes increase the advantage of sequencing pools.
A higher sequence coverage (λ) for sequencing of individuals further improves the
cost effectiveness of pooling.
In genome-wide association studies (GWA’s), the association between allele
frequencies and traits (diseases) is investigated. A possible approach is to test
whether alleles have different frequencies in two pools that differ with respect to
the trait of interest (see Sham et al. (2002)). Since the ratio of variances (10)
does not depend on the allele frequencies in the sub-populations, the standard
deviation entering the test statistic will differ by the square root of (10) between
a pooling and a classical experimental setup. If the square root of (10) is 1/2
(say), the shift of the expected value of the test statistic under the alternative
will be twice as large in a pooling experiment: Overall pooling will be the more
powerful approach, whenever the variance ratio is smaller than one (see (10)). It
should be noted however that the variance of the pooling experiment will become
larger if individuals contribute unequal amounts of probe material. This issue
will be addressed in the last paragraph of this section.
Estimating population genetic parameters.
We now compare the estimation of the scaled mutation parameter using Watter-
son’s θ and Tajima’s π under our two experimental setups. For this purpose, we
simulated 100 samples under neutrality with mutation parameter θ = 10, using
the ms software (Hudson, 2002).
(./ms 500 100 -t 10 > ms.out)
where 500 is the number of sequences generated. For separate sequencing, we took
random sub-samples of size k = 10 from each sample, thus simulating separate
sequencing of 10 individuals each with an expected number λ of reads. With
pooling, we took samples of size n out of the 500 simulated sequences. From
this pool, reads were taken independently for each locus l by making a random
number of draws $M_l$ with replacement. The quantities $M_l$ were chosen according
to a Poisson distribution with expected value kλ.
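The read sampling step described above can be sketched as follows. This is not the simulation code used for the study; it is a minimal standard-library illustration in which the function names, the 0/1 encoding of ancestral and derived alleles, and the example pool are assumptions.

```python
import math
import random


def poisson_draw(rng: random.Random, mean: float) -> int:
    """Knuth's multiplication method; adequate for means up to a few hundred."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def sample_pool_reads(pool_alleles, k: int, lam: float, rng: random.Random):
    """Draw M ~ Poisson(k * lam) reads with replacement from the pool at one site.

    pool_alleles holds one entry per pooled chromosome (1 = derived, 0 = ancestral).
    Returns (number of derived-allele reads, coverage M).
    """
    M = poisson_draw(rng, k * lam)
    reads = rng.choices(pool_alleles, k=M)  # sampling with replacement
    return sum(reads), M


if __name__ == "__main__":
    rng = random.Random(2010)
    pool = [1] * 20 + [0] * 80  # hypothetical pool of n = 100, derived frequency 0.2
    for _ in range(5):
        print(sample_pool_reads(pool, k=10, lam=30.0, rng=rng))
```

Because reads are drawn with replacement, the same chromosome can be read several times, which is the source of the bias addressed by the corrected estimators.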
For Tajima’s π, we used the bias correction (14) for individual loci and added
the estimates across loci using (15). For Watterson’s θ, the bias has been corrected
using formula (18) for each locus.
Neglecting sequencing errors for the moment, it turns out that the pooling
approach with bias correction leads to more accurate estimates of θ and π, pro-
vided that the size of the pool is large enough. For small pools, multiple reads of
the same chromosome become more common, which affects the accuracy of the
estimates negatively. (Figure 5.)
We now investigate the pooling approach when including a protection against
sequencing errors by removing all segregating sites where the minor allele
frequency x satisfies x = 1 or, alternatively, x ≤ 2. Again, the normalizing
constants have been adapted in order to avoid bias. Let b denote the minimum
required minor allele frequency.
Figure 5 shows the relative advantage of pooling conditional on different min-
imum minor allele frequencies. Pooling still leads to a decreased variance under
neutrality as long as the pool size is large enough. Not unexpectedly, the reduction
in variance is now somewhat smaller for Watterson’s $\hat{\theta}^{(b)*}_W$. The increase in
the variance of Tajima’s $\hat{\theta}^{(b)*}_\pi$ is much smaller, since frequency one minor alleles
receive a low weight in the calculation of π.
Unequal amounts of probe material. One obvious source of error in the
pooling approach is the heterogeneity in DNA amounts due to measurement
errors. In experiments that rely on PCR amplification, the heterogeneity can
be expected to be particularly strong.
Individuals for which a larger DNA amount has been included in the DNA
pool, will be over-represented, which potentially causes a change in allele fre-
quency estimates. This affects the bias and the variance also for our considered
population genetic summary statistics.
To investigate the sensitivity of population genetic estimates based on pool-
ing experiments, we simulated a scenario involving unequal amounts of probe
material. We set the expected amount of probe material to one, and allowed for
log-normally distributed multiplicative deviations from this expected value. More
specifically, the deviation factors have been chosen independently for each individual
contributing to the pool according to $\exp(X_i)$, where $X_i$ (1 ≤ i ≤ n) are
normal N(0, log(scale)) random variables. Thus the median amount of probe material
is always equal to one. If the deviation factor has a value of $\exp(X_i) = 1.5$,
this means that the respective individual will have a 50% higher chance of being
sequenced than another with a factor of $\exp(X_i) = 1$. Similarly, a value of 0.8
means a 20% decreased chance of being read.
In our first scenario (scale = 2), slightly more than 30% of all individuals
differed at least twofold from the median. In other words, for a pool of size
n = 100, the most abundant individual contributed about sixteen times the probe
material of the least abundant individual. We also simulated a more extreme
scenario (scale = 8) where about 30% of the individuals differed at least eightfold
from the median. As further parameters we chose λ = 30, k = 10, n ∈ [5, 200].
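The heterogeneity model can be sketched in a few lines. The code below interprets log(scale) as the standard deviation of the normal variables $X_i$; this is an assumption made here, chosen because it reproduces the statement that slightly more than 30% of individuals differ at least scale-fold from the median. Function names are hypothetical.

```python
import math
import random


def deviation_factors(n: int, scale: float, rng: random.Random):
    """Log-normal factors exp(X_i) with X_i ~ N(0, sd = log(scale)) (assumed sd)."""
    sd = math.log(scale)
    return [math.exp(rng.gauss(0.0, sd)) for _ in range(n)]


def read_probabilities(factors):
    """Chance that a given read stems from each individual in the pool."""
    total = sum(factors)
    return [f / total for f in factors]


def expected_fraction_beyond_scale() -> float:
    """P(|X_i| >= log(scale)) = P(|Z| >= 1) for a standard normal Z."""
    return 1.0 - math.erf(1.0 / math.sqrt(2.0))


if __name__ == "__main__":
    rng = random.Random(8)
    factors = deviation_factors(100, 2.0, rng)
    observed = sum(1 for f in factors if f >= 2.0 or f <= 0.5) / len(factors)
    print("expected share beyond twofold:", expected_fraction_beyond_scale())
    print("observed share in this draw:", observed)
    print("largest / smallest factor:", max(factors) / min(factors))
```

Under this reading, the expected share does not depend on the value of scale, since the threshold and the standard deviation change together.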
As the amount of heterogeneity in the sample will usually be unknown, we
applied the same bias correction as for equal amounts of probe material. We
measured the deviation from the true θ by the mean squared error, as this ac-
counts for bias and variance.
Figure 6 displays the effect of heterogeneity in probe material on the bias and
the variance of Tajima’s π and Watterson’s θ. Although both bias and variance
change noticeably for higher levels of heterogeneity, these effects cancel out to
a large extent. Thus the overall performance measured in terms of the mean
squared error
$$\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Var} \tag{20}$$
changes only marginally even for a large level of heterogeneity (scale = 8), see
Figure 7. This effect can be explained by shrinkage that leads to improved estimates
of the mutation parameter θ by permitting some bias (Futschik and
Gach, 2008).
Heterogeneity in probe material also affects the accuracy of the estimated allele
frequencies, as the variance of the estimator based on a pooled sample becomes
larger. However, this effect can be kept small by choosing a large enough pool.
This is illustrated in Figure 8, which shows that, for large enough pool sizes,
pooling eventually leads to smaller variances even for a high
level of heterogeneity in probe material (scale = 8).
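One way to see why a larger pool compensates for heterogeneity is an effective pool size. The sketch below is an illustrative extension and not a formula taken from the text: it assumes fixed read shares w_i, independent alleles across individuals, and the large-coverage approximations E(1/M) ≈ 1/(kλ) and E(1/J) ≈ 1/k. Under these assumptions the pooled variance is p(1 − p)[E(1/M)(1 − Σ w_i^2) + Σ w_i^2], which reduces to the equal-contribution case when all w_i = 1/n. The function names are hypothetical.

```python
def effective_pool_size(factors):
    """1 / sum(w_i^2) for read shares w_i proportional to the given factors."""
    total = sum(factors)
    return 1.0 / sum((f / total) ** 2 for f in factors)


def variance_ratio(n_eff, k, lam):
    """Approximate Var_pooled / Var_standard for allele frequency estimates.

    Uses E(1/M) ~ 1/(k*lam) and E(1/J) ~ 1/k (assumed large coverage).
    """
    inv_m = 1.0 / (k * lam)
    pooled = inv_m * (1.0 - 1.0 / n_eff) + 1.0 / n_eff
    standard = 1.0 / k
    return pooled / standard


if __name__ == "__main__":
    equal = [1.0] * 100
    uneven = [1.0] * 50 + [4.0] * 50   # illustrative unequal contributions
    for label, factors in (("equal", equal), ("uneven", uneven)):
        n_eff = effective_pool_size(factors)
        print(label, "n_eff:", n_eff, "ratio:", variance_ratio(n_eff, 10, 10.0))
```

Unequal contributions lower the effective pool size below n, so the variance ratio rises; increasing n raises the effective pool size again.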
Discussion
Over the past decades we have been witnessing a continuous turnover of
molecular markers used in genetic research. To a large extent this turnover has been
driven by the advances in molecular biology and technology. With the arrival of
the second generation sequencing technologies this race is about to come to an
end: rather than relying on a more or less representative fraction of the genome,
it has come within reach to have full genomic sequences available for multiple
individuals.
With further technological advances, it is anticipated that it will become pos-
sible to sequence individual genomes at a cost that allows even small laboratories
to perform population analyses on a genome scale. Currently, this is not pos-
sible as the costs are still too high. In this study, we showed that sequencing
pools of individuals provides an excellent alternative that permits genome wide
polymorphism surveys at very moderate costs.
This is the first report systematically exploring the parameter range for which
DNA pooling provides an advantage compared to individual genome sequencing.
Our result that NGS of DNA pools often provides a reliable and cost-effective
means for genome-wide allele frequency estimates is supported by some recent
studies using NGS to analyze DNA pools of selected genomic regions. Van Tassell
et al. (2008) sequenced a complexity reduced DNA pool using the Illumina
Genome Analyzer. For a subset of the identified SNPs, they compared the allele
frequency estimates from the Illumina sequencing to those obtained by genotyping
the same individuals. Although the SNP frequency estimates were undoubtedly
affected by a substantial assignment error (Palmieri and Schlötterer, 2009) due
to the short reads and the complexity reducing procedure, Van Tassell et al.
(2008) observed a correlation of 0.67 between the two methods. Hence, there is
very little doubt that NGS is an effective tool to provide accurate genome-wide
allele frequency estimates from DNA pools.
We anticipate that the analysis of DNA pools will have a wide range of
applications. In population genetics, it will be possible to compare patterns of
differentiation on a genomic scale. Thus, patterns of local adaptation and
heterogeneity in gene flow among different genomic regions can be identified. DNA
pools are also very powerful for association mapping (Sham et al., 2002). In
contrast to SNP arrays, however, re-sequencing of DNA pools will always include the
causative SNP and thus provide a higher statistical power. Our study provides
the basis for an adequate experimental design of future pooling experiments.
ACKNOWLEDGEMENTS
This work has been supported by a WWTF grant to AF and CS as well as FWF
Grants (P 19832-B11, L403-B11) awarded to CS. Special thanks to C. Kosiol, N.
de Maio, and R. Kofler for helpful comments on earlier versions of the manuscript.
We are grateful to A. Vasemägi and J. Wolf for general discussions about pooling
for NGS, and also thank the reviewers for helpful comments.
References
Achaz, G., 2008. Testing for neutrality in samples with sequencing errors.
Genetics 179, 1409–1424.
Durrett, R., 2008. Probability Models for DNA Sequence Evolution. Springer, New
York.
Eberle, M., Kruglyak, L., 2000. An analysis of strategies for discovery of
single-nucleotide polymorphisms. Genet. Epidemiol. 19, S29–S35.
Erlich, Y., Chang, K., Gordon, A., Ronen, R., Navon, O., Rooks, M.,
Hannon, G. J., 2009. DNA Sudoku-harnessing high-throughput sequencing for
multiplexed specimen analysis. Genome Research 19, 1243–1253.
Futschik, A., Gach, F., 2008. On the inadmissibility of Watterson’s estimate.
Theoretical Population Biology 73, 212–221.
Holt, K., Teo, Y., Li, H., Nair, S., Dougan, G., Wain, J., Parkhill, J., 2009.
Detecting SNPs and estimating allele frequencies in clonal bacterial populations
by sequencing pooled DNA. Bioinformatics 25, 2074–2075.
Hudson, R. R., 2002. Generating samples under a Wright-Fisher neutral model of
genetic variation. Bioinformatics 18, 337–338.
Jiang, R., Tavaré, S., Marjoram, P., 2009. Population genetic inference from
resequencing data. Genetics 181, 187–197.
Knudsen, B., Miyamoto, M. M., 2009. Accurate and fast methods to estimate
the population mutation rate from error prone sequences. BMC Bioinformatics
10:247, doi:10.1186/1471-2105-10-247.
Lynch, M., 2008. Estimation of nucleotide diversity, disequilibrium coefficients,
and mutation rates from high-coverage genome-sequencing projects. Mol. Biol.
Evol. 25, 2421–2431.
Lynch, M., 2009. Estimation of allele frequencies from high-coverage
genome-sequencing projects. Genetics 182, 295–301.
Palmieri, N., Schlötterer, C., 2009. Mapping accuracy of short reads from
massively parallel sequencing and the implications for quantitative expression
profiling. PLoS ONE 4, e6323.
Sham, P., Bader, J. S., Craig, I., O’Donovan, M., Owen, M., 2002. DNA pooling:
a tool for large-scale association studies. Nature Rev. Genet. 3, 862–871.
Van Tassell, C. P., Smith, T. P. L., Matukumalli, L. K., Taylor, J. F.,
Schnabel, R. D., Lawley, C. T., Haudenschild, C. D., Moore, S. S.,
Warren, W. C., Sonstegard, T. S., 2008. SNP discovery and allele
frequency estimation by deep sequencing of reduced representation libraries.
Nat. Methods.
Table 1. Description of our Notation

Symbol or Notation          Description
k                           number of haploid individuals used for separate sequencing
λ                           expected number of times a locus is read for an individual
                            using separate sequencing
n                           size of the pool in a pooling experiment
J                           random number of individuals for which reads are actually
                            available at a particular locus with individual sequencing
                            (J ≤ k)
M                           random number of reads for a particular locus in a pooling
                            experiment (E(M) = kλ)
p                           relative frequency of the allele of interest in the population
F_{(P)}(b, \gamma)          Poisson cumulative distribution function,
                            F_{(P)}(b, \gamma) = \sum_{i=0}^{b} \frac{\gamma^i}{i!} \exp(-\gamma)
F_{(B)}(x, M, p)            binomial cumulative distribution function,
                            F_{(B)}(x, M, p) = \sum_{i=0}^{x} \binom{M}{i} p^i (1-p)^{M-i}
\hat{\theta}_{\pi}^{(b)*}   bias corrected version of Tajima’s π for a pooling experiment
                            when the minor allele frequency is required to be at least b.
                            For b = 1, \hat{\theta}_{\pi}^{(b)*} = \hat{\theta}_{\pi}^{*}.
\hat{\theta}_{W}^{(b)*}     bias corrected version of Watterson’s θ for a pooling
                            experiment when the minor allele frequency is required to be
                            at least b ≥ 1.
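The two cumulative distribution functions in Table 1 can be evaluated directly from their defining sums. The following is a minimal sketch with hypothetical function names; it is meant for the moderate parameter values used in this paper (the Poisson version starts from exp(−γ), which underflows for very large γ).

```python
import math


def poisson_cdf(b, gamma):
    """F_(P)(b, gamma) = sum_{i=0}^{b} gamma^i / i! * exp(-gamma)."""
    if b < 0:
        return 0.0
    term = math.exp(-gamma)           # i = 0 term
    total = term
    for i in range(1, b + 1):
        term *= gamma / i             # gamma^i / i! * exp(-gamma), built recursively
        total += term
    return min(total, 1.0)


def binomial_cdf(x, m, p):
    """F_(B)(x, M, p) = sum_{i=0}^{x} C(M, i) p^i (1 - p)^(M - i)."""
    if x < 0:
        return 0.0
    upper = min(x, m)
    total = sum(math.comb(m, i) * p ** i * (1.0 - p) ** (m - i)
                for i in range(upper + 1))
    return min(total, 1.0)


if __name__ == "__main__":
    print(poisson_cdf(0, 2.0), math.exp(-2.0))
    print(binomial_cdf(1, 2, 0.5))
```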
[Figure 1 plots: six panels (A: λ=5, B: λ=10, C: λ=20; pool size n=50 in the
left column, n=200 in the right column). Horizontal axis: freq in pop. (0.0 to
0.5); vertical axis: prob of SNP detection (0.0 to 1.0).]
Figure 1. Probability of detecting a SNP with relative minor allele
frequency p in the population when a certain minimum number
of reads is required as a detection threshold. The colored lines
indicate the probabilities for sequencing experiments using a pooled
sample (purple-dashed: no error correction, red-dotted: minor allele
frequency (m.a.f.) at least 2, blue-dash-dot: m.a.f. ≥ 4, green
long dashed: m.a.f. ≥ 6). Solid black line: Experiment where
k = 10 haploid individuals are sequenced separately. Expected
coverage λ = 5 (A), 10 (B), 20 (C) per individual. For pooling
experiments, the expected total coverage is kλ. Pool sizes are either
50 (left column) or 200 (right column).
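The quantities plotted in Figure 1 can be sketched numerically from the detection probabilities derived in the methods section: for separate sequencing, the sum over l of (1 − e^{−λl})(1 − e^{−λ(k−l)}) weighted by binomial probabilities, and for pooling, the corresponding sum with Poisson read counts of rates lkλ/n and (n − l)kλ/n and a minimum of b reads per allele. The code below is a hedged illustration with hypothetical function names, assuming 0 < p < 1; it is not the code used to produce the figure.

```python
import math


def poisson_cdf(b, gamma):
    """P(X <= b) for X ~ Poisson(gamma); intended for moderate gamma."""
    if b < 0:
        return 0.0
    term = math.exp(-gamma)
    total = term
    for i in range(1, b + 1):
        term *= gamma / i
        total += term
    return min(total, 1.0)


def binomial_pmf(l, n, p):
    """P(L_A = l) for L_A ~ Binomial(n, p), on the log scale (needs 0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(l + 1) - math.lgamma(n - l + 1)
               + l * math.log(p) + (n - l) * math.log(1.0 - p))
    return math.exp(log_pmf)


def detect_individual(p, k, lam):
    """SNP detection probability when k individuals are sequenced separately."""
    total = 0.0
    for l in range(1, k):
        q_c = (1.0 - math.exp(-lam * l)) * (1.0 - math.exp(-lam * (k - l)))
        total += q_c * binomial_pmf(l, k, p)
    return total


def detect_pooled(p, n, k, lam, b=1):
    """SNP detection probability for a pool of size n, at least b reads per allele."""
    total = 0.0
    for l in range(1, n):
        seen_A = 1.0 - poisson_cdf(b - 1, l * k * lam / n)
        seen_a = 1.0 - poisson_cdf(b - 1, (n - l) * k * lam / n)
        total += seen_A * seen_a * binomial_pmf(l, n, p)
    return total


if __name__ == "__main__":
    for p in (0.05, 0.1, 0.3):
        print(p, detect_individual(p, 10, 5.0),
              [detect_pooled(p, 200, 10, 5.0, b) for b in (1, 2, 4, 6)])
```

For large λ the separate-sequencing value approaches 1 − p^k − (1 − p)^k, which gives a quick consistency check of the implementation.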
[Figure 2 plots: six panels (A: λ=5, B: λ=10, C: λ=20; errors dependent in the
left column, errors i.i.d. in the right column). Horizontal axis: log_10
sequencing error prob. (−4.0 to −1.0); vertical axis: log_10 prob. of false SNP
calling (−8 to 0).]
Figure 2. Log-probability of falsely detecting a SNP at a
non-segregating site, as a function of the logarithm of the sequencing
error probability. The colored lines indicate the probabilities for
sequencing experiments using a pooled sample (purple-dashed: no
error correction, red-dotted: minor allele frequency (m.a.f.) at least
2, blue-dash-dot: m.a.f. ≥ 4, green long dashed: m.a.f. ≥ 6).
Solid black line: Experiment where k = 10 haploid individuals are
sequenced separately and the most frequently read base at a
position is chosen for the sequenced individual. Expected coverage
λ = 5 (A), 10 (B), 20 (C) per individual. For pooling experiments,
the expected total coverage is kλ. Since the pool size is not
relevant in this context, we plot results for completely dependent (left
column) and independent sequencing errors (right) instead. See
the methods section for a more detailed description of these
scenarios.
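For the pooled curves in Figure 2, the methods section gives two expressions: with completely dependent errors, the probability of at least b erroneous reads and at least one correct read, (1 − F_(P)(b − 1, λkε))(1 − F_(P)(0, λk(1 − ε))); with independent errors, the upper bound 3(1 − F_(P)(b − 1, λkε/3)). The sketch below evaluates both. The function names are hypothetical, the upper tail is summed directly to keep very small probabilities accurate, and capping the bound at one is a choice made here rather than part of the original formula.

```python
import math


def poisson_cdf(b, gamma):
    """P(X <= b) for X ~ Poisson(gamma); intended for moderate gamma."""
    if b < 0:
        return 0.0
    term = math.exp(-gamma)
    total = term
    for i in range(1, b + 1):
        term *= gamma / i
        total += term
    return min(total, 1.0)


def poisson_sf(b, gamma):
    """P(X >= b) = 1 - F_(P)(b - 1, gamma), stable for small tail values."""
    if b <= 0:
        return 1.0
    if gamma <= 0.0:
        return 0.0
    if b <= gamma:
        return 1.0 - poisson_cdf(b - 1, gamma)
    # b > gamma: terms decrease from here on, so sum the tail directly.
    term = math.exp(-gamma + b * math.log(gamma) - math.lgamma(b + 1))
    total = 0.0
    i = b
    while term > 1e-17 * total:
        total += term
        i += 1
        term *= gamma / i
    return total


def false_snp_dependent(k, lam, eps, b):
    """At least b identical erroneous reads and at least one correct read."""
    return poisson_sf(b, lam * k * eps) * poisson_sf(1, lam * k * (1.0 - eps))


def false_snp_independent_bound(k, lam, eps, b):
    """Upper bound for independent errors, capped at one (cap is an addition)."""
    return min(1.0, 3.0 * poisson_sf(b, lam * k * eps / 3.0))


if __name__ == "__main__":
    for log_eps in (-4, -3, -2):
        eps = 10.0 ** log_eps
        print(log_eps, [math.log10(false_snp_dependent(10, 10.0, eps, b))
                        for b in (1, 2, 4, 6)])
```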
[Figure 3 plots: two panels, (a) lambda=5 and (b) lambda=10. Horizontal axis:
k (5 to 40); vertical axis: corresponding k with pooling (0 to 40).]
Figure 3. Sequencing effort of a pooling experiment (expressed as the
corresponding k with pooling) needed in order to get allele frequency
estimates with the same accuracy as in
a standard experimental setup where k individuals are sequenced
separately. (“o”: pool size n = 50, “+”: n = 100, “x”: n = 500.)
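A rough version of the comparison in Figure 3 follows from equating the two variances of the allele frequency estimators. The sketch below is an approximation and not the computation behind the figure: it assumes the large-coverage approximations E(1/M) ≈ 1/(k_pool λ) and E(1/J) ≈ 1/k, so that (n − 1)/(n k_pool λ) + 1/n = 1/k, which gives k_pool = k(n − 1)/(λ(n − k)) for n > k. The name k_pool and the function name are hypothetical.

```python
import math


def pooled_effort(k, n, lam):
    """Approximate pooling effort k_pool matching separate sequencing of k individuals.

    Solves (n - 1) / (n * k_pool * lam) + 1 / n = 1 / k for k_pool.
    Returns infinity when n <= k, where no finite effort can match.
    """
    if n <= k:
        return math.inf
    return k * (n - 1) / (lam * (n - k))


if __name__ == "__main__":
    for n in (50, 100, 500):
        print(n, [round(pooled_effort(k, n, 5.0), 2) for k in (5, 10, 20, 40)])
```

Under these approximations the required effort stays far below k as long as the pool is much larger than k and grows sharply as k approaches n.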
[Figure 4 plots: two panels, Watterson’s theta and Tajima’s pi. Horizontal
axis: n (0 to 200); vertical axis: mean estimate (0 to 10).]
Figure 4. Expected value of the estimates obtained from pooled
samples depending on the pool size n: Watterson’s θ and Tajima’s
π. True value θ = 10 (green line). There is a considerable bias,
if n is small compared to kλ, illustrating the need to use a bias
correction with the estimates. Solid black line: λ = 30; red dashed
line: λ = 5. (For Tajima’s π, the bias does not depend on λ.)
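The bias shown in Figure 4 can be sketched from the expectations derived in the methods section: for Tajima’s π, E(\hat{θ}_π) = θ(n − 1)/n, and for Watterson’s θ with minimum minor allele count b, the expected value given M reads is θ times the sum over r = 1, ..., n − 1 of [F_(B)(M − b, M, r/n) − F_(B)(b − 1, M, r/n)]/r, divided by the sum of 1/i for i = 1, ..., M − 1. The code below makes the simplifying assumption of a fixed number of reads M instead of averaging over its Poisson distribution, and all function names are hypothetical.

```python
import math


def binomial_cdf(x, m, p):
    """F_(B)(x, M, p): P(X <= x) for X ~ Binomial(M, p)."""
    if x < 0:
        return 0.0
    upper = min(x, m)
    return sum(math.comb(m, i) * p ** i * (1.0 - p) ** (m - i)
               for i in range(upper + 1))


def tajima_expected_ratio(n):
    """E(theta_pi_hat) / theta for a pool of size n (no sequencing errors)."""
    return (n - 1) / n


def watterson_expected_ratio(n, m, b=1):
    """E(theta_W_hat^(b) | M = m) / theta for a pool of size n."""
    harmonic = sum(1.0 / i for i in range(1, m))
    detected = sum((binomial_cdf(m - b, m, r / n) - binomial_cdf(b - 1, m, r / n)) / r
                   for r in range(1, n))
    return detected / harmonic


def bias_corrected(estimate, expected_ratio):
    """Divide an uncorrected estimate by its expected ratio to remove the bias."""
    return estimate / expected_ratio


if __name__ == "__main__":
    theta = 10.0
    for n in (5, 20, 50, 200):
        print(n, theta * tajima_expected_ratio(n),
              theta * watterson_expected_ratio(n, 50))
```

Dividing by the expected ratio reproduces the factor n/(n − 1) for Tajima’s π and the correction based on binomial probabilities for Watterson’s θ.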
[Figure 5 plots: six panels, Watterson’s theta (left column) and Tajima’s pi
(right column) for m.a.f. ≥ 1, m.a.f. ≥ 2 and m.a.f. ≥ 3. Horizontal axis:
n (0 to 200); vertical axis: variance ratio (0.4 to 1.6).]
Figure 5. Variance ratio (Var_pooled/Var_standard) of the bias
corrected version of Watterson’s θ and Tajima’s π depending on the
pool size n. We consider pooling both without (minor allele
frequency (m.a.f.) ≥ 1), and with a protection (m.a.f. ≥ 2, m.a.f. ≥ 3)
against sequencing errors. (Only segregating sites with minor allele
frequency m.a.f. of at least the stated threshold are included.) The
horizontal green line denotes the break even ratio of one, where
both the pooled and the classical experiment lead to estimates
with equal variances. Pooling always performs better, as soon as
the size of the pool exceeds the number of separately sequenced
individuals. Solid black line: λ = 30; red dashed line: λ = 5.
Standard setup with k = 10 individuals sequenced separately.
[Figure 6 plots: four panels. First row: Watterson’s theta and Tajima’s pi,
vertical axis: bias (solid), sd (dashed) (0 to 10). Second row: Watterson’s
theta and Tajima’s pi, vertical axis: squared bias (solid), var (dashed)
(0 to 50). Horizontal axis in all panels: n (0 to 200).]
Figure 6. Bias (solid lines) and variance (dashed lines) of
Watterson’s θ and Tajima’s π depending on the extent of heterogeneity in
probe material. Black lines: moderate heterogeneity (scale = 2); red
lines: high heterogeneity (scale = 8). In the first row bias and
standard deviations are plotted for the population genetic estimates.
The second row contains the squared bias and the variance, which
add up to the mean squared error. (Further parameters: λ = 30,
k = 10; log-normal parameters: µ = 0, σ = log(scale).)
[Figure 7 plots: four panels, Watterson’s theta and Tajima’s pi for scale = 2
(top) and scale = 8 (bottom). Horizontal axis: n (0 to 200); vertical axis:
MSE ratio (0.0 to 1.5).]
Figure 7. Mean squared error ratio (MSE_pooled/MSE_standard) of
Watterson’s θ and Tajima’s π depending on the pool size n and
for λ = 30. Solid black line: The same amount of probe material
is available for all individuals. Red dashed line: the amount of
probe material differs from individual to individual according to
log-normal factors. For the top two panels, both curves are nearly
identical. The median factor is always one, and about 32% of all
probes deviate from it by a factor of more than the value
given by scale. For a scale value of two (for instance), 16% of probes
involve more than double the median probe amount, and another
16% contain less than one half the median amount. (Log-normal
parameters: µ = 0, σ = log(scale), scale ∈ {2, 8}.)
[Figure 8 plot: one panel. Horizontal axis: n (100 to 500); vertical axis:
variances ratios (0.0 to 3.0).]
Figure 8. Variance ratios (Var_pooled/Var_standard) when
estimating allele frequencies in the case where the amount of probe
material differs from individual to individual according to log-normal
factors. The median factor is always one, and with a scale of s,
about 32% of all probes deviate by a factor of more than s. For
the scale value s = 2 (for instance) 16% of probes involve more
than double the median probe amount, and another 16% contain
less than one half the median amount. Ratios smaller than one
indicate that pooling leads to estimates with a smaller variance.
Individual sequencing is carried out for ten individuals with an
expected coverage of λ = 10. Scales: s = 2 (red dashed line), s = 4
(green dotted line), s = 8 (blue dash-dotted line). (Log-normal
parameters: µ = 0, σ = log(scale), scale ∈ {2, 4, 8}.)



e.e. except for l = 0 or l = k. Generally speaking. where qc (l. Separate Sequencing of Individuals. Assuming that for each individual the number of reads at a particular locus is Poisson distributed with parameter λ. i. Given that exactly LA = l out of k individuals in the sample have an allele of type A. one might for instance sequence each of the k individuals on a separate Illumina lane with coverage λ. With the same sequencing effort. We will consider SNP detection both in the context of pooling experiments and for individual sequencing. For the convenience of the reader. we will consider an individual sequencing project where k individuals are sequenced each with an expected coverage λ. In pooling experiments. the expected coverage will then be kλ. To assess the performance of these two competing scenarios. This leads to the probability qc (l. Notice that for larger values of λ. i. We extend work by Eberle and Kruglyak (2000) on SNP detection. that in the case of diploid individuals. by which we mean that any given locus is sequenced λ times on average. k. and derive both the power and error rates for pooling experiments and for separate sequencing. SNP detection. (Ma ) denote the number of times allele A (a) is sequenced. leading to a total coverage of kλ. a simple way to control the probability of falsely detecting SNPs both in the haploid and in the diploid case is to require a certain minimum number of reads for the minor allele in order to call a SNP. Let MA . the probability of detecting polymorphism is equal to the probability of reading at least one of the A and one of the remaining a alleles in the sample. λ) := (1 − [exp(−λ)]l )(1 − [exp(−λ)]k−l ) for getting at least one “A” and one “a” read. Note. the probability of sequencing errors being interpreted as true SNPs can be reduced by a sufficiently high expected coverage if the genotype of an individual is inferred by the majority of reads. 
the distinction between sequencing errors and true SNPs is significantly more complicated. this probability is nearly one. λ) = 0. A SNP is detected at a site if the site is polymorphic. an experimental design that provides high power while keeping the probability of incorrectly detecting a SNP small.4 NGS EXPERIMENTS WITH POOLED SAMPLES Methods Throughout. In practice. any particular locus will be read kλ times on average from the pool consisting of n individuals. Then. For a comparable pooling experiment that involves the same amount of sequencing effort. we will look both at the power and at the probability of falsely calling a SNP due to sequencing errors. the probability of not covering the SNP locus for an individual is exp(−λ). Suppose now that our population size N is fairly large and that the relative frequency of allele A is p in the population. k. we summarize our notation in Table 1. if at least two alleles A and a are found in the sequenced sample. by conditioning on the number . When individuals are sequenced separately. the pool could be sequenced on k lanes simultaneously. will be preferable.

The probability of falsely detecting a SNP is (4) k    r r λ (i) 3 qe (k. ǫ) = 1 − 1 − (ǫ/3)i (1 − (ǫ/3))(r−i)  exp(−λ) . we obtain the probability of reading at least one . k. We expect the actual error probabilities somewhere between these scenarios. The first. λ) ≈ 1 − pk − (1 − p)k . r! i r≥1 i>r/2 The resulting error probabilities can be made very small by ensuring a coverage λ that is large enough. Given a frequency LA = l of A alleles in the pool. λ) = l=1 qc (l. Let F(P ) (b. A natural way to proceed for individual sequencing is to assume that the most frequently read base for an individual is the true one. we assume furthermore that the three possible wrong bases are chosen with the same probability. For the dependent case. r! i r≥1 i>r/2 We will now derive the probability of wrongly detecting a SNP due to sequencing errors. each with probability ǫ/3. λ. Pooling Experiment. γ) = i=0 γ exp(−γ) denote the probability that i! a Poisson random variable with parameter γ is at most b. we obtain that (2) q(p. In the independent case. In this situation. Obviously a more sophisticated rule will be needed when sequencing diploid individuals. We now assume that a pooled sample of size n is sequenced with the same expected total number kλ of reads per locus as for sepi b arate sequencing. k. ǫ) = 1 − 1 − ǫ (1 − ǫ)(r−i)  exp(−λ) . k. The second assumes independent errors such that each sequencing error leads to an independently chosen wrong base. assumes complete dependence such that sequencing errors at a given position always lead to the same incorrect base. Concerning the dependence of the reading errors. an error is made by choosing one of the three incorrect bases at random. λ) k l p (1 − p)k−l . the probability of detecting a SNP is approximately k−1 (1) q(p. l For large values of λ. we consider two extreme scenarios.NGS EXPERIMENTS WITH POOLED SAMPLES 5 l of A alleles in the sample. 
as well as the probability ǫ that a single read for a given base is incorrect and furthermore on whether the errors are independent. The probability that this leads to the wrong decision depends on the number of reads available for the locus under investigation. we obtain by conditioning on the (Poisson) number of reads for an individual at a locus k    r r i λ (d)  (3) qe (k. λ. more pessimistic scenario.

ǫ. n Now this leads to the probability of detecting a SNP n−1 (6) l=1 1 − F(P ) (0. λkǫ))[1 − F(P ) (0. With independent sequencing errors. λ. (n − l)kλ ) . We consider a locus with expected relative frequency p in the population. λ. they are easily confounded with low frequency alleles. lkλ ) n 1 − F(P ) (0. A common strategy to reduce the high probability of sequencing errors is to consider only SNPs that are detected in at least b reads. In the dependent scenario. lkλ ) n 1 − F(P ) (0. . b) = 3 1 − F(P ) (b − 1. The variance of Rc can be obtained as Var(Rc ) = Var MA J = E Var MA |J J + Var E MA |J J . Then the probability that a specific locus is read for J = j of the k individuals is rj. e Allele frequency inference. λk(1 − ǫ)) is very close to one and can be omitted without changing the results much. the probability of wrong SNP detection equals the probability (8) p(d) (k. an upper bound for the probability of falsely detecting a SNP is given by (9) p(i) (k.6 NGS EXPERIMENTS WITH POOLED SAMPLES A and one a allele as (5) 1 − F(P ) (0. we again derive the probability of wrongly detecting a SNP under two scenarios for the sequencing errors. Suppose first that the individuals are sequenced separately with an expected coverage of λ. Requiring a minimum number b of reads in our context. the term 1 − F(P ) (0. j Given that reads are available for J = j out of the k individuals. lkλ ) n 1 − F(P ) (b − 1. (n − l)kλ ) n n l p (1 − p)n−l . b) = (1 − F(P ) (b − 1. the relative frequency of A alleles is Rc := MA /j. λk(1 − ǫ))] e of making at least b sequencing errors and getting at least one correct read. the probability of detecting a SNP changes to n−1 (7) l=1 1 − F(P ) (b − 1. (n − l)kλ ) n n l p (1 − p)n−l l which occurs with a proportion p in the population. l As with individual sequencing. ǫ. If the expected number of reads λk is fairly large. λkǫ/3) . As sequencing errors are common in NGS.k := k (1 − e−λ )j e−(k−j)λ .

Furthermore E[ MA |J] = p and therefore Var E[ MA |J] = 0. of A alleles. we obtain Var(Rp ) = E n−1 p(1 − p) p(1 − p) + . this variance component can be kept small by choosing pools of large enough size. n + Var E MA |U M . the estimators show some bias. in particular for small pools. we assume MA (Ma ) reads of the A (a) allele from this sample. and with U = (M. (2009).NGS EXPERIMENTS WITH POOLED SAMPLES 7 Now given J. According to our simulations shown in the results section however. However. p. J MA |J J = p(1 − p)/J. Estimating population genetic parameters. Notice that the variance for the pooling experiment increases when individuals contribute unequal amounts of probe material. We investigate the influence of the two sequencing strategies on the accuracy of these summary statistics. Together J J Var(Rc ) = p(1 − p)E(1/J) ≥ p(1 − p)/k. This leads to M = MA + Ma reads for the site under investigation. For a large enough expected coverage λ we get E(1/J) ≈ 1/k and E (1/M) ≈ 1/(kλ). LA ). We now turn to the pooling experiment. LA /n). According to our simulations.p) distributed and Var This leads to MA E Var |J = E(1/J)p(1 − p). The relative frequency of the A allele estimated from the sample is then given as Rp = MA /M. We again decompose the variance into Var(Rp ) = Var Now Var MA |U M MA M = E Var and E 1 M MA |U M = LA . With LA again denoting the number of A alleles in a pooled sample of size n. both summary statistics show a significantly smaller variance for pooled samples. = 1 LA n−LA M n n MA |U M Together. The computation of variances for these estimators would depend on the specific assumptions of a probability model for the quality scores. Allele frequency estimators for pooled samples that also take into account quality scores of the individual reads have been discussed in Holt et al. According to our model M is Poisson P ois(kλ). 
(10) Var(Rc ) E(1/J) It is convenient that the ratio does not depend on the population proportion p of A alleles anymore. MA |U is binomial B(M. assuming again a population proportion. MA is binomial B(J. The reason for the bias is that multiple . n n In order to see which experimental setup leads to the smaller variance. Two widely used summary statistics in population genetics are Tajima’s π and Watterson’s θ. we consider the ratio 1 1 E M n−1 + n Var(Rp ) n = .

reads of the same sequence are entering the normalizing constant as independently sampled sequences, if the estimators are computed in a standard way for pooled samples. Sequencing errors also lead to bias, and if a minimum minor allele frequency is required in order to make sequencing errors rare, this needs to be taken into account. Based on the expected values of Tajima's π and Watterson's θ, we introduce modified normalizing constants that make the resulting estimators unbiased under neutrality. These bias corrected estimators will then be compared with those obtained from individual sequencing. (See the RESULTS section.)

We first derive a bias correction for Tajima's π and start by considering a locus for which M reads are available. We do not consider sequencing errors for the moment, and focus on the bias that is caused by possibly reading the same sequence more than once. Let $\Delta_{ij}$ denote the number of differences between the sequences i and j at this locus that are selected randomly with replacement from the pool of n individuals, and let I and J denote the pool members from which two distinct reads originate. Now for this locus

$$E\hat\theta_\pi = E\Big[\sum_{i<j}\Delta_{ij}\Big/\binom{M}{2}\Big] = E\Delta_{IJ} = E[\Delta_{IJ}\mid I=J]\,P(I=J) + E[\Delta_{IJ}\mid I\neq J]\,P(I\neq J) = 0 + \theta\,P(I\neq J) = \theta\,\frac{n-1}{n}. \qquad (11)$$

Therefore $\frac{n}{n-1}\hat\theta_\pi$ will be unbiased, if we neglect sequencing errors. Since this bias correction only depends on the size n of the pool and not on the coverage by reads, a bias corrected version of Tajima's π for the entire sequence can be obtained by adding up individual values of $\hat\theta_{\pi,l}$ for all loci and then multiplying by $\frac{n}{n-1}$, leading to $\hat\theta_\pi^{*} = \frac{n}{n-1}\sum_l \hat\theta_{\pi,l}$.

In order to also correct for sequencing errors, two approaches seem feasible. If an unbiased estimate for the sequencing errors is available, such an estimate could be used to correct $\hat\theta_\pi^{*}$. Analogous to Achaz (2008, equation (1)) for the standard experimental setup, if $\hat\mu_{err}$ is an unbiased estimate of the number of reading errors per sequence, $\hat\theta_\pi^{*} - 2\,\frac{n}{n-1}\,\hat\mu_{err}$ will be unbiased. Introducing $\hat\mu_{err}$ will obviously add to the variance of the resulting estimator and the overall performance will depend on the accuracy of $\hat\mu_{err}$.

Another way to take into account sequencing errors is to require a minimum minor allele frequency b for including a segregating site in the analysis, and to ignore sequencing errors subsequently. The idea is that sequencing errors will be rare if b is sufficiently large. For individual sequencing, the effect of omitting singletons has been studied by Knudsen and Miyamoto (2009) as well as Achaz (2008). Again, we first consider a locus for which the coverage is equal to M. Let $\hat\theta_\pi^{(b)}$ denote the version of Tajima's π where the minor allele frequency is required to be at least b. With $K_m$ denoting the number of sites where the derived allele A has frequency m, $\hat\theta_\pi^{(b)}$ may be written as

$$\hat\theta_\pi^{(b)} = \binom{M}{2}^{-1}\sum_{m=b}^{M-b} K_m\, m(M-m)$$

for a locus for which M reads are available (see section 1.4 in Durrett (2008)). Notice that $\hat\theta_\pi^{(b)} = \hat\theta_\pi$ for b = 1. Let $c_n = \sum_{i=1}^{n-1} 1/i$ and $Y_n$ the number of A alleles in the pool, and let furthermore $X_M$ denote the number of A alleles among the reads. Then

$$P(X_M = m \mid Y_n = r) = \binom{M}{m}(r/n)^m (1-r/n)^{M-m}$$

and under neutrality $P(Y_n = r) = r^{-1}/c_n$. With $c_n\theta$ being the expected number of segregating sites in the pool,

$$E(\hat\theta_\pi^{(b)}) = \binom{M}{2}^{-1} c_n\theta \sum_{m=b}^{M-b}\sum_{r=1}^{n-1} m(M-m)\,P(X_M = m\mid Y_n = r)\,P(Y_n = r). \qquad (12)$$

For b = 1, straightforward calculations reproduce (11), i.e. $E(\hat\theta_\pi^{(1)}) = \theta\,\frac{n-1}{n}$. For b > 1 the sum does not simplify much, but can be computed and turned into the bias correction factor

$$\Big[\binom{M}{2}^{-1}\sum_{m=b}^{M-b}\sum_{r=1}^{n-1} m(M-m)\,P(X_M = m\mid Y_n = r)\,r^{-1}\Big]^{-1}.$$

However, an accurate approximation for (12) can be obtained by assuming that n is large compared to M. In this case

$$\sum_{r=1}^{n-1} P(X_M = m\mid Y_n = r)\,P(Y_n = r) \approx c_n^{-1}\,\frac{1}{m} \quad\text{for } 1\le m\le M-1$$

and therefore

$$E(\hat\theta_\pi^{(b)}) \approx \theta\,\frac{M-2b+1}{M-1}. \qquad (13)$$

For b > 1, the resulting simple bias correction factor $\frac{M-1}{M-2b+1}$ turns out to provide very good approximations, even if the pool size n is only moderately larger than the number of reads M. Indeed, if singletons are omitted (b = 2), then the relative error is only 0.4% when M = 10 and n = 20. For n = 200 and M = 50, the error drops to 0.02% for b = 2 and $4\cdot 10^{-5}$% for b = 3. Summarizing, we propose the following bias corrected version of Tajima's π:

$$\hat\theta_\pi^{(b)*} = \begin{cases} \dfrac{n}{n-1}\,\hat\theta_\pi & \text{for } b = 1,\\[2mm] \dfrac{M-1}{M-2b+1}\,\hat\theta_\pi^{(b)} & \text{for } b > 1. \end{cases} \qquad (14)$$
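The correction factors for Tajima's π can be checked numerically. The following Python sketch is an illustration only and not part of the original analysis; all function names are hypothetical. It evaluates the double sum in (12) divided by θ (the constant c_n cancels against P(Y_n = r) = r^{-1}/c_n), and compares the resulting exact correction factor with the simple factors n/(n−1) and (M−1)/(M−2b+1).

```python
from math import comb


def pi_expectation_factor(M, n, b=1):
    """E(theta_pi^(b)) / theta for M reads from a pool of size n, following (12)."""
    total = 0.0
    for m in range(b, M - b + 1):
        inner = 0.0
        for r in range(1, n):
            p = r / n
            # P(X_M = m | Y_n = r) * r^{-1}; c_n cancels with P(Y_n = r)
            inner += comb(M, m) * p**m * (1 - p) ** (M - m) / r
        total += m * (M - m) * inner
    return total / comb(M, 2)


def pi_correction_exact(M, n, b=1):
    """Exact bias correction factor: the inverse of the expectation factor."""
    return 1.0 / pi_expectation_factor(M, n, b)


def pi_correction_approx(M, n, b=1):
    """Simple factors: n/(n-1) for b = 1, (M-1)/(M-2b+1) for b > 1."""
    return n / (n - 1) if b == 1 else (M - 1) / (M - 2 * b + 1)


def corrected_pi(derived_counts, M, n, b=1, exact=False):
    """Bias corrected Tajima's pi for one locus.

    derived_counts: for each segregating site, the number of the M reads
    that carry the derived allele (hypothetical input format).
    """
    raw = sum(m * (M - m) for m in derived_counts if b <= m <= M - b) / comb(M, 2)
    factor = pi_correction_exact(M, n, b) if exact else pi_correction_approx(M, n, b)
    return factor * raw


if __name__ == "__main__":
    exact = pi_correction_exact(10, 20, 2)
    approx = pi_correction_approx(10, 20, 2)
    print(f"exact {exact:.5f}, approx {approx:.5f}, rel. error {abs(approx / exact - 1):.4%}")
```

For M = 10, n = 20 and b = 2 the two factors should differ by roughly 0.4%, in line with the relative error quoted in the text; for b = 1 the computed expectation factor equals (n−1)/n up to rounding.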

To obtain an overall estimate based on L loci with possibly unequal coverage $M_l$ (1 ≤ l ≤ L), simply take the sum over the individually bias corrected estimates

$$\hat\theta_\pi^{(b)*} = \sum_{l=1}^{L}\hat\theta_{\pi,l}^{(b)*}. \qquad (15)$$

Dividing $\hat\theta_\pi^{(b)*}$ by the total length of the considered sequence, an estimator for the scaled mutation parameter per base results.

We now derive a bias correction for Watterson's estimator, again first focusing on a locus with coverage M. We consider a version of Watterson's estimator that requires a minimum minor allele frequency b. For b = 1 we use all segregating sites, and versions that protect against sequencing errors can be obtained by choosing b > 1. Let $S_b$ denote the number of segregating sites found in the M sequence reads from the pool for which the minor allele frequency is at least b. Then

$$\hat\theta_W^{(b)} := \frac{S_b}{\sum_{i=1}^{M-1} 1/i} \qquad (16)$$

provides protection against sequencing errors, if b is large enough. Analogous to (12), we obtain that conditional on the number of reads M for the locus

$$E(\hat\theta_W^{(b)}\mid M) = \theta\Big[\frac{c_n}{c_M}\sum_{m=b}^{M-b}\sum_{r=1}^{n-1} P(X_M = m\mid Y_n = r)\,P(Y_n = r)\Big]. \qquad (17)$$

Let $F_{(B)}(x, M, p)$ denote the probability $P(X\le x)$ for a binomial random variable X with M trials and success probability p. In particular for p = r/n,

$$F_{(B)}(x, M, r/n) = \sum_{i=0}^{x}\binom{M}{i}\Big(\frac{r}{n}\Big)^{i}\Big(1-\frac{r}{n}\Big)^{M-i}.$$

Recall furthermore that $c_M = \sum_{i=1}^{M-1} i^{-1}$. Then a bias corrected version of $\hat\theta_W^{(b)}$ for b ≥ 1 is given as

$$\hat\theta_W^{(b)*} = \frac{\hat\theta_W^{(b)}\,c_M}{\sum_{r=1}^{n-1}\frac{1}{r}\big[F_{(B)}(M-b, M, r/n) - F_{(B)}(b-1, M, r/n)\big]}. \qquad (18)$$

As with Tajima's π, $\hat\theta_W^{(b)*}$ can be easily adapted to work with longer sequences. For this purpose, partition the sequence into L loci such that for each locus a constant number of reads $M_l$ is available, and obtain the bias corrected Watterson estimate $\hat\theta_{W,l}^{(b)*}$ separately for each locus l. The overall estimate is then the sum of the per-locus estimates,

$$\hat\theta_W^{(b)*} = \sum_{l=1}^{L}\hat\theta_{W,l}^{(b)*}. \qquad (19)$$
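The bias corrected Watterson estimate (18) for a single locus can be evaluated directly, because the numerator of (18) equals S_b by (16). The Python sketch below is a minimal illustration under that reading; the function names and the input format are assumptions and not taken from the original work.

```python
from math import comb


def binom_cdf(x, M, p):
    """F_(B)(x, M, p) = P(X <= x) for X ~ Binomial(M, p)."""
    return sum(comb(M, i) * p**i * (1 - p) ** (M - i) for i in range(0, x + 1))


def watterson_denominator(M, n, b=1):
    """Denominator of (18): sum over r of [F_B(M-b) - F_B(b-1)] / r."""
    return sum(
        (binom_cdf(M - b, M, r / n) - binom_cdf(b - 1, M, r / n)) / r
        for r in range(1, n)
    )


def corrected_watterson(S_b, M, n, b=1):
    """Bias corrected Watterson estimate for one locus.

    S_b: number of segregating sites among the M reads whose minor allele
    count is at least b. The numerator of (18) is theta_W^(b) * c_M = S_b.
    """
    return S_b / watterson_denominator(M, n, b)


def corrected_watterson_total(loci, n, b=1):
    """Sum of per-locus estimates as in (19); loci is a list of (S_b, M_l) pairs."""
    return sum(corrected_watterson(S_b, M_l, n, b) for S_b, M_l in loci)
```

As a plausibility check, for b = 1 and a pool that is very large compared to M the denominator approaches c_M, so that the corrected estimate approaches the classical Watterson estimate for M independently sampled sequences.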

The sum (19) provides an estimate of the overall scaled mutation parameter. Dividing $\hat\theta_W^{(b)*}$ by the total length of the considered sequence, an estimator for the scaled mutation parameter per base results.

Results

We start by discussing the situation where individuals contribute equal amounts of probe material and refer to the last paragraph of the section for the case when this assumption is violated.

SNP Detection. For many biological applications SNP genotyping provides a cost effective approach, and SNP discovery is the first step required. We compared the efficiency of SNP discovery using an approach in which each individual is sequenced separately with a pooling approach. The panels of Figure 1 show that the comparative efficiency of pooling depends both on the expected coverage and on the minimum number of reads for allele calling used for error protection. The higher the expected coverage, the more inefficient individual sequencing becomes. As long as not chosen too small, the size of the pool seems to play a less important role. While pooling experiments provide a higher probability of SNP detection in most cases, pooling is expected to be less efficient in some cases, namely if both the coverage is small and a high minimum number of reads is required. This is not entirely unexpected, since an increased number of reads required for the inference of the minor allele reduces the probability of detecting SNPs in a pooling experiment.

Figure 2 addresses the problem of wrongly identifying a sequencing error as a SNP. Irrespective of the assumed model of sequencing errors (see methods section for further details), a high probability of sequencing errors makes SNP calling from pools highly unreliable. On the other hand, a suitable lower bound on the minimum allele frequency for detecting a SNP makes pooling very reliable for the identification of SNPs. Interestingly, if sequencing error rates are reduced (e.g. by quality filtering), we found pooling to result in fewer erroneous SNP calls than individual sequencing.

Allele frequency inference. In population genetics, the allele frequency spectrum is of central interest. Estimating the allele frequency spectrum of a population is subject to sampling variation. In an individual based sequencing strategy, most of the sampling variation comes from the selection of individuals used for DNA sequencing. The advantage of the pooling approach is that this sampling error can be dramatically reduced by including a large number of individuals in the pool. On the other hand, a second level of sampling error arises in the pooling approach from the fact that not all chromosomes in the pool are sequenced and some chromosomes may be sequenced more than once. In the methods section, we obtained expression (10) for the ratio of the variances of the estimated relative allele frequency both for a pooling experiment $R_p$ (pool size n), and a classical experiment with individual sequencing $R_c$.

For a large enough expected coverage λ and with k individuals sequenced, this equation can be approximated by the following quick rule of thumb: Pooling will lead to a smaller variance for those experimental setups that satisfy $1/\lambda + k/n < 1$ or equivalently $n/(n-k) < \lambda$. Thus a case where pooling provides a better estimate of the allele frequency is when the pool contains more than twice the number of separately sequenced individuals and the coverage λ per separately sequenced individual is at least two. For larger pools smaller values of λ will be sufficient. It should be noted however that the variance of the pooling experiment will become larger if individuals contribute unequal amounts of probe material. This issue will be addressed in the last paragraph of this section.

So far we compared the individual based and pooling strategy only for the same number of sequenced reads. Alternatively, the superiority of the pooling approach could be expressed by the reduction of sequencing costs, whenever the variance ratio is smaller than one (see (10)). Figure 3 compares the pooling approach to sequencing of individuals when both methods provide the same accuracy for allele frequency estimates. Suppose that k individuals are sequenced separately, each at an expected coverage λ. Then $k^{*}$ indicates the cost in single genome sequencing equivalents that results in the same accuracy as sequencing k genomes individually. If, for instance, k = 20 and $k^{*}$ = 10, then pooling would give the same accuracy with half the sequencing effort, corresponding to an individual sequencing project with 10 instead of 20 individuals. Figure 3 clearly indicates that larger pool sizes increase the advantage of sequencing pools. A higher sequence coverage (λ) for sequencing of individuals further improves the cost effectiveness of pooling.

In genome-wide association studies (GWA's), the association between allele frequencies and traits (diseases) is investigated. A possible approach is to test whether alleles have different frequencies in two pools that differ with respect to the trait of interest (see Sham et al. (2002)). Since the ratio of variances (10) does not depend on the allele frequencies in the sub-populations, the standard deviation entering the test statistic will differ by the square root of (10) between a pooling and a classical experimental setup. If the square root of (10) is 1/2 (say), the shift of the expected value of the test statistic under the alternative will be twice as large in a pooling experiment: Overall pooling will be the more powerful approach.

Estimating population genetic parameters. We now compare the estimation of the scaled mutation parameter using Watterson's θ and Tajima's π under our two experimental setups. For this purpose, we simulated 100 samples under neutrality with mutation parameter θ = 10, using the ms software (Hudson, 2002), (./ms 500 100 -t 10 > ms.out) where 500 is the number of sequences generated. For separate sequencing, we took random sub-samples of size k = 10 from each sample, thus simulating separate sequencing of 10 individuals each with an expected number λ of reads.

With pooling, we took samples of size n out of the 500 simulated sequences. From this pool, reads were taken independently for each locus l by making a random number of draws $M_l$ with replacement. The quantities $M_l$ were chosen according to a Poisson distribution with expected value kλ. For Watterson's θ, the bias has been corrected using formula (18) for each locus. For Tajima's π, we used the bias correction (14) for individual loci and added the estimates across loci using (15).

Neglecting sequencing errors for the moment, it turns out that the pooling approach with bias correction leads to more accurate estimates of θ and π, provided that the size of the pool is large enough. (Figure 5.) For small pools, multiple reads of the same chromosome become more common, which affects the accuracy of the estimates negatively.

We now investigate the pooling approach when including a protection against sequencing errors by removing all segregating sites where the minor allele frequency x satisfies x = 1 or alternatively x ≤ 2. Let b denote the minimum required minor allele frequency. Again, the normalizing constants have been adapted in order to avoid bias. Figure 5 shows the relative advantage of pooling conditional on different minimum minor allele frequencies. Pooling still leads to a decreased variance under neutrality as long as the pool size is large enough. Not unexpectedly, the reduction in variance is now somewhat smaller for Watterson's $\hat\theta_W^{(b)*}$. The increase in the variance of Tajima's $\hat\theta_\pi^{(b)*}$ is much smaller, since frequency one minor alleles receive a low weight in the calculation of π.

Unequal amounts of probe material. One obvious source of error in the pooling approach is the heterogeneity in DNA amounts due to measurement errors. Individuals for which a larger DNA amount has been included in the DNA pool will be over-represented, which potentially causes a change in allele frequency estimates. This affects the bias and the variance also for our considered population genetic summary statistics. In experiments that rely on PCR amplification, the heterogeneity can be expected to be particularly strong. To investigate the sensitivity of population genetic estimates based on pooling experiments, we simulated a scenario involving unequal amounts of probe material. We set the expected amount of probe material to one, and allowed for log-normally distributed multiplicative deviations from this expected value. More specifically, the deviation factors have been chosen independently for each individual contributing to the pool according to $\exp(X_i)$, where $X_i$ (1 ≤ i ≤ n) are normal N(0, log(scale)) random variables. Thus the median amount of probe material is always equal to one. If the deviation factor has a value of $\exp(X_i) = 1.5$, this means that the respective individual will have a 50% higher chance of being sequenced than another with a factor of $\exp(X_i) = 1$. Similarly, a value of 0.8 means a 20% decreased chance of being read.
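The read sampling scheme with unequal amounts of probe material can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the simulation code used for the results: the function names are hypothetical, log(scale) is taken to be the standard deviation of the X_i (as suggested by the statement that about 32% of the factors deviate by more than the scale), and only the standard library is used.

```python
import math
import random


def probe_factors(n, scale, rng):
    """Deviation factors exp(X_i) with X_i ~ N(0, sd = log(scale)).

    scale = 1 corresponds to equal amounts of probe material.
    """
    sigma = math.log(scale)
    return [math.exp(rng.gauss(0.0, sigma)) for _ in range(n)]


def poisson_draw(mean, rng):
    """Poisson variate: number of rate-1 exponential arrivals before time `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_reads(factors, k, lam, rng):
    """Reads for one locus: M_l ~ Poisson(k * lam) draws with replacement.

    Each pool member is drawn with probability proportional to its factor.
    Returns the indices of the pool members that were read.
    """
    M = poisson_draw(k * lam, rng)
    return rng.choices(range(len(factors)), weights=factors, k=M)


if __name__ == "__main__":
    rng = random.Random(2010)
    factors = probe_factors(100, scale=2.0, rng=rng)
    reads = sample_reads(factors, k=10, lam=30, rng=rng)
    print(len(reads), "reads;", len(set(reads)), "distinct pool members read")
```

Keeping the factors fixed across loci and calling the hypothetical `sample_reads` once per locus mirrors the description above, where reads are taken independently for each locus from the same pool.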

In our first scenario (scale = 2), slightly more than 30% of all individuals differed at least twofold from the median. In other words, for a pool of size n = 100, the most abundant individual contributed about sixteen times the probe material of the least abundant individual. We also simulated a more extreme scenario (scale = 8) where about 30% of the individuals differed at least eightfold from the median. As further parameters we chose λ = 30, k = 10, n ∈ [5, 200]. As the amount of heterogeneity in the sample will usually be unknown, we applied the same bias correction as for equal amounts of probe material. We measured the deviation from the true θ by the mean squared error, as this accounts for bias and variance. Interestingly, the performance (measured in terms of the MSE) changes only marginally, even for a high level of heterogeneity in probe material (scale = 8), see Figure 7.

Figure 6 displays the effect of heterogeneity in probe material on the bias and the variance of Tajima's π and Watterson's θ. Although both bias and variance change noticeably for higher levels of heterogeneity, these effects cancel out to a large extent. Thus the overall performance measured in terms of the mean squared error

$$\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Var} \qquad (20)$$

changes only marginally even for a large level of heterogeneity (scale = 8). This effect can be explained by shrinkage that leads to improved estimates of the mutation parameter θ by permitting some bias (Futschik and Gach, 2008).

Heterogeneity in probe material also affects the accuracy of the estimated allele frequencies, as the variance of the estimator based on a pooled sample becomes larger. However, by choosing a pool of a large enough size, this effect can be kept small. This is illustrated in Figure 8, where it can be seen that for large enough pool sizes pooling eventually leads to smaller variances even for a high level of heterogeneity in probe material (scale = 8).

Discussion

Over the past decades we have been witnessing a continuous turnover of molecular markers used in genetical research. To a large extent this turnover has been driven by the advances in molecular biology and technology. With the arrival of the second generation sequencing technologies this race is about to come to an end: rather than relying on a more or less representative fraction of the genome, it has come into reach to have full genomic sequences available for multiple individuals. With further technological advances, it is anticipated that it will become possible to sequence individual genomes at a cost that allows even small laboratories to perform population analyses on a genome scale. Currently, this is not possible as the costs are still too high.

In this study, we showed that sequencing pools of individuals provides an excellent alternative that permits genome wide polymorphism surveys at very moderate costs. This is the first report systematically exploring the parameter range for which DNA pooling provides an advantage compared to individual genome sequencing. Our result that NGS of DNA pools often provides a reliable and cost effective means for genome-wide allele frequency estimates is supported by some recent studies using NGS to analyze DNA pools of selected genomic regions. Van Tassell et al. (2008) sequenced a complexity reduced DNA pool using the Illumina Genome Analyzer. For a subset of the identified SNPs, they compared the allele frequency estimates from the Illumina sequencing to those obtained by genotyping the same individuals. Although the SNP frequency estimates were undoubtedly affected by a substantial assignment error (Palmieri and Schlötterer, 2009) due to the short reads and the complexity reducing procedure, Van Tassell et al. (2008) observed a correlation of 0.67 between the two methods. Hence, there is very little doubt that NGS is an effective tool to provide accurate genome-wide allele frequency estimates from DNA pools. Our study provides the basis for an adequate experimental design of future pooling experiments.

We anticipate that the analysis of DNA pools provides a wide range of applications. In population genetics, it will be possible to compare patterns of differentiation on a genomic scale. Thus, patterns of local adaptation and heterogeneity in gene flow among different genomic regions can be identified. Also for association mapping DNA pools are very powerful (Sham et al., 2002). In contrast to SNP arrays, however, re-sequencing of DNA pools will always include the causative SNP and thus provide a higher statistical power.

ACKNOWLEDGEMENTS

This work has been supported by a WWTF grant to AF and CS as well as FWF Grants (P 19832-B11, L403-B11) awarded to CS. We are grateful to A. Vasemägi and J. Wolf for general discussions about pooling for NGS. Special thanks to C. Kosiol, N. de Maio, and R. Kofler for helpful comments on earlier versions of the manuscript. We also thank the reviewers for helpful comments.

References

Achaz, G., 2008. Testing for neutrality in samples with sequencing errors. Genetics 179, 1409–1424.
Durrett, R., 2008. Probability Models for DNA Sequence Evolution. Springer, New York.
Eberle, M. A., Kruglyak, L., 2000. An analysis of strategies for discovery of single-nucleotide polymorphisms. Genet. Epidemiol. 19, S29–S35.
Erlich, Y., Chang, K., Gordon, A., Ronen, R., Navon, O., Rooks, M., Hannon, G. J., 2009. DNA Sudoku-harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Research 19, 1243–1253.
Futschik, A., Gach, F., 2008. On the inadmissibility of Watterson's estimate. Theoretical Population Biology 73, 212–221.
Holt, K. E., Teo, Y. Y., Li, H., Nair, S., Dougan, G., Wain, J., Parkhill, J., 2009. Detecting SNPs and estimating allele frequencies in clonal bacterial populations by sequencing pooled DNA. Bioinformatics 25, 2074–2075.
Hudson, R. R., 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18, 337–338.
Jiang, R., Tavaré, S., Marjoram, P., 2009. Population genetic inference from resequencing data. Genetics 181, 187–197.
Knudsen, B., Miyamoto, M. M., 2009. Accurate and fast methods to estimate the population mutation rate from error prone sequences. BMC Bioinformatics 10:247, doi:10.1186/1471-2105-10-247.
Lynch, M., 2008. Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects. Mol. Biol. Evol. 25, 2421–2431.
Lynch, M., 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182, 295–301.
Palmieri, N., Schlötterer, C., 2009. Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling. PLoS ONE 4, e6323+.
Sham, P., Bader, J. S., Craig, I., O'Donovan, M., Owen, M., 2002. DNA pooling: A tool for large-scale association studies. Nature Rev. Genet. 3, 862–871.
Van Tassell, C. P., Smith, T. P. L., Matukumalli, L. K., Taylor, J. F., Schnabel, R. D., Lawley, C. T., Haudenschild, C. D., Moore, S. S., Warren, W. C., Sonstegard, T. S., 2008. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat. Methods 5, 247–252.

Table 1. Description of our Notation

Symbol or Notation          Description
k                           number of haploid individuals used for separate sequencing
λ                           expected number of times a locus is read for an individual using separate sequencing
n                           size of the pool in a pooling experiment
J                           random number of individuals for which reads are actually available at a particular locus with individual sequencing (J ≤ k)
M                           random number of reads for a particular locus in a pooling experiment (E(M) = kλ)
p                           relative frequency of the allele of interest in the population
$F_{(P)}(b, \gamma)$        Poisson cumulative distribution function ($F_{(P)}(b, \gamma) = \sum_{i=0}^{b} \frac{\gamma^i}{i!}\exp(-\gamma)$)
$F_{(B)}(x, M, p)$          binomial cumulative distribution function ($F_{(B)}(x, M, p) = \sum_{i=0}^{x} \binom{M}{i} p^i (1-p)^{M-i}$)
$\hat\theta_\pi^{(b)*}$     bias corrected version of Tajima's π for a pooling experiment when the minor allele frequency is required to be at least b. For b = 1, $\hat\theta_\pi^{(b)*} = \hat\theta_\pi^{*}$.
$\hat\theta_W^{(b)*}$       bias corrected version of Watterson's θ for a pooling experiment when the minor allele frequency is required to be at least b ≥ 1.

[Figure 1: six panels, A: λ=5, B: λ=10, C: λ=20, each for n=50 and n=200. x-axis: freq in pop.; y-axis: prob of SNP detection.]

Figure 1. Probability of detecting a SNP with relative minor allele frequency p in the population when a certain minimum number of reads is required as a detection threshold. Solid black line: Experiment where k = 10 haploid individuals are sequenced separately. The colored lines indicate the probabilities for sequencing experiments using a pooled sample (purple-dashed: no error correction, red-dotted: minor allele frequency (m.a.f.) at least 2, blue-dash-dot: m.a.f. ≥ 4, green long dashed: m.a.f. ≥ 6). Expected coverage λ = 5 (A), 10 (B), 20 (C) per individual. For pooling experiments, the expected total coverage is kλ. Pool sizes are either 50 (left column) or 200 (right column).

[Figure 2: six panels, A: λ=5, B: λ=10, C: λ=20, each for dependent errors and i.i.d. errors. x-axis: log_10 sequencing error prob.; y-axis: log_10-prob. of false SNP calling.]

Figure 2. Log-probability of falsely detecting a SNP at a nonsegregating site, in dependence on the logarithm of the sequencing error probability. Solid black line: Experiment where k = 10 haploid individuals are sequenced separately and the most frequently read base at a position is chosen for the sequenced individual. The colored lines indicate the probabilities for sequencing experiments using a pooled sample (purple-dashed: no error correction, red-dotted: minor allele frequency (m.a.f.) at least 2, blue-dash-dot: m.a.f. ≥ 4, green long dashed: m.a.f. ≥ 6). Expected coverage λ = 5 (A), 10 (B), 20 (C) per individual. For pooling experiments, the expected total coverage is kλ. Since the pool size is not relevant in this context, we plot results for completely dependent (left column) and independent sequencing errors (right) instead. See the methods section for a more detailed description of these scenarios.

[Figure 3: two panels, (a) lambda=5 and (b) lambda=10. x-axis: k; y-axis: corresponding k with pooling.]

Figure 3. Sequencing effort $k^{*}$ of a pooling experiment in order to get allele frequency estimates with the same accuracy as in a standard experimental setup where k individuals are sequenced separately. ("o": pool size n = 50, "+": n = 100, "x": n = 500.)

[Figure 4: two panels, Watterson's theta and Tajima's pi. x-axis: n; y-axis: mean estimate.]

Figure 4. Expected value of the estimates obtained from pooled samples depending on the pool size n: Watterson's θ and Tajima's π. Solid black line: λ = 30, red dashed line: λ = 5. True value θ = 10 (green line). There is a considerable bias, if n is small compared to kλ, illustrating the need to use a bias correction with the estimates. (For Tajima's π, the bias does not depend on λ.)

[Figure 5: six panels, Watterson's theta and Tajima's pi, each for m.a.f. ≥ 1, m.a.f. ≥ 2 and m.a.f. ≥ 3. x-axis: n; y-axis: variance ratio.]

Figure 5. Variance ratio ($Var_{pooled}/Var_{standard}$) of the bias corrected version of Watterson's θ and Tajima's π depending on the pool size n. Standard setup with k = 10 individuals sequenced separately. Solid black line: λ = 30, red dashed line: λ = 5. We consider pooling both without (minor allele frequency (m.a.f.) ≥ 1), and with a protection (m.a.f. ≥ 2, m.a.f. ≥ 3) against sequencing errors. (Only segregating sites with minor allele frequency m.a.f. above the stated threshold are included.) The horizontal green line denotes the break even ratio of one, where both the pooled and the classical experiment lead to estimates with equal variances. Pooling always performs better, as soon as the size of the pool exceeds the number of separately sequenced individuals.

[Figure 6: four panels, Watterson's theta and Tajima's pi. First row y-axis: bias (solid), sd (dashed); second row y-axis: squared bias (solid), var (dashed); x-axis: n.]

Figure 6. Bias (solid lines) and variance (dashed lines) of Watterson's θ and Tajima's π depending on the extent of heterogeneity in probe material. In the first row bias and standard deviations are plotted for the population genetic estimates. The second row contains the squared bias and the variance, that add up to the mean squared error. Black lines: moderate heterogeneity (scale = 2), red lines: high heterogeneity (scale = 8). (Further parameters: λ = 30, k = 10, log-normal parameters: µ = 0, σ = log(scale).)

[Figure 7: four panels, Watterson's theta and Tajima's pi, each for scale = 2 and scale = 8. x-axis: n; y-axis: MSE ratio.]

Figure 7. Mean squared error ratio ($MSE_{pooled}/MSE_{standard}$) of Watterson's θ and Tajima's π depending on the pool size n and for λ = 30. Solid black line: The same amount of probe material is available for all individuals. Red dashed line: the amount of probe material differs from individual to individual according to log-normal factors. The median factor is always one, and about 32% of all probes deviate by a factor of more than the value given by scale. For a scale value of two (for instance), 16% of probes involve more than double the median probe amount, and another 16% contain less than one half the median amount. For the top two panels, with a scale of two, both curves are nearly identical. (Log-normal parameters: µ = 0, σ = log(scale), scale ∈ {2, 8}.)

[Figure 8: one panel. x-axis: n; y-axis: variance ratios.]

Figure 8. Variance ratios ($Var_{pooled}/Var_{standard}$) when estimating allele frequencies in the case where the amount of probe material differs from individual to individual according to log-normal factors. Individual sequencing is carried out for ten individuals with an expected coverage of λ = 10. Ratios smaller than one indicate that pooling leads to estimates with a smaller variance. The median factor is always one, and with a scale of s, about 32% of all probes deviate by a factor of more than s. For the scale value s = 2 (for instance) 16% of probes involve more than double the median probe amount, and another 16% contain less than one half the median amount. Scales: s = 2 (red dashed line), s = 4 (green dotted line), s = 8 (blue dash-dotted line). (Log-normal parameters: µ = 0, σ = log(scale), scale ∈ {2, 4, 8}.)
