PII: S0040-5809(16)30025-9
DOI: http://dx.doi.org/10.1016/j.tpb.2016.06.004
Reference: YTPBI 2539
Please cite this article as: Kimmel, M., Wojdyła, T., Genetic demographic networks:
Mathematical model and applications. Theoretical Population Biology (2016),
http://dx.doi.org/10.1016/j.tpb.2016.06.004
This is a PDF file of an unedited manuscript that has been accepted for publication. As a
service to our customers we are providing this early version of the manuscript. The manuscript
will undergo copyediting, typesetting, and review of the resulting proof before it is published in
its final form. Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
Abstract
Recent improvement in the quality of genetic data obtained from extinct
human populations and their ancestors encourages searching for answers to
basic questions regarding human population history. The most common and
successful are model-based approaches, in which genetic data are compared
to the data obtained from the assumed demography model. Using such an
approach, it is possible to either validate or adjust the assumed demography. Model
fit to data can be obtained based on reverse-time coalescent simulations
or forward-time simulations. In this paper we introduce a computational
method, based on a mathematical equation, that yields joint
distributions of pairs of individuals under a specified demography model, each of
them characterised by a genetic variant at a chosen locus. The two individu-
als are randomly sampled from either the same or two different populations.
The model assumes three types of demographic events (split, merge and mi-
gration). Populations evolve according to the time-continuous Moran model
with drift and Markov-process mutation. This latter process is described by
the Lyapunov-type equation introduced by O’Brien and generalized in our
previous works. Application of this equation constitutes an original
contribution. In the Results section of the paper we present sample applications
of our model to both simulated and literature-based demographies. Among
1. Introduction
With increasing availability of genome-wide data, including data from
extinct species, new questions about history of species and populations are
asked and answered. Only in recent years have we obtained genome
sequences from three main hominins: Neanderthals [1], ancient modern
humans [2] and a high-quality 30-fold coverage genome of Denisovans [3]. These data
preserve traces of information about past events occurring in the population
history, such as founder effects, bottlenecks, migrations, admixtures and so
forth. Historically, there were two distinct approaches [4] to retrieve this
knowledge: the phylogeographic approach based on the gene-tree analysis,
and the summary statistics that concentrated on comparing some aspect
of data, such as the numbers of variable sites in or between populations.
More recently, model-based analysis [5] has combined both approaches. In this
approach, estimates of the parameters obtained from assumed demographies
are directly compared to real genetic data. An important version of this
approach is the reverse-time coalescent approach. In its Bayesian version,
the posterior probability of the model parameters given the data can be obtained, and in
this way different models can be compared. This approach has been widely
developed over recent years [6, 7]. Bayesian methods are frequently
computationally intensive [8]. Nevertheless, Approximate Bayesian Computation
(ABC) has been widely used to study models with more than three populations
and generally high complexity [9].
Over the last few years, many theories regarding ancient human history
have been tested using the methods described. For example, the authors of ref. [10]
discuss the ancestry of the Eastern European Neolithic farmers and infer
their Near East affinities. In reference [11], different models of human
colonization history were tested. As another example, genetic material of an
archaic hominin was found in the Denisova Cave in Siberia, and its genetic
history and relationship to modern humans from several populations, as
well as to Neanderthals, were discussed in refs. [12, 3] in order to infer human
population history from individual whole-genome sequences.
One approach to estimating the demographic parameters uses
model-based samples generated via forward-time generation-to-generation
or backward-time coalescent-based computer simulations. An extensive review
of such programs was given in reference [13], and more recently in
reference [14], which lists the following three simulation platforms as the leading
ones: quantiNEMO [15], ForSim [16] and simuPop [17]. The results obtained
from simulations need to be averaged over many runs in order to correctly
estimate population-specific values.
Here, we introduce a computational Demographic Network (DN) Model
that allows developing predictions of bivariate allele distributions in pairs
of individuals randomly sampled from the same or two different populations, given a
potentially complicated demography. The model assumes
three types of demographic events (split, merge and migration).
Populations evolve according to the time-continuous Moran model with drift and
Markov-process mutation. This latter process is described by the Lyapunov-
type equation introduced by O’Brien and generalized in our previous works;
for references see the recent short monograph [18]. This equation has been
generalized in several directions, including process with recombination [19].
In the current version of the DN model, recombination has not been taken
into account. In some applications, the DN Model may complement the
simulation models. Although the DN is not an alternative to simulation
methods in the practical sense, it provides an algorithm to compute pairwise
distributions of alleles, in the case of haploid non-recombining loci such as
mitochondrial and Y-chromosome loci in humans. The DN Model leads to
exact pairwise distributions, and usually runs very fast.
The DN Model is not intended to replace the demographic inference
procedures that infer demography from the site frequency spectrum and can
summarize whole-genome data from multiple populations. However, it is
still interesting to compare DN Model performance with other methods,
which have been summarized, for example, in ref. [20] or [13]. Many popular
methods developed so far limit, for computational reasons, the number of
populations in the model to a single population (BEAST [21]), two (IM [22],
IMa [23], [24]) or three (∂a∂i [20]) populations. Other methods with mul-
tiple populations have no subsequent migration after subpopulations split
[25]. Methods that consider multiple populations with migration often
assume a limited population size growth scenario (LAMARC [26], MIGRATE-N
[27]) or mutation model (no microsatellite model in BEAST [21] or a constant
mutation rate [26]). Some approaches allow using only a limited set of
summary statistics [28] or are computationally very intensive ([29] or [26] for
number of populations greater than 3). The authors of the Denisovans paper
[12] used Li and Durbin’s method based on the PSMC model [30]. The ap-
proach we use for comparison is simCoal [31] and its new versions fastsimcoal
[32] and fastsimcoal2 [33]. Our approach allows flexibility in the definition
of the mutation model (see Example 3, in which we consider the microsatellite
model) and the population growth scenario (we can accommodate any realistic
scenario). The DN Model can be coupled with a local search algorithm, such as
the Nelder-Mead search algorithm used in Example 2, to draw conclusions
about the early Neolithic transition from hunting-gathering to farming.
evolves under genetic drift and mutation between two consecutive network
events. We assume that the allelic state $X_a(t) \in A$ of the chromosome
sampled from population $a$ at time $t$ evolves as a time-continuous non-negative
Markov chain with transition intensity matrix $Q_a = \{q_{jk}^{(a)}\}$, $1 \le j, k \le N_A$,
where $q_{jk} \ge 0$ for $j \ne k$ and $\sum_k q_{jk} = 0$ for all $j$. We denote
$R_{ab}(j,k)(t) = P[X_a(t) = j, X_b(t) = k]$, where $j, k \in A$ and $0 \le a, b < \kappa_i$ if $t \in [t_i, t_{i+1})$. We assume
that the matrix $Q_a$ remains unchanged between two demographic events but
may vary among different populations or different time intervals (the state
space remaining the same). By $P_a(t) = \{p_{jk}^{(a)}(t)\}$, $1 \le j, k \le N_A$, we denote
the transition probability matrix corresponding to matrix $Q_a$. In the
finite-dimensional case (if $A$ is finite) we obtain $P_a(t) = e^{Q_a t}$. The same holds for
the uniform denumerable case, as explained in Section 6.10 of ref. [35].
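As an illustration of the relation between the intensity matrix and the transition probability matrix in the finite case, the following minimal sketch approximates exp(Qt) by scaling and squaring a truncated Taylor series. The two-allele chain and its rates `alpha` and `beta` are hypothetical values chosen for the example, and the helper names are ours rather than part of the authors' software.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, t, terms=30):
    """Approximate P(t) = exp(Q t) by scaling and squaring a Taylor series."""
    n = len(q)
    # Halve the exponent until its row-sum norm is at most 1.
    norm = max(sum(abs(v) for v in row) for row in q) * abs(t)
    squarings = math.ceil(math.log2(norm)) if norm > 1.0 else 0
    scaled = [[v * t / 2 ** squarings for v in row] for row in q]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for order in range(1, terms + 1):
        # term holds scaled**order / order!
        term = [[v / order for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical two-allele chain: rate alpha for 0 -> 1 and beta for 1 -> 0.
alpha, beta = 1e-3, 2e-3
Q = [[-alpha, alpha], [beta, -beta]]
P = transition_matrix(Q, 500.0)
```

For a two-state chain the result can be checked against the closed form, in which the entry for staying in state 0 is (beta + alpha·exp(-(alpha+beta)t))/(alpha+beta).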
Let us first consider two alleles randomly drawn from the same population
$a$, and assume that the MRCA of these two individuals with allelic types $a_j$
and $a_k$ ($a_j, a_k \in A$, $1 \le j, k \le N_A$) existed at time $T_{jk}$ before the present
time $t$. Based on the Kingman-Moran coalescent [36] with time-variable
population size $N_a(t)$, we obtain that $P[T_{jk} > \tau] = e^{-\int_0^{\tau} \frac{du}{N_a(t-u)}}$. The MRCA
can be of any allelic type $a_i$ with probability $\pi_i(t - T_{jk}) = P[X(t - T_{jk}) = a_i]$
and its descendants at the present time $t$ have types $a_j$ and $a_k$, respectively.
Then, summing over all possible values of $i$ and integrating over the range
of $T_{jk}$ results in [34]:
$$R_{aa}(j,k)(t) = \int_0^{\infty} \sum_{1 \le i \le N_A} \pi_i(t-\tau)\, p_{ij}(\tau)\, p_{ik}(\tau)\, \frac{1}{N_a(t-\tau)}\, e^{-\int_{t-\tau}^{t} \frac{du}{N_a(u)}}\, d\tau \qquad (1)$$
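To make expression (1) concrete, the sketch below evaluates it by trapezoidal quadrature for a deliberately simple special case that we assume for illustration only: two alleles, a symmetric mutation rate `mu` in each direction, a constant population size `n_eff`, and the uniform stationary distribution for the ancestral type. None of these choices comes from the paper's examples, and the function name is ours.

```python
import math

def joint_same_population(mu, n_eff, steps=20000, horizon=40.0):
    """Evaluate expression (1) for a symmetric two-allele chain.

    Assumptions of this sketch: constant size n_eff, mutation rate mu in
    each direction, ancestral type drawn from pi = (1/2, 1/2).
    """
    pi = (0.5, 0.5)
    t_max = horizon * n_eff          # truncation point of the tau integral
    h = t_max / steps
    r = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(steps + 1):
        tau = s * h
        weight = h * (0.5 if s in (0, steps) else 1.0)   # trapezoid rule
        decay = math.exp(-2.0 * mu * tau)
        # Transition probabilities of the symmetric two-state chain.
        p = [[0.5 * (1 + decay), 0.5 * (1 - decay)],
             [0.5 * (1 - decay), 0.5 * (1 + decay)]]
        # Density of the coalescence time for constant size.
        density = math.exp(-tau / n_eff) / n_eff
        for j in range(2):
            for k in range(2):
                inner = sum(pi[i] * p[i][j] * p[i][k] for i in range(2))
                r[j][k] += weight * density * inner
    return r

R = joint_same_population(mu=1e-4, n_eff=1000.0)
homozygosity = R[0][0] + R[1][1]
```

In this special case the integral can also be done by hand, giving a probability of identical types of 1/2 + 1/(2(1 + 4·n_eff·mu)), which provides a check on the quadrature.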
Expression (1) may be transformed into matrix notation, and we can
separate the evolution of the population in the time interval before $t = 0$ and
interpret it as the initial conditions [34]. This leads to the following equation:
$$R_{aa}(t) = P^T(t)\, R(0)\, P(t)\, e^{-\int_0^t \frac{du}{N_a(u)}} + \int_0^t P^T(t-\tau)\, \Pi(\tau)\, P(t-\tau)\, \frac{1}{N_a(\tau)}\, e^{-\int_{\tau}^{t} \frac{du}{N_a(u)}}\, d\tau, \qquad (2)$$
where $R_{aa}(t) = \{R_{aa}(j,k)(t)\}$, $P^T$ is the transpose of the matrix $P$ and
$\Pi(t)$ is a diagonal matrix with $\Pi(t)_{ii} = \pi_i(t)$.
The case when the two alleles are drawn from different populations $a$ and
$b$ is simpler as there is no coalescence. In summary,
$$R_{ab}(t) = \begin{cases}
P_a^T(t-t_i)\, R_{ab}(t_i)\, P_b(t-t_i)\, e^{-\int_{t_i}^{t} \frac{du}{N_a(u)}} + S_a(t_i, t) & a = b \\
P_a^T(t-t_i)\, R_{ab}(t_i)\, P_b(t-t_i) & a \ne b,
\end{cases} \qquad (3)$$
where $S_a(t_i, t) = \int_{t_i}^{t} P_a^T(t-\tau)\, \Pi(\tau)\, P_a(t-\tau)\, \frac{1}{N_a(\tau)}\, e^{-\int_{\tau}^{t} \frac{du}{N_a(u)}}\, d\tau$.
Rab (t) given by expression (3) is a mild solution [37] of the following
matrix differential equation related to the Lyapunov equation [38]:
Each population in the network is identified in the time interval $[t_i, t_{i+1})$
by a single index $k \in \{0, 1, \dots, \kappa_i - 1\}$. Numbering starts from 0 for
the convenience of software implementation. The index of the population
may change as a result of an event. If the index of the population is $k$,
$0 \le k < \kappa_i$, in the time interval $[t_i, t_{i+1})$, then we denote by $k'$, $0 \le k' < \kappa_{i-1}$,
the corresponding index in the preceding time interval $[t_{i-1}, t_i)$. If population
$x$ splits at time $t_i$, then
$$k = \begin{cases}
k' & k' \le x \\
x + 1 & \text{for a newly created population} \\
k' + 1 & k' > x.
\end{cases} \qquad (6)$$
i.e., the merged population has index k = x and the population with index
k = y is removed from the network.
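The index bookkeeping for a split can be sketched as below. The function returns, for every index in the new interval, the index that population had in the preceding interval. Mapping the newly created population (new index x + 1) back to its parent x is our own convention for the sketch, since a new population has no previous index of its own; the function name is also ours.

```python
def indices_after_split(kappa_prev, x):
    """Return a list whose entry k is the previous index of new population k.

    kappa_prev is the number of populations before the split of population x.
    Assumption of this sketch: the newly created population, which receives
    index x + 1, is mapped to its parent x.
    """
    if not 0 <= x < kappa_prev:
        raise ValueError("x must be an existing population index")
    previous = []
    for k in range(kappa_prev + 1):
        if k <= x:
            previous.append(k)        # indices up to x are unchanged
        elif k == x + 1:
            previous.append(x)        # newly created population
        else:
            previous.append(k - 1)    # later populations are shifted by one
    return previous

mapping = indices_after_split(3, 1)   # populations 0, 1, 2; population 1 splits
```

With three populations and a split of population 1, the new populations 0, 1, 2, 3 trace back to 0, 1, 1, 2.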
A migration event at time $t$ is described by the matrix $M(t) = \{m_{xy}(t)\}$, $0 \le x, y < \kappa_i$,
with entries $m_{xy}$, $0 \le m_{xy} \le 1$, $m_{xx} = 0$, being the migration
rates from population $x$ to $y$. Migration does not change the indices of the
populations. Such a migration model allows for high versatility in describing
the migration pattern among populations. Continuous migration can be
approximated by introducing either many migration events separated by short
time periods or a single migration event for each time period between other
network events. In the latter case, one needs to estimate the single-event migration
rates corresponding to the rates of the continuous migration model.
The population size of population $k$ at time $t \in [t_i, t_{i+1})$ is denoted by
$N_{ik}(t)$, $0 \le k < \kappa_i$.
notational ambiguity, we denote by Rab (j, k) the jth row, kth column entry
of matrix Rab .
As above, if the index of the population is denoted by $k$, $0 \le k < \kappa_i$,
in the time interval $[t_i, t_{i+1})$, then this population has index denoted by $k'$,
$0 \le k' < \kappa_{i-1}$, if it existed in the previous time interval $[t_{i-1}, t_i)$. Matrices
$R_{ab}(t_i) = \lim_{t \downarrow t_i} R_{ab}(t)$ and $R_{a'b'}(t_i^-) = \lim_{t \uparrow t_i} R_{a'b'}(t)$ indicate the joint
allelic distributions in the two populations immediately after and immediately
before the event at time $t_i$, respectively.
If a split event occurs at time ti , the allele on the chromosome in the
splitting population is inherited by two chromosomes, each in a different
progeny population. Hence for this case we obtain the following identity:
If the event that occurs at time $t_i$ is a merge, the allele in the merged
population is sampled from the two merging populations $x$ and $y$ with
respective probabilities $p$ and $q = 1 - p$, where
$p = N_{(i-1)x}(t_i^-)/[N_{(i-1)x}(t_i^-) + N_{(i-1)y}(t_i^-)]$. This results in the following formula for the joint distributions:
$$R_{ab}(t_i) = \begin{cases}
R_{a'b'}(t_i^-) & a \ne x,\ b \ne x \\
p R_{a'b'}(t_i^-) + q R_{yb'}(t_i^-) & a = x,\ b \ne x \\
p R_{a'b'}(t_i^-) + q R_{a'y}(t_i^-) & b = x,\ a \ne x \\
p^2 R_{xx}(t_i^-) + 2pq R^{+}_{xy}(t_i^-) + q^2 R_{yy}(t_i^-) & a = x,\ b = x,
\end{cases} \qquad (9)$$
where $R^{+}_{xy}(t) = [R_{xy}(t) + R_{yx}(t)]/2$.
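The merge rule can be sketched in code as follows. This is a minimal illustration under our reading of formula (9), in which the four cases are distinguished by whether each of the two sampled populations is the merged population x. Two further assumptions are ours: the surviving populations keep their relative order once y is dropped, and joint distributions are stored as nested lists indexed by previous population indices. The names `weighted_sum` and `merge_joint` are hypothetical.

```python
def weighted_sum(*terms):
    """Weighted sum of equally sized matrices given as (weight, matrix) pairs."""
    rows, cols = len(terms[0][1]), len(terms[0][1][0])
    return [[sum(w * m[r][c] for w, m in terms) for c in range(cols)]
            for r in range(rows)]

def merge_joint(r_prev, x, y, size_x, size_y):
    """Joint distributions after populations x and y merge into x.

    r_prev[a][b] is the joint allelic distribution matrix just before the
    event. Assumption of this sketch: y is dropped and the remaining
    populations keep their relative order.
    """
    p = size_x / (size_x + size_y)
    q = 1.0 - p
    kept = [k for k in range(len(r_prev)) if k != y]
    r_new = []
    for a in kept:
        row = []
        for b in kept:
            if a != x and b != x:
                m = weighted_sum((1.0, r_prev[a][b]))
            elif a == x and b != x:
                m = weighted_sum((p, r_prev[x][b]), (q, r_prev[y][b]))
            elif b == x and a != x:
                m = weighted_sum((p, r_prev[a][x]), (q, r_prev[a][y]))
            else:
                # Both alleles come from the merged population.
                sym = weighted_sum((0.5, r_prev[x][y]), (0.5, r_prev[y][x]))
                m = weighted_sum((p * p, r_prev[x][x]),
                                 (2 * p * q, sym),
                                 (q * q, r_prev[y][y]))
            row.append(m)
        r_new.append(row)
    return r_new

# Hypothetical input: three populations, two alleles.
same = [[0.4, 0.1], [0.1, 0.4]]
cross = [[0.25, 0.25], [0.25, 0.25]]
r_before = [[same, cross, cross], [cross, same, cross], [cross, cross, same]]
r_after = merge_joint(r_before, x=0, y=2, size_x=300.0, size_y=100.0)
```

Because the weights in every case sum to one, each resulting matrix remains a probability distribution whenever the inputs are.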
A single migration event from one population x to another y is considered
a merge of the whole destination population y with a part of the population x.
Only the distributions of the destination population are affected. Assuming
that the event that occurred at time $t_i$ is described by the migration matrix
$M(t_i)$, the size of the part of the population $x$ contributing to the event is
given by $m_{xy}(t_i) N_{(i-1)x}(t_i^-)$.
A migration event in the network describes all possible migrations between
pairs of populations in the network. Therefore, a single population may
be affected by a number of different migrations taking place at the same time.
This leads to complex relationships between the joint distributions characterizing
the populations involved in the migration events. We model such a migration
event using the following two-step scenario:
• Split each population in the network κi − 1 times in order to separate
all subpopulations migrating out of populations. The population size
ratio parameters used in the splits are given by the migration matrix
M(ti ).
• Merge the migrating subpopulations determined in the previous step
with proper destination populations. It is not necessary to apply any
particular ordering to these merges.
Suppose there are $\kappa_i$ populations present. Splitting each of them $\kappa_i - 1$ times
yields $\kappa_i^2$ populations, so $\kappa_i^2$ populations (and
$\kappa_i^4$ joint distributions) need to be stored simultaneously. To optimize this
method we may separately consider the subpopulations migrating from each
original population, and then apply the splits and merges scenario $\kappa_i$ times,
once for each determined joint subpopulation. Merge operations immediately
follow splits. As a result, at most $2\kappa_i + 1$ populations are stored at the same
time.
If the migration event changes the population sizes, the modified values
of the population size satisfy the following formula:
$$N_{ix}(t_i) = \Big(1 - \sum_{k=0}^{\kappa_i - 1} m_{xk}(t_i)\Big) N_{(i-1)x}(t_i^-) + \sum_{k=0}^{\kappa_i - 1} m_{kx}(t_i)\, N_{(i-1)k}(t_i^-) \qquad (10)$$
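The size update in formula (10) amounts to removing the emigrating fractions from each population and adding the immigrants arriving from all others. A minimal sketch follows, with a hypothetical two-population migration matrix; the function name is ours.

```python
def sizes_after_migration(sizes, m):
    """Apply formula (10): sizes[x] is the size just before the event and
    m[x][y] is the fraction of population x migrating to population y."""
    n = len(sizes)
    updated = []
    for x in range(n):
        staying = (1.0 - sum(m[x][k] for k in range(n))) * sizes[x]
        arriving = sum(m[k][x] * sizes[k] for k in range(n))
        updated.append(staying + arriving)
    return updated

# Hypothetical example: 10% of population 0 moves to 1, 40% of 1 moves to 0.
before = [1000.0, 500.0]
M = [[0.0, 0.1],
     [0.4, 0.0]]
after = sizes_after_migration(before, M)
```

Here population 0 keeps 900 individuals and receives 200, while population 1 keeps 300 and receives 100, so the total size is conserved.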
depends on the mutation model. We assume that the intensity matrix $Q$ is
sparse, with an average of $c \ll N_A$ nonzero values per row or per column.
In each RK4 step we need 12 matrix multiplications, each with complexity
$cN_A^2$, and about 60 other operations running in $cN_A$ time but requiring the
initialization of the matrix. Thus, $T = \kappa^2 k r (60 + 8c) N_A^2$, where $k$ is the number
of splits or merges and $r$ is the average number of steps in the RK4
algorithm for a single time interval. Usually $r < 100$, especially when we use an
adaptive step control algorithm. The method is feasible even for $N_A \approx 1000$
and $\kappa > 10$.
We need to store $n$ intensity matrices and $n^2$ joint distributions in the
algorithm, therefore $M = 8(n^2 + n)N_A^2$ bytes. As we see, the memory requirement
is manageable even for large realistic cases.
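The memory formula is easy to evaluate for concrete cases. The sketch below does this for two hypothetical configurations; the function name is ours, and reading the factor 8 as the size of a double-precision number is our interpretation.

```python
def memory_bytes(n_populations, n_alleles):
    """Memory estimate M = 8 (n^2 + n) N_A^2 bytes from the text.

    n intensity matrices plus n^2 joint distributions, each N_A x N_A,
    with 8 bytes per entry (assumed here to be a double).
    """
    return 8 * (n_populations ** 2 + n_populations) * n_alleles ** 2

large = memory_bytes(10, 1000)   # 10 populations, 1000 allelic states
small = memory_bytes(10, 32)     # 10 populations, 32 allelic states
```

For 10 populations this gives 880,000,000 bytes (about 0.9 GB) with 1000 allelic states and 901,120 bytes with 32 allelic states.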
scenario, but we vary the number of generations that follow the last split
over the 0 to 100,000 range. The execution time increases logarithmically as
a function of the simulation time.
The program that calculates the joint distributions for a demography
described by an input script file is available on the website
http://sun.aei.polsl.pl/~twojdyla/genpop/.
3. Results
In this section we present several examples of how the DN Model can
be applied to analyze data. In the first application we model a simple two-
population network and use our model to determine values of the linkage
disequilibrium. In the remaining applications we apply the method to the modeling
of ancestral genetic data.
3.2. Predictions and estimates of species and populations history.
3.2.1. Example 1. Cro-Magnoids, Neanderthals and Modern Europeans.
Availability of genetic data from different species and populations allows
various, often very sophisticated, intra- and inter-population analyses. Par-
ticularly, it is possible to estimate the past demography of these populations,
including interactions between populations. An example of a commonly used
model-based analysis [5] is provided by the works of Barbujani's group, who,
among others, studied the genetic relationship among Neanderthals (N),
Cro-Magnoids (CM) and modern Europeans (M) [44]. The authors calculated
the values of parameters such as pairwise difference or haplotype diversity
for samples drawn from populations simulated under about a dozen
hypothetical demographic scenarios, and compared these values to the data obtained
from DNA of the individuals in the sample. Simulations followed the
so-called serial coalescent [31], a reverse-time multipopulation algorithm. The area
of applicability of our demographic network model overlaps with that of the
serial coalescent. We apply the DN Model to the scenarios used by Barbujani's
group in paper [44] to obtain the estimates of pairwise differences among
populations. Details are relegated to the Appendix; results are in good
agreement with ref. [44].
calculated the FST values for simulated populations, and, finally, we esti-
mated model parameters based on these simulation-based values using the
DN Model. In the fastsimcoal2 simulations we used population structures
exactly as defined in our models. We also used such sample sizes as they
are given by the genealogic data. Overall, we performed 20 simulations of
the model describing agricultural spread. The number of simulations is low
in this case because of the necessity to manually run the Nelder-Mead local
search method multiple times for each simulation.
We determined, for each obtained set of the FST values, the range of model
parameter values that minimizes the error function δ in the way explained in
the Appendix. In Figure 5A-D we present, for each of the four model
parameters NA, NF, NHG and m, the number of simulations (out of
20) in which the obtained range of the values of a given parameter contains
a given parameter value, as a function of that value. We notice that the functions obtained for the
population sizes have peaks near the values obtained directly from the DN Model.
The migration rate is more widely spread and its peak values are lower than
our estimate (equal to 0.4). Nonetheless, the results support the hypothesis
of a nonzero migration rate between farmers and hunter-gatherers.
logroup of the non-Indo-European North-Eurasian Finns who entered Europe
during the Corded Ware period at the beginning of the 3rd millennium BC
[52]. Balts, after the split from other Indo-European groups, settled along the
south-eastern coast of the Baltic Sea and assimilated with the Finnic tribes
living there [53]. Our studies concern the influence the N1c1 haplogroup had
on the Balts and Slavs. The exact times of splits of Balts and Slavs, or Finns
from their ancestral populations are unknown, and estimating these times is
one of the objectives of the current example. Another interesting aspect of the
Balt-Slav relationship is migration between these two groups in the 6th
century, when Slavs appeared for the first time on the territory of Poland,
and in the 14th century, when the Commonwealth of Poland and Lithuania
(1384 - 1795) came into existence. Figure 4A illustrates the modeled de-
mographic scenario. We estimate the ancient population sizes of all groups
based on the assumption that human population growth rates are equal to
Kremer’s values [54] and assume that the population size between two ad-
jacent Kremer’s times changes according to an exponential function with a
single human generation assumed to last 25 years, with the resulting rates
listed on the left in Figure 4A. Following the approach of Barbujani's group [44], we
assume that the effective to census population size ratio in humans lies
between 0.3 [55] and 0.5 [56], and since considering the Y-STR haplotype
chromosome introduces a factor 1/4 to this ratio [57], giving a range of
about 0.075 to 0.125, we assume that the
effective population size is ten times smaller than the census size. These
estimates are in apparent contradiction to the recognized
estimates of the effective population size of modern humans, of the order of
$10^4$ to $10^5$ [58]. However, in the current setting, it seems more appropriate to
consider the "ecological" effective population sizes, as ref. [44] does.
We use Slatkin's RST [59] to quantify the distance between two
populations. We calculate the RST distance between Slavs and Balts based on
the data from 919 unrelated male Polish individuals sampled from six
geographical regions of Poland and 297 Balt descendants (152 from Vilnius and
145 from Riga). Genetic data at nine microsatellite loci (DYS19, DYS389I,
DYS389II, DYS390, DYS391, DYS392, DYS393, DYS385a and DYS385b)
are considered. The data can be obtained from ref. [60] or [61]. We use the
Arlequin program [62] to obtain a normalized [63] value of Slatkin's
distance. The RST value for the samples of Poles and Balts is equal
to 0.03862. Our aim is to adjust the parameters of the scenario in Figure 4A
to obtain a model explaining the obtained value of the RST distance.
Developing a version of the Lyapunov equation suitable for
microsatellite loci. It is known that under the Stepwise Mutation Model
(SMM), it is sufficient to model the differences in the number of tandem
repeats between loci rather than the absolute numbers [64]. Therefore,
distributions $R_{ab}(k, k+i)(t)$ become infinite vectors $(r_{ab}(i,t), i \in \mathbb{Z})$ with
components indexed by all integers and consecutive entries corresponding to these
differences. For practical application, we assume that the range of possible
differences is finite and that large differences are unlikely. The center entry $r_{ab}(0, t)$ of
the vector corresponds to the case when no change of the number of tandem
repeats occurred. As a result, we replace the Lyapunov equation (4) by the
following specialized equation:
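$$V_{xy}(t) = \sum_{i=-\infty}^{\infty} i^2\, r_{xy}(i,t) = \left[ \frac{\partial^2 \hat{R}_{xy}(s,t)}{\partial s^2} + \frac{\partial \hat{R}_{xy}(s,t)}{\partial s} \left( 1 - \frac{\partial \hat{R}_{xy}(s,t)}{\partial s} \right) \right]_{s \uparrow 1} \qquad (13)$$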
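The identity between the second moment of the repeat-number differences and the derivative expression can be checked numerically on a truncated distribution. In the sketch below we assume that the hatted function is the generating function of the differences, that is the sum of r(i)·s^i over all integers i, since its definition is not included in this excerpt. The example distribution and function names are hypothetical.

```python
def second_moment(r):
    """Sum of i^2 r(i) for a distribution given as {difference: probability}."""
    return sum(i * i * prob for i, prob in r.items())

def derivative_expression(r):
    """Evaluate R'' + R' (1 - R') at s = 1 for R(s) = sum_i r(i) s^i.

    Assumption of this sketch: R is the generating function of the
    differences. At s = 1, R' is the mean and R'' is the sum of
    i (i - 1) r(i).
    """
    first = sum(i * prob for i, prob in r.items())
    second = sum(i * (i - 1) * prob for i, prob in r.items())
    return second + first * (1.0 - first)

# Hypothetical symmetric distribution of repeat-number differences.
r_xy = {-2: 0.1, -1: 0.2, 0: 0.4, 1: 0.2, 2: 0.1}
v_direct = second_moment(r_xy)
v_from_derivatives = derivative_expression(r_xy)
```

The two values coincide when the mean difference is zero, as for the symmetric distribution above; for a distribution with nonzero mean the derivative expression returns the variance rather than the raw second moment.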
required value of Slatkin's distance may be obtained by more than one set
of parameter values. Therefore, although the estimates provide information
about common history of modeled groups, they are not sufficient to deter-
mine the exact scenario. The estimates depend on the effective population
sizes, which cannot be precisely estimated, but recently developed methods
indicate improvement in this area, especially if the genetic data are of good
quality [30].
We also considered a scenario according to which the Balts were not
admixed by Finns (consider Figure 4A leaving out the Finns). The RST
distance is presented in Figure 4E, as a function of the Balt-Slav split time
T1 . It is seen that in this case the RST distance is dramatically lower than
that computed based on microsatellite data.
To verify our method of estimating the RST distance between populations
we compare our results to the results generated by fastsimcoal2 [33]. We con-
sider three models: i) our Slavs-Balts model with parameter values m = 0.2,
T1 = −1500, T2 = −3500 and T3 = −15000, ii) population with constant
size of 3000 experiencing split into two populations of sizes 2000 and 1000
which happened g generations before the present time (g varies between 100
and 1000), and iii) the model from ii) with an additional migration event occurring
100 generations after the split. In each serial coalescent experiment we gen-
erate 10 loci samples and average the RST values over 20 random coalescent
genealogies. In each case, the difference in the RST estimates between the
two methods is lower than the standard deviation of the serial coalescent
results. In model i), in which we fitted the DN Model to the data-based
RST = 0.03862, we obtained from the serial coalescent method that the RST
value between Slavs and Balts in our Slavs-Balts demographic model is equal
to 0.0555 ± 0.0344 (Figure 4F). DN Model estimates calculated for the sec-
ond model for g = 1000 give RST = 0.456, while from fastsimcoal2 we obtain
0.396 ± 0.263. Introducing the migration event decreases the RST values to
0.336 using the DN Model and to 0.283 ± 0.211 for fastsimcoal2.
Additionally, we performed a bootstrap analysis of the Slavs-Balts model,
in which we assume that times of the splits estimated in the model are
equal to the previously obtained values (T1 = −1500, T2 = −3500 and
T3 = −15000) and under such assumptions we calculate migration rate m.
Figure 5E presents the cumulative number of occurrences of that migration
rate, for a range of migration values discretized using a 0.01 step, in a
sample of 100 simulations. The median value of the migration rate is equal to
0.17 and is close to the value estimated originally by the DN Model (which was
0.2). Moreover, for most simulated demographies, the value of the estimated
migration rate is between 0.05 and 0.3.
4. Discussion
In this paper, we introduced mathematics and applications of the demo-
graphic networks. The main advantage of the DN Model is its ability to
efficiently compute the joint allelic distributions in pairs of haploid genomes
when more than two populations are involved. By comparison, in the
framework of diffusion approximations, which provide another frequently
used methodology (cf. Poisson random fields [68]), such computations
frequently seem less straightforward.
Returning to the mathematical core of the DN Model, we may list a
number of limitations. Our model, as of now, concerns neutral evolution only
and does not involve recombination. In addition, the mathematics will be
substantially more complicated if distributions not of pairs, but of larger
samples of individuals, are modeled. The computational complexity in such
cases remains to be explored. Recombination has not been incorporated in
the algorithm, although it may be drawn into the picture for example by
extending the algorithm of Polanska and Kimmel [69, 19]. Finally, modeling
of a wider range of different mutation patterns is to be perfected. This
concerns, among others, modeling of data from genome sequencing. Selection
will most likely be difficult to take into account. However, even modeling of
benchmark neutral scenarios may be a helpful alternative to existing simula-
tion or exact methods. This is especially true if the network contains many
populations.
We provided three examples of application of the DN Model. Example 1
concerns evolutionary and demographic scenarios linking Cro-Magnon peo-
ple, Modern Humans and Neanderthals which were previously considered by
Belle et al. [44] using serial coalescent techniques. We show that
demographic networks produce results similar to those of the serial coalescent (Figure
6E).
In Example 2, we study the genetic relationships between hunter-gatherers
and farmers in early Neolithic Europe based on the mtDNA genetic data
obtained from ancient human remains of both groups found on the territory of
modern Germany. Our results are consistent with a high migration rate
(m = 0.4) between those populations. This contradicts the alternative
theory claiming that a farming lifestyle in Europe spread solely by technological
transmission. Low, but slightly higher, values of the error function used in
our model were also obtained for m > 0.8, which suggests that the ancient
genetic material of hunter-gatherers might have been replaced by that of the
farmers almost completely. The general outcome is consistent with the results
obtained previously in ref. [70] and [10] using different approaches.
Example 3, which is original, concerns the little-known demographic ge-
netics of the peoples of Central Eastern Europe: Slavs, Balts and Finns.
Analysis of a demographic network model applied to Y-chromosome mi-
crosatellites indicates that it is necessary to consider the admixture of the
Proto-Finns into the Balts to explain the genetic distance between the
latter and the Slavs (as represented here by Poles). This result is obtained
using the documented and hypothetical timings of historical and prehistori-
cal events and a search in parameter space. Despite a number of unknowns,
the assertion appears robust.
Benchmark results indicate the main areas of use of our model. Our
approach is useful in modeling the evolution of non-recombining genetic ma-
terial from many populations organised in a complex demographic network,
especially when the number of possible allele variants is small. This includes
using mtDNA or Y chromosome data represented either by short haplotype
sequences or microsatellite loci. In such cases our model can be applied for
complex scenarios.
Acknowledgments
A significant part of the research was conducted during Tomasz
Wojdyła's visits to the Department of Statistics at Rice University. This
work was supported by the Polish National Science Center grant
DEC-2012/04/A/ST7/00353 (T.W.) and the NCBiR grant POIG 02.03.01-24-099
(M.K.).
Figure 1: Benchmarks. (A) Influence of the allelic state space size on execution times
of the DN Model. Single population with constant size, evolving over 10,000 generations,
with different sizes of the allelic state space $N_A$. The figure presents execution times of
the DN Model run on an average personal computer. (B) Influence of the number of
populations in the network on DN Model execution times. The ancestral population keeps
splitting into two populations after each 100 generations until it reaches the total number
of $n$ populations. After that, all $n$ populations evolve separately. The scenario covers
10,000 generations of the network evolution. We consider two different sizes of the allelic
state space, $N_A = 2$ and $N_A = 32$. Increasing the number of populations $x$ times results in
an increase of the execution time $x^2$ times. (C) Influence of the model time on execution
time of the DN Model. The ancestral population keeps splitting into two populations after each
100 generations, reaching the total number of 10 populations. After that, all populations
evolve separately for a given number of generations. Increasing the number of generations
10 times leads to doubling of the execution time.
tion; the split event occurs at time t = 0. The ancestral population is assumed to be in a
mutation-drift equilibrium.
Figure 4: Balts-Slavs-Finns example. (A) Demographic model. The model includes:
Eurasian Ancestors (EA), Indo-Europeans (I), Finns (F), Slavs (S), Balts (B) and Poles
(P). Population sizes before the year 1AC are estimated using Kremer [54] rates. Values
listed on the left are growth rates per generation with a generation of 25 years. Two
bottleneck events indicate: (i) S and B tribes leaving the Indo-European homeland and
(ii) emergence of P from S. The size of the bottlenecks only slightly changes the value of
RST provided that the population after the bottleneck is at most 1/3 of the original size.
We assume the following values of the bottleneck size ratios: 1/5 in the case of SB and 1/7
for P. This is also consistent with neglecting the bottleneck when I and F split off from
the ancestral EA. Four parameters are varied: (i) T1 – time of split of B and S, (ii) T2 –
time of BS leaving the Indo-European homeland (iii) T3 – time of split of I and F, and (iv)
m – migration rate between B and P. The exact time of the migration does not influence
the RST value. (B-D) RST distance between Slavs and Balts as a function of: (B) the
Balt-Slav split time T1 , (C) the time T2 when Balts and Slavs left the Indo-European
homeland and (D) the Finn-Indo-European split time T3 . Populations evolve according
to the model presented in panel (A) with fixed times set to T1 = −1500, T2 = −3500
and T3 = −15000. Results for four different values of the migration rate m are depicted.
Dotted line is the data-based estimate of RST . (E) Impact of Proto-Finns on the RST
distance between Slavs and Balts. RST distance between Slavs and Balts as a function of
the Balt-Slav split time T1 . Populations evolve according to the model presented in panel
(A) with T2 = −3500 but without Proto-Finns population. Results for four different values
of the migration rate m are depicted. Discarding Proto-Finns substantially decreases RST
estimates. (F) Empirical cumulative distribution (ECD) of the fastsimcoal2 Monte-Carlo
estimates of RST distance between Slavs and Balts. 20 experiments estimating the RST
distance between Slavs and Balts according to the demographic model of panel (A) have
been run using fastsimcoal2. The figure presents empirical cumulative distribution of the
results. Averaging over all experiments, we obtain RST = 0.0555 ± 0.0344. The model
distance is equal to 0.03862.
Figure 5: Bootstrap analysis of the obtained results. (A-D) Bootstrap analysis of the
agricultural spread model. 20 demographies have been generated using fastsimcoal2 ac-
cording to the demographic model in Figure 3. The Nelder-Mead algorithm has been used
repeatedly for each demography and 4 model parameter values have been estimated (each
parameter value in each experiment is described by the range). The 4 model parameters
are: ancient population size NA , farmer population size NF , hunter-gatherer population
size NHG and migration rate between farmers and hunters-gatherers. The figure presents,
for each of those parameters, number of demographies supporting particular parameter
value. (E) Bootstrap analysis of the Balts-Slavs-Finns model. 100 demographies have been
generated using fastsimcoal2 according to the demographic model in Figure 4A. Then, DN
Model has been used for each demography to estimate migration rate between Slavs and
Balts. Figure presents empirical cumulative distribution of demographies supporting mi-
gration rates discretized using 0.01 step on the horizontal axis.
Figure 6: Europeans - Cro-Magnoids - Neanderthals example. (A-D) Types of joint
demographic models of modern Europeans (M) - Cro-Magnoid (CM) - Neanderthals (N
and N’). The numbers at the left are times in human generations (25 years per generation)
counted backwards from the present. The models used in the paper are: L1.1, L1.2 and
H1.1 (A), L1.5 (B), L1.3, L1.4, H1.2 and H1.3 (C), L1.7 (D) [44]. The detailed values of the
growth rates and population sizes for each model may be found in [44]. (E) Comparison
of two methods: serial coalescent and demographic networks. The figure presents pairwise
difference obtained by applying serial coalescent and DN Model to three populations: Cro-
Magnoid (full circles), Neanderthal (open circles) and modern European (crosses) under
different demographic scenarios (listed in the legend and explained in ref. [44]). Infinite-
site model, approximated by haplotypes of 360 nucleotides, was assumed.
Table 2: FST distances for several Near Eastern and European populations. The
table contains the values of the FST distance among four populations: ancient hunters-
gatherers (HG) found on the territory of modern Germany, ancient farmers (LBK) from
graveyard in Derenburg in Germany, modern Central Europeans (CE) and modern Near
Easterners (NE). The values are estimated using mtDNA genetic data from the HVS-I
region.
        HG        LBK       CE        NE
HG      *
LBK     0.09298   *
CE      0.03445   0.03958   *
NE      0.04192   0.03019   0.00939   *
Appendix A.
Example 1
The models used in our calculations are taken from Figure 1 in ref.
[44]. All the models assume that the Neanderthals (N) and Cro-Magnoids
(CM) lived 1700 and 960 generations ago, respectively, with generation time
assumed equal to 25 years. All the models except L1.7 assume a single
population with different growth rates. Model L1.1 is a constant-size population.
In L1.2, the population grows after the origin of CM. Models L1.3 and L1.4
add to L1.2 a small growth rate before the origin of N; rapid expansion
from CM to Moderns (M) is assumed in L1.4. In L1.5 the population grows
to the same large size as in L1.4, but with more balanced growth rates in
each time interval. L1.7 models the population from L1.5 with the assumption
that there existed a separate shrinking N population. Models labeled
H assume a mutation rate ten times larger (0.5 per million years per
nucleotide instead of 0.05 as in models starting with L). The demography of H1.1
is the same as in L1.2. The demographies of H1.2 and H1.3 are slightly different
versions of L1.3. We summarize all of these demographies in Figure 6A-D.
For more details see ref. [44]. Both models assume the infinite-site mutation model
approximated by a long haplotype sequence of 360 nucleotides with two possible
variants at each position. Given the joint distribution Rxx (t), we can
calculate the mean pairwise difference ϕx (t) in population x according to the
following formula:
$$\phi_x(t) = \sum_{i,j \in A} d(i,j)\, R_{xx}(i,j)(t), \qquad \text{(A.1)}$$
where d(i, j) is the number of positions at which the sequences of allelic
types i and j differ. Table 1 and Figure 6E compare our results to those
obtained by Belle and co-workers in ref. [44] using the serial coalescent approach.
Discrepancies between the simulations and our method, which are largest for the
CM population, are expected given the very small CM sample size
(only 2 individuals) and the fact that the median (not the mean) value was listed
by Belle et al. The sample sizes of the N and M populations used in the serial
coalescent approach were equal to 6 and 558, respectively.
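As a minimal sketch of how formula (A.1) can be evaluated numerically, the
Python snippet below sums d(i, j) R_xx(i, j) over all ordered pairs of
haplotypes. The function names, the dictionary representation of R_xx and the
toy two-site distribution are illustrative assumptions and are not taken from
DN Model.

```python
from itertools import product


def hamming(seq_i, seq_j):
    """d(i, j): number of positions at which two equal-length haplotypes differ."""
    if len(seq_i) != len(seq_j):
        raise ValueError("haplotypes must have equal length")
    return sum(a != b for a, b in zip(seq_i, seq_j))


def mean_pairwise_difference(joint):
    """Evaluate (A.1): sum over ordered pairs (i, j) of d(i, j) * R_xx(i, j).

    `joint` is a dict mapping ordered haplotype pairs (i, j) to probabilities
    (a hypothetical representation of R_xx at a fixed time t).
    """
    total = sum(joint.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("joint distribution must sum to 1")
    return sum(hamming(i, j) * p for (i, j), p in joint.items())


# Toy input (assumption): two-site binary haplotypes, both individuals drawn
# independently and uniformly, so every ordered pair has probability 1/16.
haplotypes = ["".join(bits) for bits in product("01", repeat=2)]
joint = {(i, j): 1.0 / 16 for i in haplotypes for j in haplotypes}

phi = mean_pairwise_difference(joint)
print(phi)
```

In this toy case each of the two sites differs with probability 1/2, so the
mean pairwise difference equals 1. For the 360-nucleotide haplotypes used in
Example 1 the full set of allelic types is far too large to enumerate this way;
the sketch only illustrates the structure of the sum.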
Example 2
The data we use are predominantly from ref. [10], complemented by other
sources. As the Hunter-Gatherer (HG) population we used 20 individuals of
post-LGM (Last Glacial Maximum) hunters-gatherers with mtDNA haplogroup
U found on the territory of modern Germany [70]. It is assumed that these
individuals lived about 13,000-10,000 years ago. The genetic data of the
Farmer (F) population were mostly obtained from the Derenburg graveyard
in Germany [71, 10] and consist of mtDNA samples of 42 farmers from the Neolithic
Linear Pottery Culture (LBK). The age of the graveyard has been estimated
to be about 7,200 years. The area of the LBK consists of territories close to
the middle Danube, the upper and middle Elbe, and the upper and middle
Rhine. Two remaining populations are 1030 modern Central Europeans (CE)
from the LBK core area and 737 modern Near Easterners (NE) from Anatolia.
The values of the FST distance between each pair of populations have been
obtained from [70] and [10] and are summarized in Table 2.
We assume that before the LGM the ancestors of HG had separated
from the common ancestral population (A) and migrated to Europe (Figure
3). Then, about 11,000 years ago the farmers started to spread in Europe
mixing with the HG individuals at the total migration rate m = m1 + m2 ,
where m1 stands for migration before the time of graveyard in Derenburg
and m2 after this time. Post-neolithic demography is modeled by CE-NE
admixture at rate M at the present time with the fraction of the proper
mtDNA haplogroup U distribution in modern Europe assumed to be equal to
15%. Population size growth rates before the year 2000 BP are as estimated
by Kremer [54], and after that year we assume exponential growth of the
haplogroup U carriers up to the 15% of the CE and NE population sizes
at the present time. As a mutation model we use a simple one-base two-
allele haplotype model with recurrent mutations and mutation rate equal to
3 × 10^−5 per base per generation [72]. We also assume that the ancestral
population was in drift-mutation equilibrium at the time of the split of
the HG population. The value of FST at time t between populations
x and y may be calculated using the joint distributions Rxx (t), Rxy (t) and
Ryy (t) as follows:
$$F_{ST} = \frac{R_{xx}(a,A)(t) + R_{xx}(A,a)(t) + R_{yy}(a,A)(t) + R_{yy}(A,a)(t)}{2\left(R_{xy}(a,A)(t) + R_{xy}(A,a)(t)\right)}, \qquad \text{(A.2)}$$
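The right-hand side of (A.2) can be evaluated directly from the three joint
distributions. The Python sketch below is a hedged illustration of that
calculation, not the DN Model code: the function names, the dictionary
representation of Rxx, Ryy and Rxy, and the helper that builds a joint
distribution from two allele frequencies (assuming independent sampling) are
all hypothetical.

```python
def heterozygosity(joint, a="a", A="A"):
    """Probability that the two sampled individuals carry different alleles."""
    return joint[(a, A)] + joint[(A, a)]


def a2_right_hand_side(r_xx, r_yy, r_xy):
    """Evaluate the expression in (A.2) as printed:
    within-population terms divided by twice the between-population term."""
    within = heterozygosity(r_xx) + heterozygosity(r_yy)
    between = 2.0 * heterozygosity(r_xy)
    if between == 0.0:
        raise ValueError("between-population heterozygosity is zero")
    return within / between


def pair_distribution(p, q):
    """Hypothetical helper: joint distribution of two independently sampled
    individuals, the first carrying allele 'a' with frequency p and the
    second with frequency q."""
    return {
        ("a", "a"): p * q,
        ("a", "A"): p * (1.0 - q),
        ("A", "a"): (1.0 - p) * q,
        ("A", "A"): (1.0 - p) * (1.0 - q),
    }


# Toy allele frequencies (assumptions): 0.2 in population x, 0.6 in population y.
r_xx = pair_distribution(0.2, 0.2)
r_yy = pair_distribution(0.6, 0.6)
r_xy = pair_distribution(0.2, 0.6)

value = a2_right_hand_side(r_xx, r_yy, r_xy)
print(value)
```

The sketch evaluates the expression exactly as it appears in (A.2). Readers
comparing it with the commonly used heterozygosity-based estimator should note
that the latter is one minus this ratio of within- to between-population
heterozygosity; the sketch does not decide which form is intended here.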
References
[1] R. E. Green, J. Krause, A. W. Briggs, T. Maricic, U. Stenzel, et al., A
draft sequence of the Neandertal genome, Science 328 (2010) 710–722.
doi:10.1126/science.1188021.
[2] M. Rasmussen, Y. Li, S. Lindgreen, J. S. Pedersen, A. Albrechtsen,
et al., Ancient human genome sequence of an extinct Palaeo-Eskimo,
Nature 463 (2010) 757–762. doi:10.1038/nature08835.
[3] M. Meyer, M. Kircher, M.-T. Gansauge, H. Li, F. Racimo, et al., A
high-coverage genome sequence from an archaic Denisovan individual,
Science 338 (2012) 222–226. doi:10.1126/science.1224344.
[4] J. Hey, C. A. Machado, The study of structured populations – new hope
for a difficult and divided science, Nature Reviews Genetics 4 (2003)
535–543. doi:10.1038/nrg1112.
[5] M. Stoneking, J. Krause, Learning about human population history from
ancient and modern genomes, Nature Reviews Genetics 12 (2011) 603–
614. doi:10.1038/nrg3029.
[6] M. A. Beaumont, B. Rannala, The Bayesian revolution in genetics, Na-
ture Reviews Genetics 5 (2004) 251–261. doi:10.1038/nrg1318.
[7] A. Wollstein, O. Lao, C. R. Becker, S. Bauer, R. J. Trent, et al., De-
mographic history of Oceania inferred from genome-wide data, Current
Biology 20 (2010) 1983–1992. doi:10.1016/j.cub.2010.10.040.
[8] J. Hox, Multilevel Analysis: Techniques and Applications, Lawrence
Erlbaum Associates, New Jersey, 2002.
[9] M. Arenas, J. S. Lopes, M. A. Beaumont, D. Posada, Codabc:
A computational framework to coestimate recombination, substitu-
tion, and molecular adaptation rates by approximate Bayesian com-
putation, Molecular Biology and Evolution 32(4) (2015) 1109–1112.
doi:10.1093/molbev/msu411.
[10] W. Haak, O. Balanovsky, J. J. Sanchez, S. Koshel, V. Zaporozhchenko,
et al., Ancient DNA from European Early Neolithic farmers reveals
their Near Eastern affinities, PLoS Biology 8(11) (2010) e1000536.
doi:10.1371/journal.pbio.1000536.
[11] G. Hellenthal, A. Auton, D. Falush, Inferring human colonization his-
tory using a copying model, PLoS Genetics 4(5) (2008) e1000078.
doi:10.1371/journal.pgen.1000078.
from multidimensional SNP frequency data, PLoS Genetics 5(10) (2009)
e1000695. doi:10.1371/journal.pgen.1000695.
[30] H. Li, R. Durbin, Inference of human population history from
individual whole-genome sequences, Nature 475 (2011) 493–496.
doi:10.1038/nature10231.
[41] W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, Numer-
ical recipes in C - the art of scientific computing, Cambridge University
Press, 1988.
[51] E. group, Origins, age, spread and ethnic association of european hap-
logroups and subclades, http://www.eupedia.com/genetics/.
[53] H. L. Thomas, Archaeology and Indo-European comparative linguistics,
Reconstructing languages and cultures 58 (1992) 281–316.
[64] M. Kimmel, R. Chakraborty, Measures of variation at DNA repeat loci
under a General Stepwise Mutation Model, Theoretical Population Bi-
ology 50(3) (1996) 345–367. doi:10.1006/tpbi.1996.0035.
[71] W. Haak, et al., Ancient DNA from the first European farm-
ers in 7500-year-old Neolithic sites, Science 310 (2005) 1016–1018.
doi:10.1126/science.1118725.