
Accepted Manuscript

Genetic demographic networks: Mathematical model and applications

Marek Kimmel, Tomasz Wojdyła

PII: S0040-5809(16)30025-9
DOI: http://dx.doi.org/10.1016/j.tpb.2016.06.004
Reference: YTPBI 2539

To appear in: Theoretical Population Biology

Received date: 7 June 2015

Please cite this article as: Kimmel, M., Wojdyła, T., Genetic demographic networks:
Mathematical model and applications. Theoretical Population Biology (2016),
http://dx.doi.org/10.1016/j.tpb.2016.06.004


Genetic Demographic Networks: Mathematical Model and Applications

Marek Kimmel (a,∗), Tomasz Wojdyła (b)

(a) Department of Statistics, Rice University, 6100 Main Street, Houston, TX 77005, USA
and Systems Engineering Group, Silesian University of Technology, Akademicka 16,
44-100 Gliwice, Poland
(b) Institute of Automatic Control, Silesian University of Technology, Akademicka 16,
44-100 Gliwice, Poland

Abstract
Recent improvement in the quality of genetic data obtained from extinct
human populations and their ancestors encourages searching for answers to
basic questions regarding human population history. The most common and
successful are model-based approaches, in which genetic data are compared
to the data obtained from the assumed demography model. Using such an
approach, it is possible to either validate or adjust the assumed demography.
Model fit to data can be obtained based on reverse-time coalescent simulations
or forward-time simulations. In this paper we introduce a computational
method based on a mathematical equation that allows obtaining joint
distributions of pairs of individuals under a specified demography model, each
of them characterised by a genetic variant at a chosen locus. The two
individuals are randomly sampled from either the same or two different
populations. The model assumes three types of demographic events (split, merge
and migration). Populations evolve according to the time-continuous Moran
model with drift and Markov-process mutation. This latter process is described
by the Lyapunov-type equation introduced by O’Brien and generalized in our
previous works. Application of this equation constitutes an original
contribution. In the Results section of the paper we present sample
applications of our model to both simulated and literature-based demographies.
Among others, we include a study of the Slavs-Balts-Finns genetic
relationship, in which we model split and migrations between the Balts and
Slavs. We also include another example that involves the migration rates
between farmers and hunter-gatherers, based on modern and ancient DNA samples.
This latter process was previously studied using coalescent simulations. Our
results are in general agreement with the previous method, which provides
validation of our approach. Although our model is not an alternative to
simulation methods in the practical sense, it provides an algorithm to compute
pairwise distributions of alleles, in the case of haploid non-recombining loci
such as mitochondrial and Y-chromosome loci in humans.
Keywords: Demographic Networks, Gene Genealogies, Allele Joint
Distributions, Finns-Balts-Slavs History, Farming Expansion,
Cro-Magnoids and Neanderthals

∗ Corresponding author. Phone: 713-348-5255, Fax: 713-348-5476
Email addresses: kimmel@rice.edu (Marek Kimmel), tomasz.wojdyla@gmail.com
(Tomasz Wojdyła)
Preprint submitted to Theoretical Population Biology, June 20, 2016

1. Introduction
With the increasing availability of genome-wide data, including data from
extinct species, new questions about the history of species and populations
are being asked and answered. Only in recent years have genome sequences been
obtained from three main hominins: Neanderthals [1], ancient modern humans [2]
and high-quality 30-fold coverage of Denisovans [3]. These data preserve
traces of information about past events occurring in the population history,
such as founder effects, bottlenecks, migrations, admixtures and so forth.
Historically, there were two distinct approaches [4] to retrieving this
knowledge: the phylogeographic approach based on gene-tree analysis, and the
summary-statistics approach, which concentrated on comparing some aspect of
the data, such as the numbers of variable sites within or between populations.
More recently, model-based analysis [5] has combined both approaches. In this
approach, estimates of the parameters obtained from assumed demographies
are directly compared to real genetic data. An important version of this
approach is the reverse-time coalescent approach. In its Bayesian version,
the posterior probability of the model parameters given the data can be
obtained, and in this way different models can be compared. This approach has
been widely developed over recent years [6, 7]. Bayesian methods are
frequently computationally intensive [8]. Nevertheless, Approximate Bayesian
Computation (ABC) has been widely used to study models with more than three
populations and generally high complexity [9].

Over the last few years, many theories regarding ancient human history
have been tested using the methods described. For example, the authors of
ref. [10] discuss the ancestry of the Eastern Europe Neolithic farmers and
infer their Near East affinities. In ref. [11], different models of human
colonization history were tested. As another example, genetic material of an
archaic hominin was found in the Denisova Cave in Siberia, and its genetic
history and relationship to modern humans from several populations, as well as
to Neanderthals, were discussed in refs. [12, 3] to infer human population
history from individual whole-genome sequences.
One approach to estimating the demography parameters uses model-based
samples generated via forward-time generation-to-generation or backward-time
coalescent-based computer simulations. An extensive review of such programs
was presented in ref. [13], and more recently in ref. [14], which lists the
following three simulation platforms as the leading ones: quantiNEMO [15],
ForSim [16] and simuPop [17]. The results obtained from simulations need to be
averaged over many runs in order to correctly estimate population-specific
values.
Here, we introduce a computational Demographic Network (DN) Model
that allows developing predictions of bivariate allele distributions in pairs
of individuals randomly sampled from the same or two different populations,
given a potentially complicated demography. The model assumes three types of
demographic events (split, merge and migration). Populations evolve according
to the time-continuous Moran model with drift and Markov-process mutation.
This latter process is described by the Lyapunov-type equation introduced by
O’Brien and generalized in our previous works; for references see the recent
short monograph [18]. This equation has been generalized in several
directions, including a process with recombination [19]. In the current
version of the DN Model, recombination is not taken into account. In some
applications, the DN Model may complement the simulation models. Although the
DN Model is not an alternative to simulation methods in the practical sense,
it provides an algorithm to compute pairwise distributions of alleles in the
case of haploid non-recombining loci, such as mitochondrial and Y-chromosome
loci in humans. The DN Model leads to exact pairwise distributions, and
usually runs very fast.
The DN Model is not intended to replace the demographic inference
procedures that infer demography from the site frequency spectrum and can
summarize whole-genome data from multiple populations. However, it is still
interesting to compare the performance of the DN Model with other methods,
which have been summarized, for example, in ref. [20] or [13]. Many popular
methods developed so far limit, for computational reasons, the number of
populations in the model to a single population (BEAST [21]), two (IM [22],
IMa [23], [24]) or three (∂a∂i [20]) populations. Other methods with multiple
populations allow no subsequent migration after subpopulations split [25].
Methods that consider multiple populations with migration often assume a
limited population size growth scenario (LAMARC [26], MIGRATE-N [27]) or
mutation model (no microsatellite model in BEAST [21], or a constant mutation
rate [26]). Some approaches allow using only a limited set of summary
statistics [28] or are computationally very intensive ([29], or [26] for a
number of populations greater than 3). The authors of the Denisovans paper
[12] used Li and Durbin’s method based on the PSMC model [30]. The approach we
use for comparison is simCoal [31] and its new versions fastsimcoal [32] and
fastsimcoal2 [33]. Our approach allows flexibility in the definition of the
mutation model (see Example 3, in which we consider the microsatellite model)
and of the population growth scenario (we can accommodate any realistic
scenario). The DN Model can be coupled with a local search algorithm, such as
the Nelder-Mead search algorithm used in Example 2, to draw conclusions
about the early Neolithic transition from hunting-gathering to farming.

2. Methods and models


We consider a demographic network of populations evolving from an an-
cestral population. Evolution in the network begins at time t0 = 0 and con-
tinues in forward time. The network includes three types of discrete events:
merges of two populations into one, splits of a single population into two
and migrations between pairs (possibly all) of populations in the network
(Sections 2.2 and 2.3). Between the events, the populations evolve according
to a time-continuous version of the Wright-Fisher model, further on called
the Moran model as described in Section 2.1.

2.1. Evolution of the network distributions between the events.


Between the events, populations evolve according to the time-continuous
Moran model with drift and Markov-process mutation. This latter process is
described by the Lyapunov-type equation introduced by O’Brien and
generalized in [34]; see also the brief monograph by Bobrowski and Kimmel
[18]. The current section follows ref. [34]. We assume that each chromosome
evolves under genetic drift and mutation between two consecutive network
events. We assume that the allelic state $X_a(t) \in A$ of the chromosome
sampled from population $a$ at time $t$ evolves as a time-continuous
non-negative Markov chain with transition intensity matrix
$Q_a = \{q^{(a)}_{jk}\}$, $1 \le j, k \le N_A$, where $q_{jk} \ge 0$ for
$j \ne k$ and $\sum_k q_{jk} = 0$ for all $j$. We denote
$R_{ab}(j,k)(t) = P[X_a(t) = j, X_b(t) = k]$, where $j, k \in A$ and
$0 \le a, b < \kappa_i$ if $t \in [t_i, t_{i+1})$. We assume that the matrix
$Q_a$ remains unchanged between two demographic events but may vary among
different populations or different time intervals (the state space remaining
the same). By $P_a(t) = \{p^{(a)}_{jk}(t)\}$, $1 \le j, k \le N_A$, we denote
the transition probability matrix corresponding to matrix $Q_a$. In the
finite-dimensional case (if $A$ is finite) we obtain $P_a(t) = e^{Q_a t}$. The
same holds for the uniform denumerable case, as explained in Section 6.10 of
ref. [35].
Let us first consider two alleles randomly drawn from the same population
$a$, and assume that the MRCA of these two individuals with allelic types
$a_j$ and $a_k$ ($a_j, a_k \in A$, $1 \le j, k \le N_A$) existed at time
$T_{jk}$ before the present time $t$. Based on the Kingman-Moran coalescent
[36], with time-variable population size $N_a(t)$ we obtain that
$P[T_{jk} > \tau] = e^{-\int_0^{\tau} \frac{du}{N_a(t-u)}}$. The MRCA can be
of any allelic type $a_i$ with probability
$\pi_i(t - T_{jk}) = P[X(t - T_{jk}) = a_i]$ and its descendants at the
present time $t$ have types $a_j$ and $a_k$, respectively. Then, summing over
all possible values of $i$ and integrating over the range of $T_{jk}$ results
in [34]:

\[
R_{aa}(j,k)(t) = \int_0^{\infty} \sum_{1 \le i \le N_A}
  \pi_i(t-\tau)\, p_{ij}(\tau)\, p_{ik}(\tau)\,
  \frac{1}{N_a(t-\tau)}\, e^{-\int_{t-\tau}^{t} \frac{du}{N_a(u)}}\, d\tau
\tag{1}
\]

Expression (1) may be transformed into matrix notation and we can separate
the evolution of the population in the time interval before $t = 0$ and
interpret it as the initial conditions [34]. This leads to the following
equation:

\[
R_{aa}(t) = P^T(t)\, R(0)\, P(t)\, e^{-\int_0^{t} \frac{du}{N_a(u)}}
  + \int_0^{t} P^T(t-\tau)\, \Pi(\tau)\, P(t-\tau)\,
    \frac{1}{N_a(\tau)}\, e^{-\int_{\tau}^{t} \frac{du}{N_a(u)}}\, d\tau,
\tag{2}
\]

where $R_{aa}(t) = \{R_{aa}(j,k)(t)\}$, $P^T$ is the transpose of the matrix
$P$ and $\Pi(t)$ is a diagonal matrix with $\Pi(t)_{ii} = \pi_i(t)$.
The case when the two alleles are drawn from different populations $a$ and
$b$ is simpler as there is no coalescence. In summary,

\[
R_{ab}(t) =
\begin{cases}
P_a^T(t-t_i)\, R_{ab}(t_i)\, P_b(t-t_i)\, e^{-\int_{t_i}^{t} \frac{du}{N_a(u)}}
  + S_a(t_i, t) & a = b \\
P_a^T(t-t_i)\, R_{ab}(t_i)\, P_b(t-t_i) & a \ne b,
\end{cases}
\tag{3}
\]

where
$S_a(t_i, t) = \int_{t_i}^{t} P_a^T(t-\tau)\, \Pi(\tau)\, P_a(t-\tau)\,
\frac{1}{N_a(\tau)}\, e^{-\int_{\tau}^{t} \frac{du}{N_a(u)}}\, d\tau$.
$R_{ab}(t)$ given by expression (3) is a mild solution [37] of the following
matrix differential equation related to the Lyapunov equation [38]:

\[
\frac{dR_{ab}(t)}{dt} = Q_a^T R_{ab}(t) + R_{ab}(t)\, Q_b
  + \frac{\delta_{ab}}{N_a(t)} \left( \Pi(t) - R_{ab}(t) \right),
\tag{4}
\]

where $t \in [t_i, t_{i+1})$, $0 \le a, b < \kappa_i$ and $\delta_{ab}$ is the
Kronecker delta.


We will use equation (4) rather than expressions (3) to evaluate the joint
distribution in the time interval between two adjacent events in the network.
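
As an illustration only, the following minimal sketch integrates equation (4) for a single population ($a = b$) with two alleles. It is not the authors' implementation: it uses a fixed-step classical RK4 scheme in plain Python, a symmetric two-allele intensity matrix, a constant population size, and the marginal distribution $\pi(t)$ advanced by $d\pi/dt = Q^T\pi$; the function names `deriv`, `rk4_step` and `evolve_same_population`, and the numerical values, are assumptions made for the example.

```python
def deriv(state, Q, N, n):
    """Right-hand side of equation (4) for a = b, plus d(pi)/dt = Q^T pi.

    state holds the n*n entries of R (row-major) followed by the n entries of pi.
    """
    R = [state[j * n:(j + 1) * n] for j in range(n)]
    pi = state[n * n:]
    dR = []
    for j in range(n):
        for k in range(n):
            s = sum(Q[i][j] * R[i][k] for i in range(n))   # (Q^T R)_{jk}
            s += sum(R[j][i] * Q[i][k] for i in range(n))  # (R Q)_{jk}
            # drift term (Pi - R)/N, with Pi = diag(pi)
            s += ((pi[j] if j == k else 0.0) - R[j][k]) / N
            dR.append(s)
    dpi = [sum(Q[i][j] * pi[i] for i in range(n)) for j in range(n)]
    return dR + dpi


def rk4_step(state, h, Q, N, n):
    """One classical fourth-order Runge-Kutta step of size h (no adaptive control)."""
    k1 = deriv(state, Q, N, n)
    k2 = deriv([s + 0.5 * h * d for s, d in zip(state, k1)], Q, N, n)
    k3 = deriv([s + 0.5 * h * d for s, d in zip(state, k2)], Q, N, n)
    k4 = deriv([s + h * d for s, d in zip(state, k3)], Q, N, n)
    return [s + h / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]


def evolve_same_population(R0, pi0, Q, N, t_end, h=1.0):
    """Integrate R_aa(t) from 0 to t_end for a constant population size N."""
    n = len(pi0)
    state = [x for row in R0 for x in row] + list(pi0)
    for _ in range(int(round(t_end / h))):
        state = rk4_step(state, h, Q, N, n)
    return [state[j * n:(j + 1) * n] for j in range(n)], state[n * n:]


# Assumed illustrative values: symmetric two-allele model, constant N.
mu, N = 0.0002, 1000.0
Q = [[-mu, mu], [mu, -mu]]
R_start = [[0.5, 0.0], [0.0, 0.5]]  # both sampled chromosomes identical at t = 0
R_end, pi_end = evolve_same_population(R_start, [0.5, 0.5], Q, N, t_end=10000.0)
print(R_end)
```

For this symmetric model, setting the right-hand side of (4) to zero gives diagonal entries $(\mu + 1/(2N))/(4\mu + 1/N)$, so the printed diagonal should be close to 0.389 after 10,000 generations.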

2.2. Description of a demographic network.


We consider a demographic network of populations evolving from an an-
cestral population. Evolution in the network begins at time t0 = 0 and con-
tinues in forward time. The network includes three types of discrete events:
merges of two populations into one, splits of a single population into two
and migrations between pairs (possibly all) of populations in the network.
These events are chronologically ordered and occur at times $t_i$,
$1 \le i \le I$, with $t_i \le t_{i+1}$, where $t_I$ is the present. We allow
more than one event to occur at the same time, but these events are
distinguished from each other and are considered separately one after another
according to the numbering order. We denote the number of populations in the
network in the time interval $[t_i, t_{i+1})$ by $\kappa_i \ge 1$, where
$\kappa_0 = 1$ and $\kappa_i = \kappa_{i-1} + \gamma_i$, and $\gamma_i$ is the
indicator of change of the number of populations corresponding to the type of
event which occurred at time $t_i$ (a split adds one population and a merge
removes one):

\[
\gamma_i =
\begin{cases}
1 & \text{split} \\
0 & \text{migration} \\
-1 & \text{merge}
\end{cases}
\tag{5}
\]

Each population in the network is identified in the time interval
$[t_i, t_{i+1})$ by a single index $k \in \{0, 1, \dots, \kappa_i - 1\}$.
Numbering starts from 0 for convenience of software implementation. The index
of the population may change as a result of an event. If the index of the
population is $k$, $0 \le k < \kappa_i$, in the time interval
$[t_i, t_{i+1})$, then we denote by $k'$, $0 \le k' < \kappa_{i-1}$, the
corresponding index in the preceding time interval $[t_{i-1}, t_i)$. If
population $x$ splits at time $t_i$, then

\[
k =
\begin{cases}
k' & k' \le x \\
x + 1 & \text{for a newly created population} \\
k' + 1 & k' > x.
\end{cases}
\tag{6}
\]

If two populations with indices $x$ and $y$ ($x < y$) merge at time $t_i$, then

\[
k =
\begin{cases}
k' & k' < y \\
k' - 1 & k' > y,
\end{cases}
\tag{7}
\]

i.e., the merged population has index $k = x$ and the population with index
$k' = y$ is removed from the network.
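
The index bookkeeping of equations (6) and (7) can be sketched as follows. This is an illustrative reading rather than the published code: the function names are hypothetical, and mapping the removed index $k' = y$ onto $x$ is our interpretation of the statement that the merged population keeps index $x$.

```python
def index_after_split(k_prev, x):
    """Equation (6): index after population x splits, for a population that
    had index k_prev in the preceding interval."""
    return k_prev if k_prev <= x else k_prev + 1


def new_population_index(x):
    """Equation (6): the population created by the split of x gets index x + 1."""
    return x + 1


def index_after_merge(k_prev, x, y):
    """Equation (7): index after populations x and y (x < y) merge."""
    if not x < y:
        raise ValueError("equation (7) assumes x < y")
    if k_prev == y:
        return x  # interpretation: y is absorbed into the merged population x
    return k_prev if k_prev < y else k_prev - 1


# Four populations 0..3; population 1 splits, giving five populations.
after_split = [index_after_split(k, 1) for k in range(4)]
created = new_population_index(1)

# Five populations 0..4; populations 1 and 3 merge, giving four populations.
after_merge = [index_after_merge(k, 1, 3) for k in range(5)]
print(after_split, created, after_merge)
```

In the split example the old indices 0, 1, 2, 3 become 0, 1, 3, 4 and the new population takes index 2; in the merge example the old indices 0, 1, 2, 3, 4 become 0, 1, 2, 1, 3.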
A migration event at time $t$ is described by the matrix
$M(t) = \{m_{xy}(t)\}$, $0 \le x, y < \kappa_i$, with entries $m_{xy}$,
$0 \le m_{xy} \le 1$, $m_{xx} = 0$, being the migration rates from population
$x$ to $y$. Migration does not change the indices of the populations. Such a
migration model allows for high versatility in describing the migration
pattern among populations. Continuous migration can be approximated by
introducing either many migration events separated by short time periods or a
single migration event for each time period between other network events. In
the latter case, the rates of the single migration events must be estimated so
that they correspond to the rates of the continuous migration model.
The population size of population $k$ at time $t \in [t_i, t_{i+1})$ is
denoted by $N_{ik}(t)$, $0 \le k < \kappa_i$.

2.3. Relations between populations in the network.


We consider a genetic feature (allele) associated with a haploid chromosome
which can be sampled from any population in the network. We describe this
feature using the allelic space $A$ containing $N_A$ allelic types indexed
from 1 to $N_A$. We seek the answer to the following question: What is the
probability that a chromosome randomly sampled from population $a$ at time $t$
has allele $j$ and that another individual from population $b$ ($a = b$ is
admissible) has allele $k$? These probabilities are entries of the joint
distribution matrices $R_{ab}(t) = \{R_{ab}(j,k)(t)\}$,
$t \in [t_i, t_{i+1})$, $0 \le a, b < \kappa_i$ and $j, k \in A$. To avoid
notational ambiguity, we denote by $R_{ab}(j,k)$ the $j$th row, $k$th column
entry of matrix $R_{ab}$.
As above, if the index of the population is denoted by $k$,
$0 \le k < \kappa_i$, in the time interval $[t_i, t_{i+1})$, then this
population has index denoted by $k'$, $0 \le k' < \kappa_{i-1}$, if it existed
in the previous time interval $[t_{i-1}, t_i)$. Matrices
$R_{ab}(t_i) = \lim_{t \downarrow t_i} R_{ab}(t)$ and
$R_{a'b'}(t_i^-) = \lim_{t \uparrow t_i} R_{a'b'}(t)$ indicate the joint
allelic distributions in the two populations immediately after and immediately
before the event at time $t_i$, respectively.
If a split event occurs at time $t_i$, the allele on the chromosome in the
splitting population is inherited by two chromosomes, each in a different
progeny population. Hence for this case we obtain the following identity:

\[
R_{ab}(t_i) = R_{a'b'}(t_i^-)
\tag{8}
\]

If the event that occurs at time $t_i$ is a merge, the allele in the merged
population is sampled from the two merging populations $x$ and $y$ with
respective probabilities $p$ and $q = 1 - p$, where
$p = N_{(i-1)x}(t_i^-) / [N_{(i-1)x}(t_i^-) + N_{(i-1)y}(t_i^-)]$. This
results in the following formula for the joint distributions:

\[
R_{ab}(t_i) =
\begin{cases}
R_{a'b'}(t_i^-) & x \ne a,\ x \ne b \\
p R_{a'b'}(t_i^-) + q R_{yb'}(t_i^-) & a = x,\ b \ne x \\
p R_{a'b'}(t_i^-) + q R_{a'y}(t_i^-) & b = x,\ a \ne x \\
p^2 R_{xx}(t_i^-) + 2pq R^{+}_{xy}(t_i^-) + q^2 R_{yy}(t_i^-) & a = x,\ b = x
\end{cases}
\tag{9}
\]

where $R^{+}_{xy}(t) = [R_{xy}(t) + R_{yx}(t)]/2$.
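
A minimal sketch of the merge rule (9) is given below, under stated assumptions: the joint distributions are kept in a dictionary keyed by pairs of population indices, matrices are plain nested lists, and the names `lin` and `merge_joint` are hypothetical. The cases are read as "neither index is the merged population", "exactly one is", and "both are".

```python
def lin(c1, A, c2, B):
    """Entry-wise linear combination c1*A + c2*B of two equally sized matrices."""
    return [[c1 * a + c2 * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def merge_joint(R_prev, kappa_prev, x, y, Nx, Ny):
    """Apply equation (9): populations x < y merge; sizes just before are Nx, Ny.

    R_prev maps (a', b') -> joint distribution matrix before the event.
    Returns a dict keyed by the post-merge indices (a, b).
    """
    p = Nx / (Nx + Ny)
    q = 1.0 - p
    # prev[k] is the pre-merge index k' of the population with new index k;
    # because x < y, the merged population keeps index x (equation (7)).
    prev = [k for k in range(kappa_prev) if k != y]
    R_new = {}
    for a, ap in enumerate(prev):
        for b, bp in enumerate(prev):
            if a != x and b != x:
                R_new[a, b] = R_prev[ap, bp]
            elif a == x and b != x:
                R_new[a, b] = lin(p, R_prev[x, bp], q, R_prev[y, bp])
            elif a != x and b == x:
                R_new[a, b] = lin(p, R_prev[ap, x], q, R_prev[ap, y])
            else:
                sym = lin(0.5, R_prev[x, y], 0.5, R_prev[y, x])  # R^+_xy
                ends = lin(p * p, R_prev[x, x], q * q, R_prev[y, y])
                R_new[a, b] = lin(1.0, ends, 2.0 * p * q, sym)
    return R_new


# Assumed toy input: population 0 carries only allele 1, population 1 only allele 2.
R_before = {
    (0, 0): [[1.0, 0.0], [0.0, 0.0]],
    (0, 1): [[0.0, 1.0], [0.0, 0.0]],
    (1, 0): [[0.0, 0.0], [1.0, 0.0]],
    (1, 1): [[0.0, 0.0], [0.0, 1.0]],
}
merged = merge_joint(R_before, 2, 0, 1, Nx=3000.0, Ny=1000.0)
print(merged[0, 0])
```

With sizes 3000 and 1000 the merged population carries the two alleles with frequencies 0.75 and 0.25, and the resulting joint distribution is the product of these frequencies, as expected for two populations that were each monomorphic.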
A single migration event from one population $x$ to another $y$ is considered
a merge of the whole destination population $y$ with a part of the population
$x$. Only the distributions of the destination population are affected.
Assuming that the event that occurred at time $t_i$ is described by the
migration matrix $M(t_i)$, the size of the part of the population $x$
contributing to the event is given by $m_{xy}(t_i) N_{(i-1)x}(t_i^-)$.
A migration event in the network describes all possible migrations between
two populations in the network. Therefore, a single population may be affected
by a number of different migrations taking place at the same time. This leads
to complex relationships between the joint distributions characterizing the
populations involved in the migration events. We model such a migration event
using the following two-step scenario:

• Split each population in the network $\kappa_i - 1$ times in order to
separate all subpopulations migrating out of populations. The population size
ratio parameters used in the splits are given by the migration matrix
$M(t_i)$.
• Merge the migrating subpopulations determined in the previous step
with the proper destination populations. It is not necessary to apply any
particular ordering to these merges.

Suppose there are $\kappa_i$ subpopulations present. Then $\kappa_i^2$
populations (and $\kappa_i^4$ joint distributions) need to be stored
simultaneously. To optimize this method we may separately consider the
subpopulations migrating from each original population, and then apply the
splits and merges scenario $\kappa_i$ times, once for each determined joint
subpopulation. Merge operations immediately follow splits. As a result, at
most $2\kappa_i + 1$ populations are stored at the same time.
If the migration event changes the population sizes, the modified values
of the population size satisfy the following formula:

\[
N_{ix}(t_i) = \left( 1 - \sum_{k=0}^{\kappa_i - 1} m_{xk}(t_i) \right)
  N_{(i-1)x}(t_i^-)
  + \sum_{k=0}^{\kappa_i - 1} m_{kx}(t_i)\, N_{(i-1)k}(t_i^-)
\tag{10}
\]
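
Formula (10) can be checked on a small case with the sketch below. The function name `sizes_after_migration` and the two-population migration matrix are assumptions chosen for illustration; `M[x][y]` is read as the fraction of population x that moves to population y.

```python
def sizes_after_migration(sizes, M):
    """Equation (10): population sizes immediately after a migration event.

    sizes[x] is the size of population x just before the event and
    M[x][y] the migration rate from x to y (with M[x][x] == 0).
    """
    kappa = len(sizes)
    updated = []
    for x in range(kappa):
        outflow_rate = sum(M[x][k] for k in range(kappa))
        inflow = sum(M[k][x] * sizes[k] for k in range(kappa))
        updated.append((1.0 - outflow_rate) * sizes[x] + inflow)
    return updated


# Assumed example: 20% of population 0 moves to 1, 5% of population 1 moves to 0.
before = [1000.0, 2000.0]
M = [[0.0, 0.2],
     [0.05, 0.0]]
after = sizes_after_migration(before, M)
print(after)
```

Here population 0 loses 200 individuals and gains 100, while population 1 loses 100 and gains 200, so the sizes become approximately 900 and 2100 and the total is unchanged.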

2.4. Time and memory complexity.


The effect of mutation and drift between two events is computed according to
equation (4). To solve the ODE numerically we use the Runge-Kutta 4th order
(RK4) algorithm [39, 40], with adaptive control of the step size using the
Cash-Karp method [41, 42].
The population-specific intensity matrix $Q$ is usually sparse, whereas the
number of allele types $N_A$ is large. The most time-consuming operation of
the implemented Runge-Kutta algorithm is the $RQ$ multiplication. Exploiting
the sparsity of matrix $Q$ significantly reduces the time complexity. We
accomplish this by storing, for a matrix with the fraction of non-zero entries
not exceeding a threshold value, all non-zero values in a set of lists. Each
list corresponds to a row or column of the matrix. This results in
second-order polynomial time complexity of multiplication if at least one of
the matrices multiplied is sparse.
Time $T$ and memory $M$ complexities depend on the number of populations $n$
and the size of the allelic types space $N_A$. Time complexity also depends on
the mutation model. We assume that the intensity matrix $Q$ is sparse with an
average of $c \ll N_A$ nonzero values per row or per column. In each RK4 step
we need 12 matrix multiplications, each with complexity $cN_A^2$, and about 60
other operations running in $cN_A$ time but requiring the initialization of
the matrix. Thus, $T = \kappa^2 k r (60 + 8c) N_A^2$, where $k$ is the number
of splits or merges and $r$ is the average number of steps in the RK4
algorithm for a single time interval. Usually $r < 100$, especially when we
use an adaptive step control algorithm. The method is feasible even for
$N_A \approx 1000$ and $\kappa > 10$.
We need to store $n$ intensity matrices and $n^2$ joint distributions in the
algorithm, therefore $M = 8(n^2 + n)N_A^2$ bytes. As we see, the memory
requirement is manageable even for large realistic cases.
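
To give a sense of scale, the memory formula can be evaluated directly. The helper below is a hypothetical convenience function, and the factor 8 is taken to be the size in bytes of one double-precision entry (an assumption about the formula's origin); the chosen values $n = 10$ and $N_A = 1000$ match the upper range quoted above.

```python
def memory_bytes(n_populations, n_alleles):
    """M = 8 (n^2 + n) N_A^2: n intensity matrices plus n^2 joint
    distributions, each N_A x N_A, at an assumed 8 bytes per entry."""
    return 8 * (n_populations ** 2 + n_populations) * n_alleles ** 2


estimate = memory_bytes(10, 1000)
print(estimate, "bytes =", estimate / 1e9, "GB")
```

For ten populations and 1000 allelic types this evaluates to 880,000,000 bytes, i.e. somewhat under 1 GB.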

2.4.1. Time benchmarks.


All calculations of execution times in this section have been conducted
using an average personal computer with a 2 GHz eight-core processor.
We assume that the main benchmark model is a single population scenario in
which one population evolves over a time period of 10,000 generations, long
enough to model most realistic modern human demographies. We use five
different values of the allele state space size $N_A$: 2, 8, 32, 128 and 512.
The times presented in Figure 1A show that our model is feasible for
$N_A \approx 1{,}000$ and runs extremely fast for small values of $N_A$.
Next, we add split events to the single-population scenario; these are the
only events that not only increase the number of disjoint time periods in
which the population is modeled under the mutation-drift model, but also
increase the number of populations. From this point of view, splits are the
worst-case scenario. We start with one population and after every 100
generations we split the ancestral population into two. After we reach the
assumed number of populations in the network, we model the network until
10,000 generations elapse. We consider two different values of $N_A$: 2 and
32, and four different values of population number: 1, 2, 5 and 10. The
results are presented in Figure 1B. The model can be applied to complex
demographies with more than 10 populations, in which the locus is a short
haplotype sequence or a realistic human microsatellite. These cases cover many
applications; see Examples 1, 2 and 3 in the Results section. None of the
existing direct calculation programs can model such demographies.
In Figure 1C we show that our program runtime only weakly depends on
the simulated time length. In this experiment, we run the multipopulation
scenario, but we vary the number of generations that follow the last split
over the 0 to 100,000 range. The execution time increases logarithmically as
a function of the simulation time.
The program that calculates the joint distributions for a demography
described by an input script file is available on the website
http://sun.aei.polsl.pl/~twojdyla/genpop/.

3. Results
In this section we present several examples of how the DN Model can
be applied to analyze data. In the first application we model a simple
two-population network and use our model to determine values of the linkage
disequilibrium. In the subsequent applications we apply the method to modeling
of ancestral genetic data.

3.1. Simple example: Linkage disequilibrium caused by drift and population size change.

We assume that at time $t = 0$ an ancestral population splits into two
subpopulations. The first one is the ancestral population with constant size
$N = 1000$. The second population starts from 1000 individuals and grows
exponentially with a parameter 0.001 per generation. We consider a single
nucleotide mutation model with recurrent mutations of two variants: $A$ and
$a$. The mutation rate in both populations is constant and equal to 0.0002 per
generation. We denote the marginal distributions by
$p_{ij}^{(1)} = R_{ij}(A,A)(t) + R_{ij}(A,a)(t)$,
$p_{ij}^{(2)} = R_{ij}(a,A)(t) + R_{ij}(a,a)(t)$,
$q_{ij}^{(1)} = R_{ij}(A,A)(t) + R_{ij}(a,A)(t)$ and
$q_{ij}^{(2)} = R_{ij}(A,a)(t) + R_{ij}(a,a)(t)$.
We compute the normalized Lewontin linkage disequilibrium $D'$ between
populations $i$ and $j$ [43]: $D'(t) = D(t)/D_{max}(t)$, where
$D(t) = R_{ij}(A,A)(t) - p_{ij}^{(1)} q_{ij}^{(1)}$ is a non-normalized
linkage disequilibrium and
$D_{max}(t) = \min(p_{ij}^{(1)} q_{ij}^{(1)}, p_{ij}^{(2)} q_{ij}^{(2)})$ if
$D(t) < 0$ or
$D_{max}(t) = \min(p_{ij}^{(1)} q_{ij}^{(2)}, p_{ij}^{(2)} q_{ij}^{(1)})$
otherwise. Lewontin’s index is usually applied to quantify linkage between
alleles at different loci on the same chromosome, but here it is used to
quantify dependence between the alleles at the same locus on different
chromosomes.
The results presented in Figure 2 are consistent with intuition. The joint
distribution in population 0 remains at equilibrium while in population 1 it
gradually evolves, leading to a decrease of the value of $D'$ as the force of
drift decreases in a growing population. The cross-association of homologous
loci from two different populations rapidly decreases after the split.
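
The normalized index defined above can be computed from a 2 x 2 joint distribution as in the following sketch. The function name `lewontin_d_prime`, the row and column ordering (A first, a second) and the example matrices are assumptions for illustration; returning 0 when $D_{max}$ vanishes is a convention chosen here, not one stated in the text.

```python
def lewontin_d_prime(R):
    """Normalized Lewontin D' from a 2x2 joint distribution R_ij.

    Rows index the allele (A, a) of the chromosome from population i,
    columns the allele (A, a) of the chromosome from population j.
    """
    p1 = R[0][0] + R[0][1]  # P(first chromosome carries A)
    p2 = R[1][0] + R[1][1]  # P(first chromosome carries a)
    q1 = R[0][0] + R[1][0]  # P(second chromosome carries A)
    q2 = R[0][1] + R[1][1]  # P(second chromosome carries a)
    D = R[0][0] - p1 * q1
    if D < 0:
        d_max = min(p1 * q1, p2 * q2)
    else:
        d_max = min(p1 * q2, p2 * q1)
    # Convention assumed here: report 0 when a marginal is degenerate.
    return D / d_max if d_max > 0 else 0.0


associated = lewontin_d_prime([[0.4, 0.1], [0.1, 0.4]])
independent = lewontin_d_prime([[0.25, 0.25], [0.25, 0.25]])
print(associated, independent)
```

For the first matrix $D = 0.4 - 0.25 = 0.15$ and $D_{max} = 0.25$, so $D'$ is approximately 0.6; for the second, the two chromosomes are independent and $D' = 0$.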

3.2. Predictions and estimates of species and populations history.
3.2.1. Example 1. Cro-Magnoids, Neanderthals and Modern Europeans.
Availability of genetic data from different species and populations allows
various, often very sophisticated, intra- and inter-population analyses. In
particular, it is possible to estimate the past demography of these
populations, including interactions between populations. An example of a
commonly used model-based analysis [5] is provided by the works of Barbujani’s
group who, among others, studied the genetic relationship among Neanderthals
(N), Cro-Magnoids (CM) and modern Europeans (M) [44]. The authors calculated
the values of parameters such as pairwise difference or haplotype diversity
for samples drawn from populations simulated under about a dozen hypothetical
demographic scenarios, and compared these values to the data obtained
from DNA of the individuals in the sample. Simulations followed the so-called
serial coalescent [31], a reverse-time multipopulation algorithm. The area
of applicability of our demographic network model overlaps with that of the
serial coalescent. We apply the DN Model to the scenarios used by Barbujani’s
group in paper [44], to obtain the estimates of pairwise differences among
populations. Details are relegated to the Appendix; results are in good
agreement with ref. [44].

3.2.2. Example 2. Agriculture spread in Europe.


In Neolithic Europe a transition occurred from hunting and gathering
(HG) to an agricultural lifestyle. The spread of farmers (F) originated around
11,000 years ago in the Near East and farming reached Scandinavia about
6,000 years ago [45]. The advantage of the new form of food acquisition
caused a complete replacement of the HG communities. What is unknown
is the manner in which the replacement proceeded. There are two theories
explaining the process. The first claims that HG learned the technique from
their farming neighbours, while the second posits a genetic admixture between
these two communities [10]. To answer the question of which theory is more
likely, we estimate the migration rate $m$ between the HG and F populations
using the mitochondrial hypervariable segment I (HVS-I mtDNA) data from
four populations and the demographic model in Figure 3. Small values of $m$
correspond to the first theory and large values to the second. Details are
relegated to the Appendix; results again show good agreement with those
obtained by other methods.
We tested the DN Model results using parametric bootstrap analysis in which
we first generated multiple samples using fastsimcoal2, then calculated the
$F_{ST}$ values for the simulated populations, and finally estimated model
parameters based on these simulation-based values using the DN Model. In the
fastsimcoal2 simulations we used population structures exactly as defined in
our models. We also used the same sample sizes as given by the genealogic
data. Overall, we performed 20 simulations of the model describing
agricultural spread. The number of simulations is low in this case because of
the necessity to manually run the Nelder-Mead local search method multiple
times for each simulation.
We determined, for each obtained set of $F_{ST}$ values, the range of model
parameter values that minimizes the error function $\delta$, in the way
explained in the Appendix. In Figure 5A-D we present, for each of the four
model parameters $N_A$, $N_F$, $N_{HG}$ and $m$, the number of simulations
(out of 20) in which the obtained range contains a given parameter value,
plotted as a function of that value. We notice that the functions obtained for
the population sizes have peaks near the values obtained directly from the DN
Model. The migration rate is more widely spread and its peak values are lower
than our estimate (equal to 0.4). Nonetheless, the results support the
hypothesis of nonzero migration rate between farmers and hunter-gatherers.

3.2.3. Example 3. Finnish admixture in Balts and Slavs.


In contrast to the previous examples, this one is, as far as we know, orig-
inal. It also requires developing a specialized version of the master equation
describing drift and mutation between demographic events that is applica-
ble to microsatellite loci (see further on). We use the demographic network
model to study the history of several Central Eastern European populations
based on Y chromosome data. We focus on relations between the Slavs occu-
pying the territory of the modern Poland and the Central Balts – ancestors of
the modern Lithuanians and Latvians. Both Proto-Slavs and Proto-Balts are
likely to have originated from nomadic tribes that had left the Indo-European
homeland and diversified into most of the modern European people. It is ar-
gued that these groups came to Europe in the 2nd millenium BC [46, 47, 48],
but the exact chronology is unclear [49]. Both Poles and Lithuanians are
genetically considerably distinctive [50] from other European populations.
However, analysis of the Y-chromosome haplogroups of the modern Eastern
Europeans shows that other ancient populations should also be considered
[51]. The Balto-Slavic branch of the Indo-European genealogy belongs to
the R1a haplogroup. However, about 45% of the population of the modern
Lithuania and Latvia belongs to the N1c1 haplogroup, which is the hap-

13
logroup of the non-Indo-European North-Eurasian Finns who entered Europe
during the Corded Ware period at the beginning of the 3rd millenium BC
[52]. Balts, after the split from other Indo-European groups, settled along the
south-eastern coast of the Baltic Sea and assimilated with the Finnic tribes
living there [53]. Our studies concern the influence the N1c1 haplogroup had
on the Balts and Slavs. The exact times at which the Balts, Slavs and Finns
split from their ancestral populations are unknown, and estimating these
times is one of the objectives of the current example. Another interesting
aspect of the Balt-Slav relationship is the migration between these two
groups in the 6th century, when Slavs appeared for the first time on the
territory of Poland, and in the 14th century, when the Commonwealth of Poland
and Lithuania (1384 - 1795) came into existence. Figure 4A illustrates the
modeled demographic scenario. We estimate the ancient population sizes of all
groups under the assumption that human population growth rates are equal to
Kremer’s values [54]. Between two adjacent Kremer’s times the population size
is assumed to change exponentially, with a single human generation lasting
25 years; the resulting rates are listed on the left in Figure 4A. Following
the approach of Barbujani’s group [44], we assume that the ratio of effective
to census population size in humans lies between 0.3 [55] and 0.5 [56]. Since
considering Y-chromosome STR haplotypes introduces a factor of 1/4 to this
ratio [57], we assume that the effective population size is ten times smaller
than the census size. These values are in apparent contradiction to the
generally accepted estimates of the effective population size of modern
humans, which are of the order of 10^4 − 10^5 [58]. However, in the current
setting, it seems more appropriate to consider the ”ecological” effective
population sizes, as ref. [44] does.
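The size calculation described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the function
names, the census figures in the usage lines and the midpoint ratio of 0.4
are hypothetical and are not taken from ref. [54] or from the DN Model code;
only the 25-year generation, the 0.3 to 0.5 range and the factor 1/4 come
from the text.

```python
import math

GENERATION_YEARS = 25  # generation length stated in the text


def growth_rate_per_generation(n_start, n_end, years):
    """Exponential rate r with n_end = n_start * exp(r * generations)."""
    generations = years / GENERATION_YEARS
    return math.log(n_end / n_start) / generations


def census_size(n_start, rate, generations):
    """Census size after exponential growth at a constant per-generation rate."""
    return n_start * math.exp(rate * generations)


def effective_size(census, ratio=0.4, y_factor=0.25):
    """Effective size for Y-chromosome data.

    ratio    -- assumed effective/census ratio (midpoint of 0.3 and 0.5)
    y_factor -- factor 1/4 for the Y chromosome
    With the defaults the result is one tenth of the census size.
    """
    return census * ratio * y_factor


# Hypothetical census sizes at two adjacent dates 1000 years apart.
rate = growth_rate_per_generation(1000.0, 2000.0, 1000.0)
print(rate, effective_size(census_size(1000.0, rate, 40)))
```

With a ratio of 0.3 or 0.5 the combined factor is 0.075 or 0.125, which is
why a single factor of about one tenth is a reasonable summary of the range.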
We use Slatkin’s RST [59] to quantify the distance between two populations.
We calculate the RST distance between Slavs and Balts based on the data from
919 unrelated male Polish individuals sampled from six geographical regions
of Poland and 297 Balt descendants (152 from Vilnius and 145 from Riga).
Genetic data at nine microsatellite loci (DYS19, DYS389I, DYS389II, DYS390,
DYS391, DYS392, DYS393, DYS385a and DYS385b) are considered. The data can be
obtained from ref. [60] or [61]. We use the Arlequin program [62] to obtain a
normalized [63] value of Slatkin’s distance. The resulting RST value for the
samples of Poles and Balts is 0.03862. Our aim is to adjust the parameters of
the scenario in Figure 4A to obtain a model that explains this value of the
RST distance.
Developing a version of the Lyapunov Equation suitable for
microsatellite loci. It is known that under the Stepwise Mutation Model
(SMM), it is sufficient to model the differences in the number of tandem
repeats between loci rather than the absolute numbers [64]. Therefore,
distributions Rab (k, k+i)(t) become infinite vectors (rab (i, t), i ∈ Z)
with components indexed by all integers and consecutive entries corresponding
to these differences. For practical applications, we assume that the range of
possible differences is finite and that large differences are unlikely. The
center entry rab (0, t) of the vector corresponds to the case in which the
difference in the number of tandem repeats is zero. As a result, we replace
the Lyapunov equation (4) by the following specialized equation:

\[
\frac{d\hat{R}_{ab}(s,t)}{dt} = -(v_a + v_b)\left(1 - \frac{s}{2} - \frac{1}{2s}\right)\hat{R}_{ab}(s,t)
  + \frac{\delta_{ab}}{N_a(t)}\left(\Pi - \hat{R}_{ab}(s,t)\right), \tag{11}
\]
where $\hat{R}_{ab}(s,t) = \sum_{i=-\infty}^{\infty} s^i r_{ab}(i,t)$ is the
probability generating function of an integer-valued random variable equal to
the difference in allele size between individuals sampled from populations
$a$ and $b$ ($0 \le a, b < \kappa_i$) at time $t \in [t_i, t_{i+1})$, $v_a$
and $v_b$ are the mutation intensities in the two populations, $\Pi$ is a
vector of the same dimension as $\hat{R}(s,t)$ with the value of 1 at $i = 0$
and all other entries equal to 0, and $\delta_{ab}$ is the Kronecker delta. A
detailed derivation of expression (11) may be found in ref. [65] and, more
recently, in ref. [66]. Deeper mathematical foundations have been established
in ref. [67]. Given the values of $\hat{R}_{aa}(s,t)$, $\hat{R}_{bb}(s,t)$
and $\hat{R}_{ab}(s,t)$, the average sum of squared difference distance (RST
distance) between populations $a$ and $b$ is given by the following formula:

\[
R_{ST} = \frac{2V_{ab}(t) - V_{aa}(t) - V_{bb}(t)}{V_{ab}(t)}, \tag{12}
\]
where Vxy (t) is the variance of the allele size differences in populations x and
y (x = y is admissible) given by the following formula:


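\[
V_{xy}(t) = \sum_{i=-\infty}^{\infty} i^2 r_{xy}(i,t)
  = \left[\frac{\partial^2 \hat{R}_{xy}(s,t)}{\partial s^2}
  + \frac{\partial \hat{R}_{xy}(s,t)}{\partial s}
    \left(1 - \frac{\partial \hat{R}_{xy}(s,t)}{\partial s}\right)\right]_{s \uparrow 1} \tag{13}
\]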
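To illustrate how equations (11)-(13) might be used numerically, the sketch
below integrates the coefficient form of equation (11), obtained by matching
powers of s: dr(i)/dt = −(va + vb)[r(i) − r(i−1)/2 − r(i+1)/2] +
(δab/Na)[π(i) − r(i)], where π is the point mass at i = 0. The truncated
range of differences, the constant population size, the explicit Euler scheme
and all parameter values are assumptions made for this illustration; none of
them is claimed to match the authors' implementation. Because the
distribution of differences is symmetric around zero here, the variance is
computed as the second moment, and RST follows formula (12) as printed.

```python
def make_point_mass(k_max):
    """Vector Pi: probability 1 at difference 0, indices -k_max..k_max."""
    r = [0.0] * (2 * k_max + 1)
    r[k_max] = 1.0
    return r


def euler_step(r, nu, coal_rate, dt):
    """One explicit Euler step of the coefficient form of equation (11).

    nu        -- v_a + v_b, total mutation intensity of the pair
    coal_rate -- delta_ab / N_a; zero when a and b differ
    """
    n = len(r)
    centre = n // 2
    new_r = []
    for k in range(n):
        left = r[k - 1] if k > 0 else 0.0
        right = r[k + 1] if k < n - 1 else 0.0
        mutation = -nu * (r[k] - 0.5 * left - 0.5 * right)
        target = 1.0 if k == centre else 0.0
        coalescence = coal_rate * (target - r[k])
        new_r.append(r[k] + dt * (mutation + coalescence))
    return new_r


def integrate(r, nu, coal_rate, generations, dt=0.1):
    """Repeat Euler steps over the given number of generations."""
    for _ in range(round(generations / dt)):
        r = euler_step(r, nu, coal_rate, dt)
    return r


def variance(r):
    """Second moment of the difference distribution (its mean is zero)."""
    centre = len(r) // 2
    return sum((k - centre) ** 2 * p for k, p in enumerate(r))


def rst(v_aa, v_bb, v_ab):
    """R_ST as written in formula (12)."""
    return (2.0 * v_ab - v_aa - v_bb) / v_ab


# Hypothetical setting: two populations of size 500 that split 100 generations
# ago from a monomorphic ancestor, with v_a = v_b = 0.002 per generation.
NU, SIZE, TIME, K_MAX = 0.004, 500.0, 100, 40
v_ab = variance(integrate(make_point_mass(K_MAX), NU, 0.0, TIME))
v_aa = variance(integrate(make_point_mass(K_MAX), NU, 1.0 / SIZE, TIME))
print(v_ab, v_aa, rst(v_aa, v_aa, v_ab))
```

In this setting the between-population variance grows linearly at rate
va + vb, while the within-population variance relaxes towards (va + vb)Na,
which gives a quick sanity check on any implementation.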

Figure 4B-D presents estimates of the genetic distance between descendants
of Slavs and Balts for different times of splits and migration rates. The
required value of Slatkin’s distance may be obtained by more than one set
of parameter values. Therefore, although the estimates provide information
about the common history of the modeled groups, they are not sufficient to
determine the exact scenario. The estimates depend on the effective
population sizes, which cannot be precisely estimated, although recently
developed methods indicate improvement in this area, especially if the
genetic data are of good quality [30].
We also considered a scenario in which the Balts received no admixture
from the Finns (consider Figure 4A leaving out the Finns). The RST
distance is presented in Figure 4E as a function of the Balt-Slav split time
T1 . In this case the RST distance is dramatically lower than
that computed based on the microsatellite data.
To verify our method of estimating the RST distance between populations,
we compare our results to those generated by fastsimcoal2 [33]. We consider
three models: i) our Slavs-Balts model with parameter values m = 0.2,
T1 = −1500, T2 = −3500 and T3 = −15000, ii) a population with a constant
size of 3000 that splits into two populations of sizes 2000 and 1000
g generations before the present time (g varies between 100 and 1000), and
iii) the model from ii) with an additional migration event occurring
100 generations after the split. In each serial coalescent experiment we
generate 10 loci samples and average the RST values over 20 random coalescent
genealogies. In each case, the difference in the RST estimates between the
two methods is lower than the standard deviation of the serial coalescent
results. In model i), in which we fitted the DN Model to the data-based
RST = 0.03862, the serial coalescent method gives an RST value between Slavs
and Balts equal to 0.0555 ± 0.0344 (Figure 4F). DN Model estimates calculated
for model ii) with g = 1000 give RST = 0.456, while from fastsimcoal2 we
obtain 0.396 ± 0.263. Introducing the migration event decreases the RST
values to 0.336 using the DN Model and to 0.283 ± 0.211 for fastsimcoal2.
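The agreement criterion used above (the difference between the two estimates
is smaller than the standard deviation of the simulation results) can be
checked directly from the reported numbers. The short sketch below does only
this arithmetic; the labels and the helper name are hypothetical, and for
model i) the first value is the data-based RST to which the DN Model was
fitted.

```python
# (label, DN Model value, fastsimcoal2 mean, fastsimcoal2 standard deviation)
COMPARISONS = [
    ("i) Slavs-Balts model", 0.03862, 0.0555, 0.0344),
    ("ii) split, g = 1000", 0.456, 0.396, 0.263),
    ("iii) split with migration", 0.336, 0.283, 0.211),
]


def within_one_sd(dn_value, sim_mean, sim_sd):
    """True when the two estimates differ by less than one standard deviation."""
    return abs(dn_value - sim_mean) < sim_sd


for label, dn_value, sim_mean, sim_sd in COMPARISONS:
    difference = round(abs(dn_value - sim_mean), 4)
    print(label, difference, within_one_sd(dn_value, sim_mean, sim_sd))
```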
Additionally, we performed a bootstrap analysis of the Slavs-Balts model,
in which we fix the split times at the previously obtained values
(T1 = −1500, T2 = −3500 and T3 = −15000) and, under this assumption,
estimate the migration rate m. Figure 5E presents the cumulative number of
occurrences of that migration rate, for a range of migration values
discretized using a 0.01 step, in a sample of 100 simulations. The median
value of the migration rate is equal to 0.17 and is close to the value
estimated originally by the DN Model (which was 0.2). Moreover, for most
simulated demographies, the value of the estimated migration rate is between
0.05 and 0.3.

4. Discussion
In this paper, we introduced the mathematics and applications of
demographic networks. The main advantage of the DN Model is its ability to
efficiently compute the joint allelic distributions in pairs of haploid genomes
when more than two populations are involved. By comparison, in the
framework of diffusion approximations, which provide another frequently
used methodology (cf. Poisson random fields [68]), such computations
frequently seem less straightforward.
Returning to the mathematical core of the DN Model, we may list a
number of limitations. Our model, as of now, concerns neutral evolution only
and does not involve recombination. Recombination might be incorporated, for
example, by extending the algorithm of Polanska and Kimmel [69, 19]. In
addition, the mathematics will be substantially more complicated if
distributions not of pairs, but of larger samples of individuals, are
modeled. The computational complexity in such cases remains to be explored.
Finally, modeling of a wider range of different mutation patterns is to be
perfected. This concerns, among others, modeling of data from genome
sequencing. Selection will most likely be difficult to take into account.
However, even modeling of benchmark neutral scenarios may be a helpful
alternative to existing simulation or exact methods. This is especially true
if the network contains many populations.
We provided three examples of application of the DN Model. Example 1
concerns evolutionary and demographic scenarios linking Cro-Magnon peo-
ple, Modern Humans and Neanderthals which were previously considered by
Belle et al. [44] using serial coalescent techniques. We show that
demographic networks produce results similar to those of the serial
coalescent (Figure 6E).
In Example 2, we study the genetic relationships between hunters-gatherers
and farmers in early Neolithic Europe based on the mtDNA genetic data
obtained from ancient human remains of both groups found on the territory of
modern Germany. Our results are consistent with a high migration rate
(m = 0.4) between those populations. This contradicts the alternative theory
claiming that a farming lifestyle in Europe spread solely by technological
transmission. Low, but slightly higher, values of the error function used in
our model were also obtained for m > 0.8, which suggests that the ancient
genetic material of hunters-gatherers might have been replaced by that of the
farmers almost completely. The general outcome is consistent with the results
obtained previously in ref. [70] and [10] using different approaches.
Example 3, which is original, concerns the little-known demographic ge-
netics of the peoples of Central Eastern Europe: Slavs, Balts and Finns.
Analysis of a demographic network model applied to Y-chromosome mi-
crosatellites indicates that it is necessary to consider the admixture of the
Proto-Finns into the Balts to explain the genetic distance between the
latter and the Slavs (as represented here by Poles). This result is obtained
using the documented and hypothetical timings of historical and prehistori-
cal events and a search in parameter space. Despite a number of unknowns,
the assertion appears robust.
Benchmark results indicate the main areas of use of our model. Our
approach is useful in modeling the evolution of non-recombining genetic ma-
terial from many populations organised in a complex demographic network,
especially when the number of possible allele variants is small. This includes
using mtDNA or Y chromosome data represented either by short haplotype
sequences or microsatellite loci. In such cases our model can be applied for
complex scenarios.

Acknowledgments
A significant part of the research was conducted during Tomasz
Wojdyla’s visits to the Department of Statistics at Rice University. This
work was supported by the Polish National Science Center grant: DEC–
2012/04/A/ST7/00353 (T.W.) and the NCBiR grant POIG 02.03.01-24-099.
(M.K.).

Figure 1: Benchmarks. (A) Influence of the allelic state space size on execution times
of the DN Model. Single population with constant size, evolving over 10,000 generations,
with different sizes of the allelic state space NA . The figure presents execution times of
the DN Model run on an average personal computer. (B) Influence of the number of
populations in the network on DN Model execution times. The ancestral population keeps
splitting into two populations after each 100 generations until the network reaches the
total number of n populations. After that, all n populations evolve separately. The scenario
covers 10,000 generations of the network evolution. We consider two different sizes of the
allelic state space, NA = 2 and NA = 32. Increasing the number of populations x times results
in an increase of the execution time x^2 times. (C) Influence of the model time on the execution
time of the DN Model. The ancestral population keeps splitting into two populations after each
100 generations, reaching the total number of 10 populations. After that, all populations
evolve separately for a given number of generations. Increasing the number of generations
10 times leads to doubling of the execution time.

Figure 2: Simple example - linkage disequilibrium. Lewontin’s index in a
constant size population (D00), an exponentially growing population (D11) and between
these two populations (D10). Both populations evolve from a common ancestral population;
the split event occurs at time t = 0. The ancestral population is assumed to be in a
mutation-drift equilibrium.

Figure 3: Farmer-Hunter-Gatherer demographic model. The model includes:
Eurasian ancestors (A), ancient hunters-gatherers (HG), ancient farmers from LBK culture
(LBK), modern Central Europeans (CE) and modern Near Easterners (NE). The
values listed on the left indicate growth rates per generation with a single generation of 25
years. Two bottleneck events are included: (i) ancestors of European Paleolithic
hunters-gatherers and (ii) ancestors of European farmers leaving the Indo-European homeland.
Genetic transmission between HG and farmers is modeled by migration rate m1 + m2 , where m1
is the migration rate before the LBK graveyard, and m2 the migration rate after that. For an
explanation of the growth patterns and rates depicted, cf. Example 2 in the Appendix.

Figure 4: Balts-Slavs-Finns example. (A) Demographic model. The model includes:
Eurasian Ancestors (EA), Indo-Europeans (I), Finns (F), Slavs (S), Balts (B) and Poles
(P). Population sizes before the year 1AC are estimated using Kremer [54] rates. Values
listed on the left are growth rates per generation with a generation of 25 years. Two
bottleneck events indicate: (i) S and B tribes leaving the Indo-European homeland and
(ii) emergence of P from S. The size of the bottlenecks only slightly changes the value of
RST provided that the population after the bottleneck is at most 1/3 of the original size.
We assume the following values of the bottleneck size ratios: 1/5 in the case of SB and 1/7
for P. This is also consistent with neglecting the bottleneck when I and F split off from
the ancestral EA. Four parameters are varied: (i) T1 – time of split of B and S, (ii) T2 –
time of BS leaving the Indo-European homeland, (iii) T3 – time of split of I and F, and (iv)
m – migration rate between B and P. The exact time of the migration does not influence
the RST value. (B-D) RST distance between Slavs and Balts as a function of: (B) the
Balt-Slav split time T1 , (C) the time T2 when Balts and Slavs left the Indo-European
homeland and (D) the Finn-Indo-European split time T3 . Populations evolve according
to the model presented in panel (A) with fixed times set to T1 = −1500, T2 = −3500
and T3 = −15000. Results for four different values of the migration rate m are depicted.
Dotted line is the data-based estimate of RST . (E) Impact of Proto-Finns on the RST
distance between Slavs and Balts. RST distance between Slavs and Balts as a function of
the Balt-Slav split time T1 . Populations evolve according to the model presented in panel
(A) with T2 = −3500 but without Proto-Finns population. Results for four different values
of the migration rate m are depicted. Discarding Proto-Finns substantially decreases RST
estimates. (F) Empirical cumulative distribution (ECD) of the fastsimcoal2 Monte-Carlo
estimates of RST distance between Slavs and Balts. 20 experiments estimating the RST
distance between Slavs and Balts according to the demographic model of panel (A) have
been run using fastsimcoal2. The figure presents empirical cumulative distribution of the
results. Averaging over all experiments, we obtain RST = 0.0555 ± 0.0344. Model
distance is equal to 0.03862.

Figure 5: Bootstrap analysis of the obtained results. (A-D) Bootstrap analysis of the
agricultural spread model. 20 demographies have been generated using fastsimcoal2 ac-
cording to the demographic model in Figure 3. The Nelder-Mead algorithm has been used
repeatedly for each demography and 4 model parameter values have been estimated (each
parameter value in each experiment is described by the range). The 4 model parameters
are: ancient population size NA , farmer population size NF , hunter-gatherer population
size NHG and migration rate between farmers and hunters-gatherers. The figure presents,
for each of those parameters, number of demographies supporting particular parameter
value. (E) Bootstrap analysis of the Balts-Slavs-Finns model. 100 demographies have been
generated using fastsimcoal2 according to the demographic model in Figure 4A. Then, DN
Model has been used for each demography to estimate migration rate between Slavs and
Balts. The figure presents the empirical cumulative distribution of demographies supporting
migration rates discretized using a 0.01 step on the horizontal axis.

Figure 6: Europeans - Cro-Magnoids - Neanderthals example. (A-D) Types of joint
demographic models of modern Europeans (M) - Cro-Magnoid (CM) - Neanderthals (N
and N’). The numbers at the left are times in human generations (25 years per generation)
counted backwards from the present. The models used in the paper are: L1.1, L1.2 and
H1.1 (A), L1.5 (B), L1.3, L1.4, H1.2 and H1.3 (C), L1.7 (D) [44]. The detailed values of the
growth rates and population sizes for each model may be found in [44]. (E) Comparison
of two methods: serial coalescent and demographic networks. The figure presents pairwise
difference obtained by applying serial coalescent and DN Model to three populations: Cro-
Magnoid (full circles), Neanderthal (open circles) and modern European (crosses) under
different demographic scenarios (listed in the legend and explained in ref. [44]). Infinite-
site model, approximated by haplotypes of 360 nucleotides, was assumed.

Table 1: Pairwise difference calculations between Cro-Magnoid, Neanderthal
and Modern European populations under different demographic scenarios. The
table presents the values of pairwise difference obtained by applying two different methods
(serial coalescent and demographic networks) to three populations: Cro-Magnoid (CM),
Neanderthal (N) and modern European (M) under different demographic scenarios
(explained in ref. [44]). Infinite-site model, approximated by haplotype sequences consisting
of 360 nucleotides, was assumed. Serial coalescent approach required a fixed sample size to
be specified for each population (N – 6, CM – 2 and M – 558). The table presents median
values of the pairwise difference for serial coalescent and mean values for demographic
networks. Also, see Figure 6E.

Method                  Population  Demographic model
                                    L1.1  L1.2  L1.3  L1.4  L1.5  L1.7  H1.1  H1.2  H1.3
Serial coalescent [44]  N           1.9   1.9   0.9   1.5   12.9  3.9   15    7     4.7
                        CM          1     1     1     1     11    4     12    10    7
                        M           1.7   2.4   1.9   2.3   13.6  4.7   18.6  14.6  13
Demographic networks    N           2.2   2.2   1.1   1.7   17.1  1.9   20    7.8   5.2
                        CM          2.2   2.2   1.4   1.8   17.3  2.6   20    11.3  8.5
                        M           2.2   2.8   2.1   2.6   17.9  3.4   25.5  17.9  15.6

Table 2: FST distances for several Near Eastern and European populations. The
table contains the values of the FST distance among four populations: ancient hunters-
gatherers (HG) found on the territory of modern Germany, ancient farmers (LBK) from
graveyard in Derenburg in Germany, modern Central Europeans (CE) and modern Near
Easterners (NE). The values are estimated using mtDNA genetic data from the HVS-I
region.
       HG       LBK      CE       NE
HG     *
LBK    0.09298  *
CE     0.03445  0.03958  *
NE     0.04192  0.03019  0.00939  *

Appendix A.
Example 1
The models used in our calculations are taken from Figure 1 in ref.
[44]. All the models assume that the Neanderthals (N) and Cro-Magnoids
(CM) lived 1700 and 960 generations ago, respectively, with generation time
assumed equal to 25 years. All the models except L1.7 assume a single
population with different growth rates. Model L1.1 is a constant size
population. In L1.2, the population grows after the origin of CM. Models L1.3
and L1.4 add to L1.2 a small growth rate before the origin of N; rapid
expansion from CM to Moderns (M) is assumed in L1.4. In L1.5 the population
grows to the same large size as in L1.4, but with more balanced growth rates
in each time interval. L1.7 models the population from L1.5 with the
assumption that there existed a separate shrinking N population. Models
labeled with H assume a mutation rate ten times larger (0.5 per million years
per nucleotide instead of 0.05 as in models starting with L). The demography
of H1.1 is the same as in L1.2. The demographies of H1.2 and H1.3 are
slightly different versions of L1.3. We summarize all of these demographies
in Figure 6A-D. For more details see ref. [44]. Both approaches assume an
infinite-site mutation model approximated by a long haplotype sequence of 360
nucleotides with two possible variants at each position. Given the joint
distribution Rxx (t) we can calculate the mean pairwise difference ϕx (t) in
population x according to the following formula:


\[
\varphi_x(t) = \sum_{i,j \in A} d(i,j)\, R_{xx}(i,j)(t), \tag{A.1}
\]
where d(i, j) is the number of positions at which the sequences of allelic
types i and j differ. Table 1 and Figure 6E compare our results to those
obtained by Belle and co-workers in ref. [44] using the serial coalescent
approach. Discrepancies between the simulation and our method, which are
largest for the CM population, are to be expected considering the very small
CM sample size (only 2 individuals), and the fact that the median (not mean)
value was listed by Belle et al. The sample sizes of the N and M populations
used in the serial coalescent approach were equal to 6 and 558, respectively.
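Formula (A.1) is a weighted sum of Hamming distances, which the following
sketch makes explicit. The dictionary representation of the joint
distribution, the function names and the toy two-site haplotypes are
assumptions made for illustration; the calculations in the paper use
haplotypes of 360 nucleotides and the joint distributions produced by the DN
Model.

```python
from itertools import product


def hamming_distance(seq_i, seq_j):
    """d(i, j): number of positions at which two haplotypes differ."""
    return sum(1 for x, y in zip(seq_i, seq_j) if x != y)


def mean_pairwise_difference(joint):
    """Formula (A.1): sum of d(i, j) * R_xx(i, j)(t) over ordered pairs.

    joint -- dict mapping (haplotype_i, haplotype_j) to a probability
    """
    return sum(hamming_distance(i, j) * p for (i, j), p in joint.items())


# Toy joint distribution: two independent, uniformly distributed
# haplotypes with two biallelic sites each.
alleles = ["".join(bits) for bits in product("01", repeat=2)]
uniform_joint = {(i, j): 1.0 / len(alleles) ** 2
                 for i in alleles for j in alleles}
print(mean_pairwise_difference(uniform_joint))
```

In the toy case each site differs with probability 1/2, so the mean pairwise
difference equals half the number of sites.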

Example 2
The data we use are predominantly from ref. [10], complemented by other
sources. As the hunter-gatherer (HG) population we used 20 individuals of
post-LGM (Last Glacial Maximum) hunters-gatherers with mtDNA haplogroup
U found on the territory of modern Germany [70]. It is assumed that these
individuals lived about 13,000-10,000 years ago. The genetic data of the
Farmer (F) population were mostly obtained from the Derenburg graveyard
in Germany [71, 10] and consist of mtDNA samples of 42 farmers from the
Neolithic Linear Pottery Culture (LBK). The age of the graveyard has been
estimated to be about 7,200 years. The area of the LBK consists of
territories close to the middle Danube, the upper and middle Elbe, and the
upper and middle Rhine. The two remaining populations are 1030 modern Central
Europeans (CE) from the LBK core area and 737 modern Near Easterners (NE)
from Anatolia. The values of the FST distance between each pair of
populations have been obtained from [70] and [10] and are summarized in
Table 2.
We assume that before the LGM the ancestors of HG had separated
from the common ancestral population (A) and migrated to Europe (Figure
3). Then, about 11,000 years ago the farmers started to spread in Europe,
mixing with the HG individuals at the total migration rate m = m1 + m2 ,
where m1 stands for migration before the time of the graveyard in Derenburg
and m2 after this time. Post-Neolithic demography is modeled by CE-NE
admixture at rate M at the present time, with the frequency of
mtDNA haplogroup U in modern Europe assumed to be equal to
15%. Population size growth rates before the year 2000 BP are as estimated
by Kremer [54], and after that year we assume exponential growth of the
haplogroup U carriers up to 15% of the CE and NE population sizes
at the present time. As a mutation model we use a simple one-base two-allele
haplotype model with recurrent mutations and a mutation rate equal to
3 × 10^−5 per base per generation [72]. We also assume that the ancestral
population was in drift-mutation equilibrium at the time of the split of
the HG population. The value of the FST at time t between populations
x and y may be calculated using the joint distributions Rxx (t), Rxy (t) and
Ryy (t) as follows:

\[
F_{ST} = \frac{R_{xx}(a,A)(t) + R_{xx}(A,a)(t) + R_{yy}(a,A)(t) + R_{yy}(A,a)(t)}
              {2\left(R_{xy}(a,A)(t) + R_{xy}(A,a)(t)\right)}, \tag{A.2}
\]

where a and A are the allele variants.


Overall, we have 6 parameters in the model (NA , NHG , NF , M, m and
m1 ). We use the Nelder-Mead local search method [73] to estimate the
demographic parameters minimizing the error function
\[
\delta = \sum_{i=1}^{6} \left( \ln \frac{F_{ST}^{(i)}}{\hat{F}_{ST}^{(i)}} \right)^{2},
\]
where the six $F_{ST}^{(i)}$ values are those from Table 2, and
$\hat{F}_{ST}^{(i)}$ are the values obtained from the model. The lowest value
of δ (δ < 0.04) is achieved for m = 0.40, with the other parameters in the
following ranges: NA = [5,000; 50,000], NF = [2,500; 3,000],
NHG = [1,700; 1,850], NM = [10,000,000; 15,000,000] and M = 0.37. The overall
result from our model is similar to that published in ref. [10] using
Bayesian Serial SimCoal [74] and it supports the genetic admixture theory.
Our estimates of NF and NHG agree with the most probable values estimated in
ref. [70] using coalescent simulations.
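The error function δ is straightforward to evaluate once the six
model-based FST values are available. The sketch below hard-codes the
observed values from Table 2; the dictionary layout and the function name are
hypothetical, and the model values passed in would come from a DN Model run,
which is not reproduced here.

```python
import math

# Observed F_ST values from Table 2, keyed by population pair.
OBSERVED_FST = {
    ("HG", "LBK"): 0.09298,
    ("HG", "CE"): 0.03445,
    ("LBK", "CE"): 0.03958,
    ("HG", "NE"): 0.04192,
    ("LBK", "NE"): 0.03019,
    ("CE", "NE"): 0.00939,
}


def error_function(model_fst, observed_fst=OBSERVED_FST):
    """delta: sum of squared log-ratios of observed to model F_ST values."""
    return sum(
        math.log(observed_fst[pair] / model_fst[pair]) ** 2
        for pair in observed_fst
    )


# A model that reproduces Table 2 exactly gives delta = 0.
print(error_function(OBSERVED_FST))
```

Because the log-ratio is squared, δ penalizes relative rather than absolute
deviations, so the small CE-NE distance carries the same weight as the large
HG-LBK distance. A function of this form can be handed to any Nelder-Mead
implementation as the objective.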

References
[1] R. E. Green, J. Krause, A. W. Briggs, T. Maricic, U. Stenzel, et al., A
draft sequence of the Neandertal genome, Science 328 (2010) 710–722.
doi:10.1126/science.1188021.
[2] M. Rasmussen, Y. Li, S. Lindgreen, J. S. Pedersen, A. Albrechtsen,
et al., Ancient human genome sequence of an extinct Palaeo-Eskimo,
Nature 463 (2010) 757–762. doi:10.1038/nature08835.
[3] M. Meyer, M. Kircher, M.-T. Gansauge, H. Li, F. Racimo, et al., A
high-coverage genome sequence from an archaic Denisovan individual,
Science 338 (2012) 222–226. doi:10.1126/science.1224344.
[4] J. Hey, C. A. Machado, The study of structured populations – new hope
for a difficult and divided science, Nature Reviews Genetics 4 (2003)
535–543. doi:10.1038/nrg1112.
[5] M. Stoneking, J. Krause, Learning about human population history from
ancient and modern genomes, Nature Reviews Genetics 12 (2011) 603–
614. doi:10.1038/nrg3029.
[6] M. A. Beaumont, B. Rannala, The Bayesian revolution in genetics, Na-
ture Reviews Genetics 5 (2004) 251–261. doi:10.1038/nrg1318.
[7] A. Wollstein, O. Lao, C. R. Becker, S. Bauer, R. J. Trent, et al., De-
mographic history of Oceania inferred from genome-wide data, Current
Biology 20 (2010) 1983–1992. doi:10.1016/j.cub.2010.10.040.
[8] J. Hox, Multilevel Analysis: Techniques and Applications, Lawrence
Erlbaum Associates, New Jersey, 2002.
[9] M. Arenas, J. S. Lopes, M. A. Beaumont, D. Posada, Codabc:
A computational framework to coestimate recombination, substitu-
tion, and molecular adaptation rates by approximate Bayesian com-
putation, Molecular Biology and Evolution 32(4) (2015) 1109–1112.
doi:10.1093/molbev/msu411.
[10] W. Haak, O. Balanovsky, J. J. Sanchez, S. Koshel, V. Zaporozhchenko,
et al., Ancient DNA from European Early Neolithic farmers reveals
their Near Eastern affinities, PLoS Biology 8(11) (2010) e1000536.
doi:10.1371/journal.pbio.1000536.

[11] G. Hellenthal, A. Auton, D. Falush, Inferring human colonization his-
tory using a copying model, PLoS Genetics 4(5) (2008) e1000078.
doi:10.1371/journal.pgen.1000078.
[12] D. Reich, R. E. Green, M. Kircher, J. Krause, N. Patterson, et al.,
Genetic history of an archaic hominin group from Denisova Cave in
Siberia, Nature 468 (2010) 1053–1060. doi:10.1038/nature09710.
[13] M. K. Kuhner, Coalescent genealogy samplers: windows into popu-
lation history, Trends in Ecology and Evolution 24(2) (2009) 86–93.
doi:10.1016/j.tree.2008.09.007.
[14] S. Hoban, G. Bertorelle, O. E. Gaggiotti, Computer simulations: tools
for population and evolutionary genetics, Nature Reviews Genetics 13
(2012) 110–122. doi:10.1038/nrg3130.
[15] S. Neuenschwander, F. Hospital, F. Guillaume, J. Goudet, quantinemo:
an individual-based program to simulate quantitative traits with explicit
genetic architecture in a dynamic metapopulation, Bioinformatics 24
(2008) 1552–1553. doi:10.1093/bioinformatics/btn219.
[16] B. W. Lambert, J. D. Terwilliger, K. M. Weiss, Forsim: a
tool for exploring the genetic architecture of complex traits
with controlled truth, Bioinformatics 24 (2008) 1821–1822.
doi:10.1093/bioinformatics/btn317.
[17] B. Peng, M. Kimmel, simupop: a forward-time population ge-
netics simulation environment, Bioinformatics 21 (2005) 3686–3687.
doi:10.1093/bioinformatics/bti584.
[18] A. Bobrowski, M. Kimmel, An operator semigroup in mathematical ge-
netics, Springer, 2015.
[19] A. Bobrowski, T. Wojdyla, M. Kimmel, Asymptotic behavior of
a Moran model with mutations, drift and recombination among
multiple loci, Journal of Mathematical Biology 61 (2010) 455–473.
doi:10.1007/s00285-009-0308-1.
[20] R. N. Gutenkunst, R. D. Hernandez, S. H. Williamson, C. D. Busta-
mante, Inferring the joint demographic history of multiple populations
from multidimensional SNP frequency data, PLoS Genetics 5(10) (2009)
e1000695. doi:10.1371/journal.pgen.1000695.
[21] A. J. Drummond, A. Rambaut, BEAST: Bayesian evolutionary analysis
by sampling trees, BMC Evol Biol 7 (2007) 214–214. doi:10.1186/1471-
2148-7-214.
[22] R. Nielsen, J. Wakeley, Distinguishing migration from isolation: a
Markov chain Monte Carlo approach, Genetics 158 (2001) 885–896.
[23] J. Hey, R. Nielsen, Integration within the Felsenstein equation
for improved Markov chain Monte Carlo methods in population
genetics, Proc. Natl. Acad. Sci. U. S. A 104 (2007) 2785–2790.
doi:10.1073/pnas.0611164104.
[24] R. Nielsen, M. J. Hubisz, I. Hellmann, D. Torgerson, et al., Darwinian
and demographic forces affecting human protein coding genes, Genome
Res 19(5) (2009) 838–849. doi:10.1101/gr.088336.108.
[25] A. M. Adams, R. R. Hudson, Maximum-likelihood estimation of
demographic parameters using the frequency spectrum of unlinked
single-nucleotide polymorphisms, Genetics 168(3) (2004) 1699–1712.
doi:10.1534/genetics.104.030171.
[26] M. K. Kuhner, LAMARC 2.0: maximum likelihood and bayesian es-
timation of population parameters, Bioinformatics 22 (2006) 768–770.
doi:10.1093/bioinformatics/btk051.
[27] P. Beerli, J. Felsenstein, Maximum-likelihood estimation of effective
population numbers in two populations using a coalescent approach,
Genetics 152 (1999) 763–773.
[28] C. Becquet, M. Przeworski, A new approach to estimate parameters
of speciation models with application to apes, Genome Res 17 (2007)
1505–1519. doi:10.1101/gr.6409707.
[29] S. F. Schaffner, C. Foo, S. Gabriel, D. Reich, et al., Calibrating a co-
alescent simulation of human genome sequence variation, Genome Res
15 (2005) 1576–1583. doi:10.1101/gr.3709305.
[30] H. Li, R. Durbin, Inference of human population history from
individual whole-genome sequences, Nature 475 (2011) 493–496.
doi:10.1038/nature10231.
[31] G. Laval, L. Excoffier, SIMCOAL 2.0: a program to simulate ge-


nomic diversity over large recombining regions in a subdivided pop-
ulation with a complex history, Bioinformatics 20 (2004) 2485–2487.
doi:10.1093/bioinformatics/bth264.

[32] L. Excoffier, M. Foll, fastsimcoal: a continuous-time coales-


cent simulator of genomic diversity under arbitrarily complex
evolutionary scenarios, Bioinformatics 27(9) (2011) 1332–1334.
doi:10.1093/bioinformatics/btr124.

[33] L. Excoffier, I. Dupanloup, E. Huerta-Sanchez, V. C. Souza, et al., Ro-


bust demographic inference from genomic and SNP data, PLoS Genetics
9(10) (2013) e1003905. doi:10.1371/journal.pgen.1003905.

[34] A. Bobrowski, M. Kimmel, O. Arino, R. Chakraborty, A semigroup rep-


resentation and asymmetric behavior of certain statistics of the Fisher-
Wright-Moran coalescent, Handbook of Statistics 19 (2001) 215–242.

[35] G. Grimmett, D. Stirzaker, Probability and RandomProcesses, 3rd edi-


tion, Oxford University Press, New York, 2001.

[36] J. Kingman, The coalescent, Stochastic Processes and Their Applica-


tions 13 (1982) 235–248. doi:10.1016/0304-4149(82)90011-4.

[37] A. Pazy, Semigroups of linear operators and applications to partial
differential equations, Springer, New York, 1983.

[38] Z. Gajic, M. T. J. Qureshi, Lyapunov matrix equation in system
stability and control, Academic Press, San Diego, 1995.

[39] C. W. Gear, Numerical Initial Value Problems in Ordinary Differential
Equations, Prentice-Hall, 1971.

[40] G. E. Forsythe, M. A. Malcolm, C. B. Moler, Computer methods for
mathematical computations, Prentice-Hall, 1977.

[41] W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, Numer-
ical recipes in C - the art of scientific computing, Cambridge University
Press, 1988.

[42] J. R. Cash, A. H. Karp, A variable order Runge-Kutta method for initial
value problems with rapidly varying right-hand sides, ACM Transactions
on Mathematical Software 16(3) (1990) 201–222.

[43] R. C. Lewontin, The interaction of selection and linkage. General
considerations; heterotic models, Genetics 49(1) (1964) 49–67.

[44] E. M. S. Belle, et al., Comparing models on the genealogical
relationships among Neandertal, Cro-Magnoid and modern Europeans
by serial coalescent simulations, Heredity 102 (2009) 218–225.
doi:10.1038/hdy.2008.103.

[45] P. Skoglund, et al., Origins and genetic legacy of Neolithic farmers
and hunter-gatherers in Europe, Science 336 (2012) 466–469.
doi:10.1126/science.1216304.

[46] B. V. Gornung, Iz predystorii obrazovaniia obshcheslavianskogo
iazykovogo edinstva, Izd-vo Akademii nauk SSSR, Moskva, 1963.

[47] A. Borzyskowski, The Slavic ethnogenesis,
http://www.andrzejb.net/slavic/.

[48] Nostratic group, Indo-European chronology,
http://indoeuro.bizland.com/project/chron/chron1.html.

[49] C. McEvedy, J. Woodcock, The New Penguin Atlas of Ancient History,
Penguin, 2002.

[50] R. Ploski, et al., Homogeneity and distinctiveness of Polish paternal
lineages revealed by Y chromosome microsatellite haplotype analysis, Human
Genetics 110 (2002) 592–600. doi:10.1007/s00439-002-0728-0.

[51] E. group, Origins, age, spread and ethnic association of european hap-
logroups and subclades, http://www.eupedia.com/genetics/.

[52] M.-O. Baldia, The Corded Ware / Single Grave Culture,
http://www.comp-archaeology.org/CordedWare.htm (2006).

[53] H. L. Thomas, Archaeology and Indo-European comparative linguistics,
Reconstructing languages and cultures 58 (1992) 281–316.

[54] M. Kremer, Population growth and technological change: One million
B.C. to 1990, The Quarterly Journal of Economics 108(3) (1993) 681–716.
doi:10.2307/2118405.

[55] L. B. Jorde, The genetic structure of subdivided human populations: a
review; in: Current developments in anthropological genetics. Volume
1: Theory and methods (James H. Mielke and Michael H. Crawford,
eds), Plenum Press, New York, 1980.

[56] L. Nunney, The influence of mating system and overlapping generations
on effective population size, Evolution 47(5) (1993) 1329–1341.
doi:10.2307/2410151.

[57] A. Perez-Lezaun, et al., Population genetics of Y-chromosome short
tandem repeats in humans, Journal of Molecular Evolution 45(3) (1997)
265–270.

[58] H. Harpending, A. Rogers, Genetic perspectives on human origins and
differentiation, Annual Review of Genomics and Human Genetics 1 (2000)
361–385. doi:10.1146/annurev.genom.1.1.361.

[59] M. Slatkin, A measure of population subdivision based on microsatellite
allele frequencies, Genetics 139 (1995) 457–462.

[60] M. Kayser, et al., Evaluation of Y-chromosomal STRs: a multicenter
study, International Journal of Legal Medicine 110 (1997) 125–129.

[61] B. Institute of Legal Medicine, Humboldt-University, European
Y-chromosome microsatellite data, http://www.ystr.org.

[62] L. Excoffier, G. Laval, S. Schneider, Arlequin, an integrated
software package for population genetics data analysis,
http://cmpg.unibe.ch/software/arlequin3.

[63] S. J. Goodman, RST Calc: a collection of computer programs for
calculating estimates of genetic differentiation from microsatellite data
and determining their significance, Molecular Ecology 6 (1997) 881–885.
doi:10.1111/j.1365-294X.1997.tb00143.x.

[64] M. Kimmel, R. Chakraborty, Measures of variation at DNA repeat loci
under a General Stepwise Mutation Model, Theoretical Population Bi-
ology 50(3) (1996) 345–367. doi:10.1006/tpbi.1996.0035.

[65] M. Kimmel, et al., Signatures of population expansion in microsatellite
repeat data, Genetics 148 (1998) 1921–1930.

[66] B. Li, M. Kimmel, Factors influencing ascertainment bias of
microsatellite allele sizes: impact on estimates of mutation rates,
Genetics 195 (2013) 563–572. doi:10.1534/genetics.113.154161.

[67] A. Bobrowski, M. Kimmel, Asymptotic behavior of joint distributions of
characteristics of a pair of randomly chosen individuals in discrete-time
Fisher-Wright models with mutations and drift, Theoretical Population
Biology 66(4) (2004) 355–367.

[68] M. M. Desai, J. B. Plotkin, Detecting directional selection from the
polymorphism frequency spectrum, arXiv preprint arXiv:0707.2428.

[69] M. Kimmel, J. Polanska, A model of dynamics of mutation, genetic
drift and recombination in DNA-repeat genetic loci, Archives of Control
Sciences 9(XVL) (1999) 143–157.

[70] B. Bramanti, et al., Genetic discontinuity between local
hunter-gatherers and Central Europe’s first farmers, Science 326 (2009)
137–140. doi:10.1126/science.1176869.

[71] W. Haak, et al., Ancient DNA from the first European farm-
ers in 7500-year-old Neolithic sites, Science 310 (2005) 1016–1018.
doi:10.1126/science.1118725.

[72] N. Howell, I. Kubacka, D. A. Mackey, How rapidly does the human
mitochondrial genome evolve?, Am J Hum Genet 59 (1996) 501–509.

[73] J. A. Nelder, R. Mead, A simplex method for function minimization,
The Computer Journal 7(4) (1965) 308–313. doi:10.1093/comjnl/7.4.308.

[74] Y. L. Chan, C. N. Anderson, E. A. Hadly, Bayesian estimation of the
timing and severity of a population bottleneck from ancient DNA, PLoS
Genetics 2 (2006) e0020059. doi:10.1371/journal.pgen.0020059.
