Professional Documents
Culture Documents
Abstract
Background We tackle the problem of estimating species TMRCAs (Time to Most Recent Common Ancestor),
given a genome sequence from each species and a large known phylogenetic tree with a known structure (typi-
cally from one of the species). The number of transitions at each site from the first sequence to the other is assumed
to be Poisson distributed, and only the parity of the number of transitions is observed. The detailed phylogenetic tree
contains information about the transition rates in each site. We use this formulation to develop and analyze multiple
estimators of the species’ TMRCA. To test our methods, we use mtDNA substitution statistics from the well-established
Phylotree as a baseline for data simulation such that the substitution rate per site mimics the real-world observed
rates.
Results We evaluate our methods using simulated data and compare them to the Bayesian optimizing software
BEAST2, showing that our proposed estimators are accurate for a wide range of TMRCAs and significantly outperform
BEAST2. We then apply the proposed estimators on Neanderthal, Denisovan, and Chimpanzee mtDNA genomes
to better estimate their TMRCA with modern humans and find that their TMRCA is substantially later, compared to val-
ues cited recently in the literature.
Conclusions Our methods utilize the transition statistics from the entire known human mtDNA phylogenetic tree
(Phylotree), eliminating the requirement to reconstruct a tree encompassing the specific sequences of interest.
Moreover, they demonstrate notable improvement in both running speed and accuracy compared to BEAST2, par-
ticularly for earlier TMRCAs like the human-Chimpanzee split. Our results date the human – Neanderthal TMRCA to be
∼ 408, 000 years ago, considerably later than values cited in other recent studies.
Keywords Divergence times, Time to most recent common ancestor (TMRCA), Mitochondrial DNA (mtDNA), BEAST2,
Transition rates, Poisson parity, Ancient humans, Neanderthal, Denisovan, Chimpanzee
Background
Dating species divergence has been studied extensively
for the last few decades using approaches based on
genetics, archaeological findings, and radiocarbon dat-
ing [1, 2]. Finding accurate timing is crucial in analyz-
*Correspondence: ing morphological and molecular changes in the DNA,
Saharon Rosset in demographic research, and in dating key fossils. One
saharon@tauex.tau.ac.il
1
Department of Statistics and Operations Research, School approach for estimating the divergence times is based on
of Mathematical Sciences, Tel-Aviv University, Tel‑Aviv 6997801, Israel the molecular clock hypothesis [3, 4] which states that
the rate of evolutionary change of any specified protein
© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecom-
mons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Levinstein Hallak and Rosset BMC Genomic Data (2024) 25:4 Page 2 of 11
is approximately constant over time and different line- a broad set of unknown parameters averaging the error
ages. Subsequently, statistical inference can be applied to on all of them (the tree structure, the timing of every
a given phylogenetic tree to infer the dating of each node node, the per-site substitution rates, etc.). Subsequently,
up to calibration. the resources they require for finding a locally optimal
Our work focuses on this estimation problem and pro- instantiation of the tree and dating all its nodes can be
poses new statistical methods to date the TMRCA in a very high in terms of memory and computational com-
coalescent tree of two species given a detailed phyloge- plexity. Consequently, the number of sequences they can
netic tree for one of the species with the same transition consider simultaneously is highly limited. Thus, unlike
rates per site. Our work does not detect introgression previous solutions, we utilize transition statistics from all
events, and in cases of introgression [5] should be used available sequences, in the form of a previously built phy-
alongside methods for introgression detection (e.g. [6]). logenetic tree.
We note that dating the TMRCA in a coalescent tree is We develop a novel approach to simulate realistic data
different than finding the population tree divergence to test our proposed solutions. To do so, we employ Phy-
time. A discussion regarding the differences between the lotree [14, 15] – a complete, highly detailed, constantly
two is available in [7, 8]. Specifically, the discordance of updated reconstruction of the human mitochondrial
nuclear and mtDNA histories [9] suggests the coalescent DNA phylogenetic tree. In our work we assume all sub-
tree and population tree of humans may have a different stitutions are specified by Phylotree (for an elaborate dis-
topology. cussion regarding the correction of this claim see [16]).
We formulate the problem of dating the TMRCA by We sample transitions of similar statistics to Phylotree
modeling the number of transitions ( A ↔ G, C ↔ T ) and use it to simulate a sequence at a predefined trajec-
in each site using a Poisson process with a different rate tory from Phylotree’s root.
per site; sites containing transversions are neglected due We then empirically test the different estimators on
to their sparsity (indeed, we include sparse transversions simulated data and compare our results to the BEAST2
in the simulations and show that our methods are robust software. Our proposed estimators are calculated sub-
to their occurrences). The phylogenetic tree is used for stantially faster while utilizing the transitions statistics
estimating the transition rates per site. Hence, when con- from all available sequences (Phylotree considers 24,275
sidering two representative sequences, one from each sequences), unlike BEAST2 which can consider only
population, our problem reduces to a binary sequence dozens of sequences due to its complexity. Comparing
where the parity of the number of transitions of each site with the ground truth, we show that BEAST2 slightly
is the relevant statistic from which we can infer the time overestimates the TMRCA, while our estimates pro-
difference between them. vide more accurate results. For larger TMRCAs such as
We can roughly divide the approaches to solving this the human-Chimpanzee, BEAST2 also has a larger vari-
problem into two. The frequentist approach seeks to ance compared to our methods. Finally, we use our esti-
maximize the likelihood of the observed data. Most mators to date the TMRCA (given in kya – kilo-years
notable is the PAML [10] package of programs for phy- ago) of modern humans with Neanderthals, Denisovan
logenetic analyses of DNA and the MEGA software [11]. and Chimpanzee based on their mtDNA. Surprisingly,
Alternatively, the Bayesian approach considers a prior of the TMRCAs we find (human-Neanderthals ∼408 kya,
all the problem’s parameters and maximizes the posterior human-Denisovans ∼841 kya, human-Chimpanzee ∼
distribution of the observations. Leading representatives 5,010 kya) – are considerably later than those accepted
of the Bayesian approach are BEAST2 [12] and MrBayes today.
[13], which are publicly available programs for Bayesian
inference and model choice across a wide range of phylo- Methods
genetic and evolutionary models. Estimation methods
In this work, we developed several distinct estimators First, we describe an idealized reduced mathematical
from frequentist and Bayesian approaches to find the formulation for estimating TMRCAs and our proposed
TMRCA directly. The proposed estimators differ in their solutions. In Estimating ancient TMRCAs using a large
assumptions on the generated data, the approximations modern phylogeny section, we describe the reduction
they make, and their numerical stability. We explain each process in greater detail.
estimator in detail and discuss its properties. Consider the following scenario: we have a set of n
A critical difference between our proposed solutions Poisson rates, denoted as {i }ni=1 where n ∈ N. Let X be a
and existing methods is that we seek to estimate only vector of length n such that each element Xi is indepen-
one specific problem parameter. At the same time, soft- dently distributed as Pois(i ). Similarly, let Y be a vector
ware packages such as BEAST2 and PAML optimize over of length n such that each element Yi is independently
Levinstein Hallak and Rosset BMC Genomic Data (2024) 25:4 Page 3 of 11
distributed as Pois(i · p) for a fixed unknown p. We Theorem 1 Denote the Fisher information matrix
denote Z as the coordinate-wise parity of Y , meaning that for the estimation problem above by I ∈ R(n+1,n+1),
Zi = 1 if Yi is odd and Zi = 0 otherwise. Our goal is to where the first n indexes correspond to {i }ni=1 and the
estimate p given X and Z. last index (n + 1) corresponds to p. For clarity denote
. . .
Ip,p = In+1,n+1 , Ii,p = Ii,n+1 , Ip,i = In+1,i . Then:
Remark 1 ∀i � = j, 1 ≤ i, j ≤ n : Ii,j = 0,
1
Ii,i =
4p2
+ 4 p ,
i e i −1
Note that the number of unknown Poisson rate param- Ii,p = Ip,i
4pi
= 4 p ,
eters n in the problem {i }ni=1 grows with the number of e i −1
observations {(Xi , Zi )}ni=1. However, our focus is solely on n
2i
Ip,p = 4 .
estimating p, so additional observations do provide more e ip − 1
4
i=1
information. (3)
Consequently, an unbiased estimator p̂ holds:
Remark 2
n −1
2i
The larger the value of p · i , the less information on p is 2
E (p − p̂) ≥ 4 . (4)
provided in Zi as it approaches a Bernoulli distribution e4i p − 1 + 4p2 i
i=1
with a probability of 0.5. On the other hand, the smaller i
is, the harder it will be to infer i from Xi . As a result, the If ∀i = 1..n : i = , we can further simplify the
problem of estimating p should be easier in settings where expression:
i is high and p is low. e4p − 1 + 4p2
E (p − p̂)2 ≥ . (5)
4n2
The CRB, despite its known looseness in many prob-
Preliminaries
lems, provides insights into the sensitivity of the error
First, we derive the distribution of Zi ; All proofs are pro-
to each parameter. This expression supports our previ-
vided in the Supplementary material (Section 1).
ous observation that the error of an unbiased estima-
tor increases exponentially with mini {i · p}. However,
Lemma 1 Let Y ∼ Pois(�) and Z be the parity of Y.
for constant i · p , the error improves for higher values
Then Z ∼ Ber( 21 (1 − e−2� )).
of i . We now proceed to describe and analyze several
estimators for p.
We use this result to calculate the likelihood and log-
likelihood of p and given X and Z . The likelihood is
given by:
Method 1 ‑ maximum likelihood estimator
n
i Xi 1
L X, � p, � =
� Z; e−i 1 + (−1)Zi e−2i p , (1)
i=1
Xi ! 2 Proposition 1 Following equation 1, the maximum
likelihood estimators p̂, ˆ i hold:
and the log-likelihood is:
n
n
2p̂ˆ i
n
ˆ i
n ˆ i = Xi , Xi = ˆ i + , = 0.
(−1)Zi e2ˆ i p̂ + 1 (−1)Zi e2ˆ i p̂ + 1
l X, � p, � =
� Z; −i + Xi log i + log 1 + (−1)Zi e−2i p + Const. i=1 i=1 i=1
i=1 (6)
(2) Proposition 1 provides n separable equations for maxi-
This result follows immediately from the independence mum likelihood estimation (MLE). Our first estimator
of each coordinate. sweeps over values of p̂ (grid searching in a relevant
area) and then for each i = 1..n finds the optimal ˆ i
numerically. The solution is then selected by choosing
Cramer‑Rao bound
the pair (p̂, {ˆ i }ni=1 ) that maximizes the log-likelihood
We begin our analysis by computing the Cramer-Rao
calculated using equation 2.
bound (CRB; [17, 18]). In Comparative study on raw sim-
The obtained MLE equations are solvable, yet, finding
ulations section, we compare the CRB to the error of the
the MLE still requires solving n numerical equations,
estimators.
which might be time-consuming. More importantly,
MLE estimation is statistically problematic when the
Levinstein Hallak and Rosset BMC Genomic Data (2024) 25:4 Page 4 of 11
Method 3 ‑ Gamma distributed Poisson rates As the Phylotree database is based on tens of thousands
The Bayesian statistics approach incorporates prior of sequences, the branches in the tree correspond to rela-
assumptions about the parameters. A common prior for tively short time intervals, making multiple mutations
the rate parameters is the Gamma distribution, which per site unlikely in each branch [22]. However, when
is used in popular Bayesian divergence time estimation considering the mtDNA sequence of other species, the
programs such as MCMCtree [10], BEAST2 [12], and branches in the tree correspond to much longer time
MrBayes [13]. Specifically, we have i ∼ Ŵ(α, β), and for intervals, meaning that many underlying transitions are
p, we use a uniform prior over the positive real line. unobserved. For instance, when comparing two human
sequences that differ in a specific site, Phylotree can
Proposition 3 Let i ∼ Ŵ(α, β), then the maximum a determine whether the trajectory between the sequences
posteriori estimator of p holds: was A → G , A → G → A → G , or A → T → G . How-
ever, when comparing sequences of ancient species, an
Levinstein Hallak and Rosset BMC Genomic Data (2024) 25:4 Page 5 of 11
Fig. 1 We use a large, comprehensive phylogeny, such as Phylotree, and assume its tree topology and branch lengths are known. We also assume
that the phylogeny is detailed enough so that it describes all the substitutions that occurred between its sequences. From this detailed phylogeny,
we extract a list of the number of transitions and transversions that occurred in each site along the phylogeny to get XmtDNA (composed
of the number of transitions that occurred at each site) and a list of phylogeny transversion sites - sites in which at least one transversion occurred
along the phylogeny. We aim to estimate the distance between two sequences that are not necessarily part of the tree and may be much more
distant than branch tree lengths. To do so, we extract a binary vector Z that states for each site whether the sequences are identical ( zi = 0, marked
black) or different ( zi = 1, marked orange). We check in which sites a transversion must have occurred between the two sequences (marked blue)
and remove these sites from XmtDNA and Z , thus shortening these vectors. We do the same for the phylogeny transversion sites
Levinstein Hallak and Rosset BMC Genomic Data (2024) 25:4 Page 6 of 11
to the rate of low activity sites in the mtDNA data. Phylogenetic tree simulations
The other value ( a = 11.87 ) and the probabilities We validated our methods by testing their performance
(0.11, 0.89) were chosen accordingly. To test the in a more realistic scenario of simulating a phylogenetic
robustness of our methods, we have also conducted tree. Our methods take as input the observed transi-
simulations using the celebrated K2P [24] and TN93 tions along Phylotree ( X mtDNA ) using all of its 24,275
[25] substitution models, with rate matrix param- sequences and a binary vector Z denoting the differ-
eters and site scaling extracted from Phylotree’s data. ences between two sequences, which we aim to estimate
Further simulation details appear in the Supplemen- the distance between. We compared our methods to the
tary material, Section 2.1. The comparison results are well-known BEAST2 software [12], which, similarly to
shown in Fig. 2 with the Cramer-Rao bound for ref- other well-established methods (such as MCMCtree [10],
erence. To provide a qualitative comparison, we per- MrBayes [13], etc.) considers sequences along with their
formed a one-sided paired Wilcoxon signed rank test phylogenetic tree to produce time estimations. The soft-
on every pair of models, correcting for multiple com- ware BEAST2 performs Bayesian analysis using MCMC
parisons using the Bonferroni correction. Our results to average over the space of possible trees. However, it is
show that Method 2 has the lowest squared error while limited in its computational capacity, so it cannot handle
Method 1 has the highest squared error, for all distri- a large number of sequences like those in Phylotree. For
butions and substitution models. It is noteworthy that this reason, we used a limited set of diverse sequences,
although Method 3 assumes a Gamma distribution, it including mtDNA genomes of 53 humans [26], the
still performs well even when a model mismatch exists. revised Cambridge Reference Sequence (rCRS) [27], the
Fig. 2 Box-plot of the log squared estimation errors of the three proposed methods for selected values of p, expressed as a percentage of the total
length of Phylotree’s edges (outliers are marked with ∗). The simulations were run 10, 000 times for each value of p. The CRB is shown in black
for reference and the circles represent the log of the mean values which are comparable to the CRB. The experiments were conducted for two
different distributions of and two different substitution models: (Top left) Categorical distribution with two values: ǫ = 0.1with probability η = 0.11
and a = 11.87 with probability 1 − η.(Top right) Gamma distribution with parameters α and β. (Bottom left) Site-scaled K2P substitution model.
(Bottom right) Site-scaled TN93 substitution model
Levinstein Hallak and Rosset BMC Genomic Data (2024) 25:4 Page 7 of 11
root of the human phylogenetic mtDNA tree, termed values of p within the simulated region, and the gap wid-
Reconstructed Sapiens Reference Sequence (RSRS) [28], ens with increasing p. Methods 2 and 3 provide the best
and 10 ancient modern humans [23]. More details about results for the entire range of p. Compared to Methods 2
the parameters used by BEAST2 are available in the Sup- and 3, BEAST2 is less accurate and has a larger variance
plementary material, Section 2.3. To evaluate our meth- for higher values of p. Additionally, BEAST2 has a much
ods, we added a simulated sequence with a predefined longer running time (roughly 1 hour) compared to our
distance from the RSRS. methods (less than a second). BEAST2 simulation pre-
Our aim is to generate a vector that produces a vector sented here was conducted using a fixed tree topology.
X that has a similar distribution to XmtDNA
. The human The results for a simulation without a fixed tree topology
mtDNA tree has 16,569 sites, of which 15,629 have no (running time ∼ 1.5 hours) are presented in Supplemen-
transversions. The MLE of i at each site is the observed tary Fig. 2. Furthermore, to test the effect of additional
number of transitions, XmtDNA,i . However, simulating sequences, we conducted simulations with an additional
as X mtDNA leads to an undercount of transitions because 50 human sequences for selected values of p and a fixed
10,411 sites (67% of the total number of sites considered) tree topology (running time ∼ 3 hours). Supplementary
had no transitions along the tree and their Poisson rate is Fig. 3 presents a comparison of estimation errors for var-
taken to be zero. To mitigate this issue, the rates for these ious BEAST2 estimators compared to Method 2. Finally,
sites were chosen to be ǫ, the value that minimizes the to examine the effect of removing transversion sites, we
Kolmogorov-Smirnov statistic [29, 30] (details are pro- conducted simulations with different ti/tv ratios show-
vided in the Supplementary material, Section 2.2). ing that decreasing the ti/tv ratio does not result in bias
The results are presented in Fig. 3. Similarly to Fig. 2, (Supplementary Fig. 4).
Method 1 has a larger error than Methods 2 and 3 for
Fig. 3 Comparison of our methods with BEAST2 estimator using simulated data. The right plot shows a zoom-in view of the left plot, focusing
on values of p between 0 and 20%. Each point in the plot represents the average of 5 runs, while the shaded regions indicate the range
of estimations obtained. We note that in order to compare to our methods we only present here point estimates for BEAST2 (posterior means)
and not the HPD of the full posterior distribution
Levinstein Hallak and Rosset BMC Genomic Data (2024) 25:4 Page 8 of 11
Fig. 4 We used the following schema to obtain estimates for the time distance between real-world sequences: We first extracted XmtDNA
from Phylotree. Then, we determined the binary vector Z that denotes the differences between the RSRS and the sequence in consideration.
We removed from both XmtDNA and Z the phylogeny transversion sites and sequences transversion sites as explained in Fig. 1. We applied our
estimation methods using XmtDNA and Z as input to get uncalibrated p and calibrated p as explained in Calibration section. Finally, the TMRCA
of the examined sequence and humans is given by TMRCA = 21 (TRSRS + TSequence + pcalibrated )
Real data results average TMRCA with all human mtDNA sequences is to
As the final step of our experiments, we apply our meth- use the RSRS instead. Moreover, we can use the MRCA of
ods to real-world data to determine the TMRCA of the Neanderthals instead of using all Neanderthal sequences
modern human and Neanderthal, Denisovan, and chim- and the same for any other population with a confidently
panzee mtDNA genomes. A schema of our estimation known MRCA, resulting in one estimate instead of mul-
process is provided in Fig. 4. Table 1 displays the uncali- tiple pairwise estimates. This strategy involves combining
brated distances between modern humans and each different possibly noisy TMRCAs (one for each popula-
sequence, compared to the estimates from BEAST2. The tion and one for the distance between the populations).
presented TMRCA represents an average of the TMRCA The estimates from real-world sequences presented in
obtained from 55 modern human mtDNA sequences Table 1 are consistent with those obtained for the simu-
of diverse origins [26]. Table 2 presents the TMRCA in lated dataset in Phylogenetic tree simulations section. For
kya (kilo-years ago) of the modern human and each low values of p, our three methods all produce similar
sequence. We note that if some of the modern human estimates while BEAST2’s has a slightly higher estimate.
sequences were very close to one another, a weighted For the human-Chimpanzee uncalibrated distance, which
average would be more appropriate, considering the is relatively high, Method 1 provides a higher estimate
proximity of sequences and giving sequences closer to than that obtained by Methods 2 and 3, and BEAST2
one another a smaller weight. As the sequences in our provides a substantially higher estimate. The results
case are from diverse origins, a uniform average is a good in Table 2 show the TMRCA estimates, which are sig-
approximation. An alternative strategy to calculating the nificantly smaller for our methods than those obtained