written by NIKKI BREAUX

Ancestral Origins
New Software Looks Even Further Into the Past
Many preconceptions about individuals start with the assumption that a person’s ethnicity can be judged simply by looking at his or her physical traits. Eyes, skin, even hair all yield clues about the origin of a person’s family. But in the lab of Dr. Serafim Batzoglou, assistant professor in the Department of Computer Science, ancestral analysis goes much deeper. Dr. Batzoglou and graduate students Eugene Fratkin, Andreas Sundquist and Chuong Do have developed software that can pinpoint the ethnicity of ancestors from ten or even twenty generations ago by looking at the historical information buried in their descendants’ DNA. How was this level of precision achieved, and what can it tell us about our ancestry?

Clues Hidden In the DNA
The idea of developing software capable of identifying the contribution of different ethnicities in a multi-ethnic, or “admixed,” individual was conceived as part of Dr. Batzoglou’s consulting work with 23andMe, a Bay Area genetic analysis company. Dr. Batzoglou explains: “The problem of determining the ancestral population for each location of an individual’s chromosomes appealed to us as scientifically interesting, and one in which we could contribute computational methods that are substantial improvements over the state of the art.”

The new software was named “HAPAA,” a play on the term “Hapa” used in Hawaii to describe a person of mixed ethnic origin. Race and ethnicity have been used throughout history as defining, if not divisive, characteristics; conflicts between ethnic groups have triggered many of the worst battles in human history. But the concept of racial identity is far from simple.

One of the most visible clues to racial identity, skin color, is more closely related to geographic distance from the equator than to shared ancestry. Racial groups are, in truth, extremely similar at the genetic level. Research has shown that, on average, only 0.1% of DNA will differ between any two individuals. Of this 0.1%, only 10-15% varies between groups of different ethnic origin; most of the remaining genetic diversity contributes to variety within a population of shared ethnic origin. Included in that 10-15% are genes containing small mutations that accumulate over generations.

A Haploblock Mosaic
HAPAA was designed to analyze patterns of change in sequences of DNA from different populations over long periods of time. Random single nucleotide mutations, called “SNPs,” appear frequently, scattered throughout human chromosomes. If the mutations are not harmful, they may escape the ravages of natural selection and be

A map of human migrations based on mitochondrial DNA, with dates denoting time before present. Populations are described by mtDNA haplogroups: African: L, L1, L2, L3; Near Eastern: J, N; Southern European: J, K; General European: H, V; Northern European: T, U, X; Asian: A, B, C, D, E, F, G (note: M is composed of C, D, E, and G); Native American: A, B, C, D, and sometimes X.


inherited by successive generations. Because they differ between many individuals, SNPs function as “markers” for different ancestral histories.

The HAPAA program identifies groups of SNPs found in unique regions of the DNA that are usually inherited together over multiple generations. These collections of multiple SNP markers, called “haploblocks,” are eroded over generational time by mutations and recombination. The haploblocks of a child born in 2000 will look like a genetic mosaic, a fragmented version of the haploblocks of his or her 20th-generation ancestors.

The predictive power of HAPAA relies on “linkage disequilibrium” between the SNPs in each haploblock. Linkage disequilibrium (LD) refers to the likelihood of certain genetic combinations being found in the same individual. For example, consider two genes A and B, which exist as versions A and A’, as well as B and B’. If the two genes sorted independently during mating, then carrying A or A’ would have no bearing on an individual’s probability of carrying B or B’; genes A and B would therefore be considered to be in “linkage equilibrium.” However, genes often sort in linkage disequilibrium, with a disproportionate number of individuals inheriting a specific combination, for example A with B, or A’ with B’.

Other genome analysis software takes advantage of linkage disequilibrium, but HAPAA takes the analysis several steps further. According to Dr. Batzoglou, “previous methods could only take into account LD between neighboring positions in a genotype; our method can utilize LD patterns of arbitrary genomic distances.” This difference expanded the predictive power of the model, allowing HAPAA to look ten generations into the past, while maintaining a low error rate.

Triangular representation of the average West African, European and Indigenous American admixture proportions of different populations.

Taking Apart the Mosaic
Identifying a haploblock in an individual is the first step in inferring ancestry. This haploblock would most likely be found interspersed with bits of other haploblocks due to various recombination events. Those small fragments of the original haploblock, however, are enough to backtrack to the ancestral DNA sequence. Before taking on the challenge of looking at current populations, however, Dr. Batzoglou and his lab had to train HAPAA to identify the fragments of the ancestral haploblock and retrace the lineage that produced the current admixture.

As a starting point for the development of the HAPAA program, Dr. Batzoglou drew upon the DNA sequences found in HapMap, a database of genetic variants found in different individuals. A total of 210 DNA sequences, representing nearly unrelated individuals of Northwestern European, West African and East Asian ancestry, were drawn from this database and used as the “training set” for HAPAA. A smaller group of these individuals was withheld for testing of the finalized program.

In order to build greater predictive power into HAPAA, graduate students simulated successive mating over 20 generations of individuals drawn at random from the
training set. The model assumed that, as in real life, there was a small degree of recombination and genetic drift (a random skew in gene frequencies that happens over time). By knowing the ancestries of every fragment in the mosaic pattern of the resultant haploblocks in each generation, HAPAA could be trained to infer the patterns of inheritance across multiple generations.

After many rounds of training, HAPAA was ready for its final testing. Dr. Batzoglou and his students created simulated individuals from the test set withheld from training. These individuals were the result of multiple rounds of simulated mating, and each had only a single ancestor from one ethnic group, while all other ancestors were drawn from a different ethnic group. The challenge was for HAPAA to identify the location of the contributions of the two different ethnicities. The results of these tests indicated that HAPAA could determine with high accuracy the contribution of a single ancestor within the previous ten generations. The prediction was more accurate when the two populations sampled were assumed to be highly divergent. The more genetically similar the two populations were, the more difficult it became to detect the contribution of the single ancestor. However, in populations separated by a large amount of time, the contribution of a 20th-generation ancestor could be detected.

Mapping Human
The Batzoglou lab is currently working to make HAPAA available to the general public. Although HAPAA can currently distinguish contributions from an ancestor 20 generations prior to his or her descendant, its predictive power is limited to the ethnic groups used in the training set. Dr. Batzoglou would like to see further improvements, “making it more effective on fine-grained distinctions between highly similar populations” such that the program could distinguish between as many as 50 different populations on Earth. He can also foresee wide-reaching applications of the software. As he explains, “imagine several thousand individuals from different neighboring populations having their ancestry painted in their chromosomes using HAPAA. The statistics of the ancestral blocks could tell us about the history, migration and admixture patterns of the nearby populations in the past several hundred years.”

Admixed America
How closely does the American population resemble these ancestral populations? This question has been posed and answered through several lines of research. In one study led by Dr. Shriver, Associate Professor of Anthropology and Genetics at Pennsylvania State University, a sample population in Chicago was analyzed using an older genetic analysis method. Dr. Shriver’s results determined that admixture was very common, especially among the African-American and Hispanic populations. The average subject who self-described as “white” had 0.7% African ancestry, while the average African American had 15-18% European genetic ancestry. A different team of scientists, led by Cerda-Flores of the Universidad Nacional Autónoma de México, found that the average Hispanic American had 3-8% African ancestry. These studies, however, were only able to detect the ethnicity of ancestors from the previous seven generations, which strongly reflected the forced migrations of slave populations to the Americas. HAPAA has the power to look at trace ancestral history over 20 generations, reaching back before America was colonized by Europeans.

Nikki Breaux is a PhD student researching cancer biology. In her free time, she enjoys reading, cycling and California’s weather.

To Learn More
For more information, visit Dr. Batzoglou’s departmental website at http://ai.stanford.edu/~serafim/. Additional literature and discussion about race and genetics can be found at

volume VII 41