Human Immunology 82 (2021) 746–757
Contents lists available at ScienceDirect
www.ashi-hla.org
journal homepage: www.elsevier.com/locate/humimm
Research article
HLA haplotype frequency estimation for heterogeneous populations using a graph-based imputation algorithm
Sapir Israeli a, Loren Gragert b,c,d, Martin Maiers b,c, Yoram Louzoun e,⇑
a Department of Computer Science, Bar Ilan University, Ramat Gan, Israel
b Center for Blood and Marrow Transplant Research, Minneapolis, MN, USA
c National Marrow Donor Program/Be The Match, Minneapolis, MN, USA
d Department of Pathology, Tulane University, New Orleans, LA, USA
e Department of Mathematics, Bar Ilan University, Ramat Gan, Israel
Article info
Article history: Received 16 September 2020; Revised 25 June 2021; Accepted 5 July 2021; Available online 26 July 2021
Keywords: Multi-region expectation-maximization algorithm; HLA; Haplotype frequencies

Abstract
HLA haplotype frequencies are estimated from ambiguous unphased HLA genotyping data using Expectation-Maximization (EM) algorithms. Current population genetics methods require independent EM frequency estimates for each population, and assume that each population is in Hardy-Weinberg Equilibrium (HWE). The HWE assumption of EM has thus far resulted in the exclusion of individuals from mixed or unknown ethnic backgrounds from reference datasets. Multi-region populations are currently poorly served by stem cell donor registry HLA imputation and matching implementations due to the inability of such algorithms to incorporate admixture into their population genetics models. To address this unmet need, we have expanded the imputation component of our GRaph IMputation and Matching (GRIMM) framework, where imputation becomes the expectation step in an iterative EM algorithm. Our novel multi-region EM implementation considers region as a Bayesian prior, enabling integration of HLA information from multiple single-region population groups, and for the first time including individuals with ambiguous or mixed ethnic backgrounds. We show that our multi-region EM produces much higher likelihood values and better haplotype recovery as measured by Kullback-Leibler divergence than all evaluated EM implementations when tested on real datasets of US donor registry HLA typings as well as simulated multi-region datasets of ambiguous HLA typings.
© 2021 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.

1. Introduction

HLA genes located in the major histocompatibility complex (MHC) on chromosome 6 are highly polymorphic [1], with a large number of alleles and haplotypes with differing frequencies among human populations [2]. Stem cell donor registry HLA typing has been performed by a variety of assays that are not always able to precisely identify the alleles. Earlier low-resolution DNA-based methods can yield thousands of potential genotypes, while newer next-generation sequencing (NGS) technology has reduced this ambiguity significantly [2,3].
Identifying high-resolution HLA-matched donors when searching registries for a donor for hematopoietic stem cell transplantation (HSCT) relies on imputation algorithms to provide donor-recipient match predictions from ambiguous typing data [4]. The population-specific reference HLA haplotype frequency distributions used by imputation and matching algorithms are estimated by expectation–maximization (EM) algorithms [5]. Because the decision of which ambiguously-typed registry donors to select for further confirmatory typing (CT) is strongly informed by match algorithm predictions, any improvements in HLA reference datasets have the potential to translate into the identification of more and better matches for patients. While current US reference datasets include 21 detailed single-race categories and 5 rollup race/ethnic categories, individuals with multi-region and unknown region identification were excluded from haplotype frequency estimation. Thus high-resolution HLA prediction performance for single-region individuals is better than what we observe for multi-region individuals, representing an underappreciated barrier in access to HSCT for multi-region patients seeking a well-matched donor.
⇑ Corresponding author.
E-mail address: louzouy@math.biu.ac.il (Y. Louzoun).
https://doi.org/10.1016/j.humimm.2021.07.001
0198-8859/© 2021 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.
The population categories (further denoted here as regions) used by EM and registry match algorithms are defined by grouping
together responses to a self-identified race/ethnicity (SIRE) questionnaire captured on the donor recruitment form [6,7]. While useful for HLA match predictions, these SIRE categories have several notable limitations. These categories change over time and are not scientifically determined categories. For example, the concepts of race and ethnicity, while they are separated now, were treated as interchangeable in the past. The US Census categories, which government contractors are required to adhere to, refer to skin color (e.g. White, Black). Even if an improved set of SIRE categories were used, this would not avoid the fact that they are partially a social construct and can change over time for a given individual. The true ancestral population source of each inherited HLA haplotype may not be known by individuals, especially if the population admixture happened generations back. Complicating the process of self-reporting, individuals can select as many categories as they see fit, with some SIRE categories indicating admixture implicitly (e.g. Hispanic, African American). Self-identification in multiple race/ethnic categories often over-states or under-states the degree of admixture compared to where it occurs in an individual’s family tree [8]. Similar challenges with the US registry SIRE exist in all organizations that register donors and cord blood units for transplantation. The WMDA maintains a listing [19] of donors and cord blood units from 98 organizations in 54 different countries, most of whom have little or no SIRE information and the categories used are often incompatible between countries. Some European countries restrict or even ban the collection of this data [9]. Despite these challenges of missing and imperfect SIRE, population-specific haplotype frequency data remains a fundamental and essential component of all bone marrow registry matching systems [10].
In this study, we developed a novel EM approach that treats population membership as a hidden variable. We use this approach to build up haplotype frequency estimates from a dataset of HLA typings that include all individuals, regardless of which region categories were selected. We have extended the imputation capability of the Graph Imputation and Matching (GRIMM) framework for HLA genotype imputation and donor-patient matching using computationally efficient graph traversal [4]. The first extension was to run GRIMM imputation iteratively. In the expectation-step, the haplotypes best explaining each donor are imputed, and then in the maximization-step, these haplotypes are combined to produce new expected haplotype frequencies. The second extension was to run EM in multiple stages, first computing haplotype frequency estimates treating all individuals as if they came from a single population, then using the initial frequency estimate to seed a Multi-Region (MR) EM stage. We here present a comparison of haplotype frequency estimates from this multi-region EM algorithm against estimates from conventional single-population EM algorithms. First, we compared our single-population haplotype frequency results to those of conventional single-population EM algorithms, finding that our algorithm outperforms existing algorithms. Then we compared our single-population result to our MR EM stage results, finding that the MR EM approach achieves superior performance on both real and simulated datasets of HLA typing from mixed populations.

2. Methods

2.1. HLA typing data represented in Genotype List String format

DNA-based HLA typing data has been captured and stored in the NMDP database using heterogeneous methods (primarily sequence-specific oligonucleotide probe (SSOP), and Sanger/NGS), with every HLA typing result precisely represented using the genotype list string format described by Milius et al. [11].

2.2. Generating expectation for haplotype pair probabilities

The expectation component of the EM algorithm involves enumerating all possible multilocus haplotype pairs given an ambiguous HLA typing, then assigning a probability to each haplotype pair. In the first step, the imputation algorithm from GRIMM enumerates all possible phases (labeling all potential diplotypes for a given genotype). Within each phase, there may be allelic ambiguity, which is expanded in a second step to generate a list of haplotype pairs based on all possible alleles. The probability of each haplotype pair is the product of the frequency of the two haplotypes divided by the sum of the products for all possible haplotype pairs.
The reference HLA haplotype frequencies used for generating simulated typings in this study were at the “high-resolution” level, which is defined as combining alleles with identical amino acid sequences encoded by the antigen recognition domain. In our simulations, HLA typing was represented in the genotype list string as lists of possible two-field alleles in IMGT/HLA nomenclature. High resolution was utilized because most typing in the NMDP registry database was performed by targeted HLA genotyping methods that did not systematically sequence all HLA exons.
The NMDP EM and GRIMM algorithms as written could feasibly utilize typing represented at two-field level, which would distinguish every unique HLA amino acid sequence across all exons as its own allele. To perform two-field imputation and haplotype frequency estimation, the input typing data would consider any two-field alleles that were not distinguished by the typing method as ambiguities, regardless of antigen recognition domain equivalency. Haplotype frequency estimation at full-field resolution can be similarly accomplished by generating genotype lists with every full-field IMGT/HLA allele not distinguished by the typing listed as ambiguities in the genotype list string.
In imputation algorithms, the frequency of each haplotype in the list is looked up in a dataset of reference population haplotype frequencies. When estimating haplotype frequencies using EM algorithms, the haplotype frequency distribution must be initialized as a first step using the input HLA typings, then frequencies are updated at each iteration during the maximization step using imputation. EM algorithms are thus iterative imputation algorithms that start with a bootstrap initialization of haplotype frequencies.
We have implemented imputation using a graph database framework, GRIMM. In the graph, each HLA allele and haplotype is represented as a node, then haplotype pair probabilities are pre-loaded for each sub-population. The graph provides rapid access to find the full-locus haplotype of the given partial haplotype. Then, a Cartesian product is performed on all haplotype pairs consistent with the HLA typing to find the most probable haplotype pairs.

2.3. Missing haplotypes in GRIMM

To impute every individual’s genotype list, GRIMM is divided into 2 main components, called “Existing Haplotypes Solution” (EHS) and “Missing Haplotypes Solution” (MHS). EHS applies when there is at least one haplotype pair in the distribution of population haplotype frequencies consistent with the subject input. In the absence of any haplotype pairs consistent with the HLA typing, MHS is used. MHS solves 2 cases: 1) genotype lists that cannot be explained by any haplotype pairs in the reference distribution, and 2) genotypes with alleles that are missing in the reference data. MHS is required for this EM implementation because the enumeration of haplotype pairs is incomplete during initialization. To make the initialization of the haplotype frequency distribution more tractable, HLA typings encoded with allelic ambiguity are
excluded unless there are fewer than 100 possible haplotype pairs, as described in Section 2.5. At later stages, a cutoff on the cumulative frequency of haplotype pairs limits the number of haplotype pairs for individuals who are included.
MHS1: If no extended haplotype pairs in the frequencies fit the observed typing, but at least one allele per multilocus genotype is found in the initialized frequencies, we attempt to recombine existing alleles and haplotypes to generate new extended haplotypes that are consistent with the typing. We first hierarchically split the haplotype, based on locus structure, into blocks consisting of fewer loci than the extended haplotype and estimate the frequency of newly generated extended haplotypes by multiplying the frequencies of each haplotype block. For example, here, we first split into class I and class II, and then we split class I to A and BC, etc. We then use the solution requiring a minimal number of recombinations. To incorporate the probability of observing a new recombination event, the resulting extended haplotype frequency computed based on multiplying haplotype blocks is adjusted by a factor_splitting of 0.0001. Note that in this case, all haplotype pairs of this donor are multiplied by factor_splitting, and finally the probability assigned to the recombination cancels out following the normalization of all solutions for this donor to 1.
MHS2: If an allele was missing in the initialization of the frequency data, during MHS2 we generated a new haplotype (FH) composed of the most probable combination of existing alleles at the other loci (UFH) and the missing allele. Newly-generated haplotypes not found in less ambiguous typings should be rare, thus we assign their frequency to be significantly lower than observed haplotypes. Thus, we multiplied the probability of UFH by a factor_missing_data (λ) of 0.00001 raised to the power of the number of missing alleles (Eq. (1)). The value of factor_missing_data is less significant since the haplotype pairs of each individual are normalized, but it must be less than 1 (see Table 1 for list of symbols).

\[ P(FH) = P(UFH) \cdot \lambda^{miss_h} \tag{1} \]

Table 1
Glossary of terms used in equations.
Symbol | Description
H_n | The nth haplotype.
HR_n | The nth haplotype-region combination.
N_H | Number of haplotypes.
N_HR | Number of haplotype-region combinations.
N_p | Number of populations.
G(H_i, H_j) | The genotype composed from haplotypes i and j.
G(HR_i, HR_j) | The genotype composed from haplotype-regions i and j.
t | Iteration number.
prior(r_i, r_j | r1, r2) | A sample’s prior for regions i and j when the SIRE are r1 and r2.
hap_x^{r_i} | Haplotype x in region i.
F_i | Sum over all haplotypes in population i.
FH | Full haplotype: haplotype with all X loci.
UFH | Partial haplotype: haplotype with m loci, where m < X.
λ | Factor for alleles missing in the data.
miss_h | Number of alleles that appear in a specific haplotype but not in the database.
input size | Number of subjects in the input.

2.4. Multi-region (MR) GRIMM and priors

The imputation algorithm used in the expectation step of EM provides a probability distribution for the HLA haplotype-region pairs consistent with an ambiguous HLA typing and the reference population HLA haplotype frequency distributions.
Each individual in the dataset can have from zero to many region labels, usually from self-identification during donor recruitment. To incorporate knowledge of identified region, we used a set of simple possibilities, taking into account that still a large fraction
Fig. 1. Illustration of the contribution of imputation parameter values to priors for region combinations. Each sample matrix is the prior coefficients for all region pairs, when the dataset contains information about 3 regions: AAFA, NAMER and CARB. (A) The upper matrix shows the calculation for a MR sample (regions = AAFA and NAMER), and the bottom matrix shows the calculation for a single-region sample (region = AAFA). (B) The left column shows examples of matrices for a MR sample (regions = AAFA and NAMER) with the different priors. The right column shows examples of matrices for a Single Region (SR) sample (region = AAFA). A colored cell represents a prior with the value 1, and a white cell represents a 0 prior. A value of 1 indicates combinations where the prior is incorporated. η: Use all possible imputation output irrespective of region. α: Only use imputation haplotype pairs with input multi-region combination. β: Allow for all single-region populations. γ: Impute possible combinations of haplotype pairs with input region. δ: Pick single-region imputation output of either of the input regions.
Panel (A), Priors coefficient = 0.2, 0, 0.3, 0.4, 0.1. Rows and columns are AAFA, NAMER, CARB.
Region1 = AAFA, Region2 = NAMER:
AAFA | 0+0.3+0.4+0.1 | 0.2+0+0.4 | 0+0.4
NAMER | 0.2+0+0.4 | 0+0.3+0.4+0.1 | 0+0.4
CARB | 0+0.4 | 0+0.4 | 0+0.3
Region1 = AAFA, Region2 = AAFA:
AAFA | 0.2+0+0.3+0.4+0.1 | 0+0.4 | 0+0.4
NAMER | 0+0.4 | 0+0.3 | 0
CARB | 0+0.4 | 0 | 0+0.3
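The prior coefficients illustrated in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration of Eq. (2), followed by the weighting and normalization of Eqs. (3) and (4), not the GRIMM implementation. The function names `region_pair_prior` and `prior_matrix` are hypothetical, and the parameter values (η = 0, α = 0.2, β = 0.3, γ = 0.4, δ = 0.1) are our reading of the cell sums in panel (A), which is an assumption.

```python
from itertools import product


def region_pair_prior(ri, rj, r1, r2, eta, alpha, beta, gamma, delta):
    """Unnormalized prior weight for region pair (ri, rj) given SIRE (r1, r2), following Eq. (2)."""
    w = eta  # eta applies to every region pair
    if (ri == r1 and rj == r2) or (ri == r2 and rj == r1):
        w += alpha  # pair equals the reported SIRE combination
    if ri == rj:
        w += beta  # both haplotypes from the same region
    if ri in (r1, r2) or rj in (r1, r2):
        w += gamma  # at least one haplotype from a reported region
    if (ri == r1 and rj == r1) or (ri == r2 and rj == r2):
        w += delta  # both haplotypes from one of the reported regions
    return w


def prior_matrix(regions, r1, r2, freq_sums=None, **coef):
    """Weight each pair by F_i * F_j (Eq. (3)) and normalize over all pairs (Eq. (4))."""
    if freq_sums is None:
        # Illustrative default (assumption): every region has frequency sum 1.
        freq_sums = {r: 1.0 for r in regions}
    raw = {
        (ri, rj): region_pair_prior(ri, rj, r1, r2, **coef) * freq_sums[ri] * freq_sums[rj]
        for ri, rj in product(regions, repeat=2)
    }
    total = sum(raw.values())
    return {pair: w / total for pair, w in raw.items()}


# Assumed reading of the panel (A) coefficients.
coef = dict(eta=0.0, alpha=0.2, beta=0.3, gamma=0.4, delta=0.1)
regions = ["AAFA", "NAMER", "CARB"]
mr_prior = prior_matrix(regions, "AAFA", "NAMER", **coef)  # multi-region sample
sr_prior = prior_matrix(regions, "AAFA", "AAFA", **coef)   # single-region sample
```

With these values the unnormalized weights reproduce the cell sums shown in panel (A), for example β + γ + δ for (AAFA, AAFA) and α + γ for (AAFA, NAMER) in the multi-region sample. Keeping the weight function separate from the normalization makes it easy to swap in the per-region frequency sums from the current EM iteration.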
of the population is associated with a single SIRE. We assumed that we either know the SIRE or can choose with equal probabilities from all other SIRE for one or two of the donors’ haplotypes. We utilize up to 5 different parameter values that, when combined, contribute prior weightings for each possible population combination (r_i, r_j), which depend on the SIRE (r1, r2). These parameter values are set in advance.
The 5 parameters that contribute to region combination prior weightings (Fig. 1) are defined as follows:

1. α: Only use imputation haplotype pairs with a given SIRE combination.
2. β: Allow for all single-region populations.
3. γ: Impute possible combinations of haplotype pairs with region from the given SIRE.
4. δ: Pick single-region imputation haplotype pairs of either of the given SIRE.
5. η: Use all possible imputation haplotype pairs irrespective of the given SIRE.

The formula for computing prior weightings for region categories based on a combination of the 5 parameters is provided in the following equation, where r1, r2 are the self-identified regions, and r_i, r_j are region combinations:

\[
\begin{aligned}
\mathrm{prior}(r_i, r_j \mid r1, r2) = {} & \eta + \alpha \cdot \mathbf{1}\big[(r_i = r1 \text{ and } r_j = r2) \text{ or } (r_i = r2 \text{ and } r_j = r1)\big] \\
& + \beta \cdot \mathbf{1}\big[r_i = r_j\big] \\
& + \gamma \cdot \mathbf{1}\big[r_i = r1 \text{ or } r_j = r2 \text{ or } r_j = r1 \text{ or } r_i = r2\big] \\
& + \delta \cdot \mathbf{1}\big[(r_i = r1 \text{ and } r_j = r1) \text{ or } (r_i = r2 \text{ and } r_j = r2)\big]
\end{aligned}
\tag{2}
\]

Each prior(r_i, r_j | r1, r2) was multiplied by the frequency sums F of all haplotypes in populations i and j (Eq. (3)), then a matrix normalization was performed (Eq. (4)).

\[ \mathrm{prior}(r_i, r_j \mid r1, r2) = \mathrm{prior}(r_i, r_j \mid r1, r2) \cdot F_i \cdot F_j \tag{3} \]

\[ \mathrm{prior}(r_i, r_j \mid r1, r2) = \frac{\mathrm{prior}(r_i, r_j \mid r1, r2)}{\sum_{i}^{N_p} \sum_{j}^{N_p} \mathrm{prior}(r_i, r_j \mid r1, r2)} \tag{4} \]

Multiple region labels are provided when the individual is multi-region. The SIRE provides information about the region labels of the genotype but not of the haplotypes, therefore the prior is used for haplotype pairs and not just for a single haplotype. Thus, in the haplotype product probabilities calculation, the haplotype frequencies were multiplied by the priors for each region combination. The same typing with different SIRE may yield different outcomes. The list of region categories provided by the individual contributes some weighting, but not 100% weighting, to the calculation of these priors. Basically, we consider mainly the SIRE but also other region combinations. Since the region identification is imperfect, the prior weighting formula allows for haplotype pairs from any region combination to be taken into account.

\[ P(hap_x^{r_i}, hap_y^{r_j} \mid r1, r2) = \mathrm{prior}(r_i, r_j \mid r1, r2) \cdot p(hap_x^{r_i}) \cdot p(hap_y^{r_j}) \tag{5} \]

To find the optimal values for the parameters, we checked the parameter combinations maximizing the similarity between the imputed genotypes and the genotypes determined by real high-resolution typing by applying a Nelder-Mead optimization on the fraction of the probability in the imputed typing assigned to the real typing. The optimal results are: α = δ = 0.5, β = γ = 1e7, η = 0.

2.5. Multi-region EM algorithm

Each iteration of the EM algorithm consists of expectation (E) and maximization (M) steps. In the expectation step, we use the estimated population-specific haplotype frequency distributions to estimate the probabilities of each haplotype pair for each individual. In the maximization step, we accumulate the haplotype pair probabilities across each individual as computed in the E-step to update the population-level haplotype frequencies. Iterations continue until the EM algorithm converges. In the multi-region stage of the EM algorithm, there are multiple populations, so the prior probabilities for each haplotype’s membership in a population are also updated at each iteration.

2.5.1. Stage 1 – Single-region EM stage including all samples

To begin the EM algorithm, initial frequencies must be assigned to each possible HLA haplotype in the estimated distribution. Given the HLA typing for each individual, we enumerated each possible haplotype pair for each individual from phases that can be interpreted up to a certain level of allelic ambiguity of init_cutoff possible haplotype pairs. We used an init_cutoff of 100 since in all simulations almost all phases were interpreted to fewer than 100 options or to more than 10,000,000 (see Supplementary Material). Thus, the 100 cutoff provides appropriate results in terms of computational cost (time and space) and performance. In the case that all or most of the typings are interpreted to more than 100 haplotype pairs, init_cutoff should be changed accordingly. This parameter controls the initialization, thus different values can lead to different results. Other common options for EM haplotype frequency initialization include random initial conditions or equal probabilities for each possible haplotype pair, but this initialization is often sub-optimal. We ran iterations of the EM algorithm until either convergence based on the change in log-likelihood and frequencies or until a maximum number of iterations was reached (currently 50 according to our simulations). Convergence was defined as a likelihood change in one iteration of less than 1/(input size) per sample (i.e. the total log-likelihood divided by the number of samples), and the absence of haplotypes in the new step with a frequency above 1/(input size). The EM iterations include the E-step as described in Sections 2.2 and 2.3: first, we created a graph from the frequencies list estimated from the last iteration and then imputed the haplotype pairs of each individual. In the M-step we calculated the probability of each haplotype, based on the haplotype pairs found in the E-step. The sum of haplotype frequencies is normalized to 1, and those frequencies form the new frequencies list. (The details of the E and M steps are in Fig. 2, upper left box.)

2.5.2. Stage 2 – multi-region population EM stage

For the initialization, the estimated haplotype frequency distribution from the single-region EM in stage 1 is assigned to each region label defined in the sample. Then we ran the E and M steps until convergence, with the same stopping conditions as the single-region stage, with the difference that the convergence criterion is computed across all frequency distributions. In the E step, we created a graph from the frequencies list estimated from the last iteration as in single-region (SR) EM, but here the list contains haplotypes from several regions, and then imputed the possible haplotype-region pairs of each individual. The difference between the multi-region and single-region EM stages is that the multi-region EM incorporates priors for each region combination into the expectation step (Section 2.4) so that it does not impute all possible combinations but only the most probable combinations, and thus reduces the number of potential haplotype pairs. In the M step, we calculated the probability of each haplotype in each region group based on the haplotype-region pairs found in the E-
step. The sum of the haplotype frequencies in each region is normalized to 1. The frequency sums of all regions before the normalization are used in the next iteration of the imputation (Fig. 2, bottom left box).

2.6. EM on very large samples

Our EM algorithm is set up to always include all individuals when computing haplotype frequencies. GRIMM works online and imputes one person at a time; thus, the memory usage of the EM is the haplotype list. If the HLA typing data input contains a large number of subjects and/or highly ambiguous typings, the number of possible unique haplotypes may be too large to handle in memory. If in any iteration too many haplotypes are obtained, only a fixed number of haplotypes with the highest frequency are retained for the next iteration, thus new haplotypes may be added back in the M-step of the next iteration. In the current implementation, the maximum number of haplotypes that we allow in memory is 600 million. When the maximum number of haplotypes is breached, the 100 million most frequent haplotypes are selected, and new haplotypes are allowed to accumulate as more HLA typings are processed until the maximum is reached again. In the datasets we evaluated for this study, we did not reach this limit; however, this threshold was implemented to avoid potential memory crashes.

2.7. Alternative single-region EM algorithms for comparison

Currently published EM algorithm implementations estimate haplotype frequencies using a population genetic model where all individuals are from the same population. Our multi-region EM is the only implementation capable of estimating haplotype frequencies for each sub-population when individuals are from a heterogeneous collection of populations.
We compared the performance of our multi-region EM to 3 popular implementations:

Hapl-o-Mat [12] is open-source software for HLA haplotype frequency estimation from ambiguous typing data at the registry scale. Hapl-o-Mat handles large population samples with an arbitrary number of loci but has some limitations on the degree of typing ambiguity.
NMDP EM – The EM pipeline used for generating NMDP registry haplotype frequencies used by HapLogic and HaploStats starts with a greedy algorithm that reduces the list of the possible haplotypes to only those required to explain every subject. The EM algorithm builds up haplotypes in 2-locus blocks followed by trimming improbable genotypes based on imputation output before the next locus is added. The source code for the NMDP EM is available at https://github.com/nmdp-bioinformatics/nmdp-em-5locus/.
Haplo.stats [13] is an R package containing an EM algorithm. Instead of using all pairs of haplotypes that are consistent with the subject, it progressively inserts batches of loci into haplotypes of increasing length, runs the EM steps, computes posterior probabilities of haplotype pairs for each subject, and trims off pairs of haplotypes per subject when the posterior probability of the pair is below a specified threshold. The algorithm continues these insertion, EM, and trimming steps until all loci are inserted into the haplotype. This software has a limitation in that it cannot handle allelic ambiguity, only phase ambiguity, limiting its application to input datasets where only one pair of alleles is possible at each locus.

We also note several other published EM algorithms that have limitations that preclude practical use in running ambiguous HLA typing at the registry scale such that they failed to run the datasets we developed to assess algorithm performance. Pypop [14] (Python for Population Genomics) estimates haplotype frequencies using an EM algorithm and performs several other population genetic analyses on multilocus genotype and allele frequency data, but similarly to Haplo.stats, it cannot handle allelic ambiguity, only phase ambiguity. Stephens et al. [15] proposed an EM algorithm, PHASE, using Bayesian methods to use a priori expectations to inform haplotype reconstruction, but it has a limitation of 105 possible haplotypes and the algorithm could not run our evaluation datasets that exceeded this limit.

2.8. Simulated and real registry HLA typing datasets for EM performance evaluation

Simulated datasets of HLA typing were generated by sampling haplotypes from previously published frequency estimates, then hiding haplotype phase information and introducing typing ambiguity.
The first simulated dataset “Freq_CAU” sampled haplotypes from the US White Caucasian dataset that contained 266,309 unique haplotypes. 10,000 high-resolution HLA typings were simulated. Only the 3000 most common haplotypes were sampled. We first simulated 3000 typings where each haplotype appears exactly twice (the combination is random), to ensure every haplotype was included in the typing data. The remaining 7000 typings were random combinations of the 3000 haplotypes. This simulation was the least difficult for EM estimation as all HLA typings were complete high-resolution for all 5 loci and each of the 3000 sampled haplotypes appeared at least twice in the dataset.
We also used NMDP registry haplotype frequencies to simulate heterogeneous collections of HLA typings from 2 different population combinations, including some individuals where one haplotype was sampled from one population and the other haplotype sampled from the second population. Each simulation contains 10,000 individuals, of which 9000 were single-region and 1000 multi-region. We also tested subsets of 3000 individuals from each simulation (Table 2).
We also studied one real registry dataset of stem cell donors with their recruitment HLA typing before high-resolution confirmatory typing (CT) was performed, consisting of 55,419 single-region and multi-region individuals from among 28 region categories (Table 2). To measure the impact of the sample size on EM performance, we randomly subsetted the CT dataset to generate samples of 1000, 3000, and 10,000 donors. These datasets are called High res CT-1000, High res CT-3000, High res CT-10000, and the complete High res CT (Table 2).
We designed the EM evaluation datasets to have varying levels of complexity from easy (1) to hard (4):

1. Full 5-locus genotypes from a single population without allele ambiguity (Freq_CAU).
2. Ambiguous full 5-locus genotypes (AAFA-CARB).
3. Ambiguous genotypes with missing typing at some loci, where most typings can be split into less than a million possible haplotype pairs (High res CT).
4. Ambiguous genotypes with missing typing at some loci, where most typings have more than a million possible haplotype pairs (AAFA-NAMER, FILII-NAMER, MENAFC-NAMER).

2.9. Measures of EM performance

Two measures were used to compare the different EM methods:
Kullback–Leibler (KL) divergence – measures the haplotype frequency difference between the true sampled haplotypes used
Table 2
HLA Typing Dataset Region Groups (size) Size

a
Freq_CAU Caucasian (10,000) 10,000
AAFA-CARB African American (6,000), Black Caribbean (3,000)b 10,000
AAFA-CARB-3000 African American (1,800), Black Caribbean (880)b 3,000
AAFA-NAMER African American (6,000), North American White (3,000)b 10,000
AAFA-NAMER-3000 African American (1782), North American White (923)b 3,000
FILII-NAMER Filipino (6000), North American White (3000)b 10,000
FILII-NAMER-3000 Filipino (1,783), North American White(898)b 3,000
MENAFC-NAMER MidEast/No. Coast of Africa (6000), North American White (3000)b 10,000
MENAFC-NAMER-3000 MidEast/No. Coast of Africa (1843), North American White (865)b 3,000
High res CT-1000 Caucasian (470), North American White (250), Unknown (48), Hispanic or Latino (42), African American (23), Mexican or 1,000
Chicano (18), South/Cntrl Amer. Hisp. (13), Other Southeast Asian (11), Filipino (8), MidEast/No. Coast of Africa (7), Chinese (7),
Korean (6), North American Indian (6), Caribbean Hispanic (6), African American’s broad group (5), Other (5), Vietnamese (3),
Asian and Pacific Islander (2), Native Americans (2), Hawaiian or other Pacific Islander (1), Black Caribbean (1), Japanese (1)c
Chicano (62), Other Southeast Asian (37), MidEast/No. Coast of Africa (30), Other (25), Caribbean Hispanic (24), South/Cntrl
Amer. Hisp. (21), African American’s broad group (19), Chinese (16), Korean (14), North American Indian (13), Filipino (10),
Asian and Pacific Islander (9), Black Caribbean (9), Native Americans (7), Japanese (7), Vietnamese (6), African (5), South Asian
(3), Hawaiian or other Pacific Islander (2), South Central American Black (1)c
Chicano (178), Other Southeast Asian (113), Caribbean Hispanic (107), South/Cntrl Amer. Hisp. (102), MidEast/No. Coast of
Africa (79), Chinese (73), North American Indian (72), African American’s broad group (57), Korean (54), Other (54), Asian and
Pacific Islander (46), Filipino (39), Vietnamese (32), Black Caribbean (25), Native Americans (24), South Asian (13), Japanese
(11), Hawaiian or other Pacific Islander (9), African (7), South Central American Black (3)c
The complete High res CT Caucasian (25827), North American White (13224), Unknown (2881), Hispanic or Latino (2043), African American (1710), 55,419
Mexican or Chicano (943), Caribbean Hispanic (622), Other Southeast Asian (579), South/Cntrl Amer. Hisp. (576), MidEast/No.
Coast of Africa (460), Chinese (397), Other (327), North American Indian (320), African American’s broad group (293), Korean
(263), Asian and Pacific Islander (230), Filipino (223), Vietnamese (171), Native Americans (133), Black Caribbean (127),
Japanese (113), African (89), South Asian (84), Hawaiian or other Pacific Islander (53), South Central American Black (18),
Alaska Native American (6), Caribbean Indian (3)c
Simulation details. The 'Regions' column shows the regions in each simulation and the number of SR subjects from each region. The 'Size' column shows the total number of subjects, SR and MR, in the simulation.
a Freq-CAU Simulated Typing Ambiguity – Complete high-resolution typing, no missing loci.
b Multi-Region Simulated Typing Ambiguity – Mixture of first-field DNA-based allele group typing or NMDP "XX" allele codes (20% of typings), SSO intermediate resolution typing (40% of typings), single-pass SBT (30% of typings), and high-resolution typing (10% of typings), with missing data at some HLA loci: HLA-C (80% missing), HLA-DRB1 (10% missing), HLA-DQB1 (90% missing).
c CT Typing Ambiguity – TBD.
to simulate input HLA typing and the estimated haplotype frequency distribution output from EM. A KL divergence of 0 indicates that the two distributions are identical. When a sampled haplotype did not appear in the EM estimates or vice versa, we introduced the missing haplotype at low frequency using a prior for each haplotype of 1/(4n), where n is the number of different haplotypes in the real population simulation. After adding the missing haplotypes, each list had exactly M haplotypes. We then normalized the sum of all probabilities to 1.

$$\mathrm{KL} = \sum_{i=1}^{M} \frac{P_i}{\sum_{j=1}^{M} P_j} \log\left(\frac{P_i / \sum_{j=1}^{M} P_j}{Q_i / \sum_{j=1}^{M} Q_j}\right), \qquad (6)$$

where $P_i = P_{simulation}(h_i) + \frac{1}{4n}$ and $Q_i = P_{em}(h_i) + \frac{1}{4n}$.
Log-likelihood: For each individual, we calculated the log of the sum of the probabilities of all possible genotypes of that individual and computed the average log-likelihood across all individuals in the dataset.

$$\log(L) = \frac{\sum_{k=1}^{K} \log\left(\sum_{i=1}^{N_k} P_k(G_i)\right)}{K}, \qquad (7)$$

where K is the number of individuals, and $N_k$ is the number of genotypes resulting for person k.
The expected values lie between the log-likelihood after the seeding iteration and 0, where a value closer to 0 indicates better performance.

3. Results

3.1. Multi-region EM outline

Building upon our imputation framework GRIMM [4], we developed an iterative EM algorithm to estimate haplotype frequencies for heterogeneous population samples (see Fig. 2 and Methods for the EM outline). GRIMM has the advantage of efficiently utilizing partial haplotypes to impute subjects whose haplotypes did not appear in the reference. The EM pipeline has 2 stages. First, a first-pass haplotype frequency estimate incorporates all subjects as a single population. Then, a haplotype frequency distribution is estimated for each subpopulation label (Fig. 2). Within each stage, the EM algorithm resolves both phase and allele ambiguity. At the E step, each subject is associated with the most probable haplotype pairs (and, if needed, region labels), and at the M step, the haplotype frequencies are accumulated across all subjects.

3.2. Comparing EM performance on simulated datasets

To measure the accuracy of multi-region EM, we compared its performance in recovering the true sampled haplotypes, as measured by Kullback-Leibler (KL) divergence, to existing algorithms using simulated and real datasets of NMDP registry HLA typings. The
SR EM:
1. Seeding frequencies.
2. Apply the following steps Maximum_number_of_iterations or until converging:
2.1. Generate a new graph database from frequencies list.
2.2. Call MR GRIMM imputation, but without relation to the population.
2.3. Calculate the probabilities:
2.3.1. Normalize the genotypes frequency sum of each person to 1.
2.3.2. For each calculate .
2.3.3. Normalize the all haplotypes sum to 1.
2.3.4. Update frequencies list.

MR EM:
1. Seeding frequencies.
2. Apply the following steps Maximum_number_of_iterations or until converging:
2.1. Generate a new graph database from frequencies list.
2.2. Call MR GRIMM imputation.
2.3. Calculate the probabilities:
2.3.1. Normalize the genotypes sum of each person to 1.
2.3.2. For each calculate .
2.3.3. Normalize the all haplotypes sum to 1.
2.3.4. Compute to be a sum over all haplotypes in population .
2.3.5. Save for the imputation.
2.3.6. Normalize to 1 for each .
2.3.7. Update frequencies list.

[Schematic diagram: SR stage (seeding frequencies; generating a new graph database; MR GRIMM imputation without relation to the population; calculate the probabilities), followed by MR stage (seeding frequencies by SR; generating a new graph database; MR GRIMM imputation; calculate the probabilities).]
Fig. 2. Pseudocode description and schematic diagram of single-region and multi-region stages of graph-based EM algorithm. The EM algorithm is divided into two stages. At the first stage, the haplotype frequencies are estimated for the entire general population sample. At the second stage, the frequencies of each haplotype are estimated in subpopulations considering region as a prior. Haplotype frequencies are estimated at each stage using EM until one of the convergence criteria is fulfilled. $P(G_{H_n,H_j} \mid k)$ represents the product of the probabilities of $H_n$ and $H_j$ given that the genotype $(H_n, H_j)$ is consistent with the typing of donor k.
simulations had different population sizes, number of populations, and typing resolution (i.e. levels of typing ambiguity). Since all other existing algorithms are inherently single-region implementations, we compared output from our stage 1 single-region EM output. Multi-region EM performance was compared only to our single-region EM.

3.3. Convergence of single-region EM

To evaluate the performance of the single-region EM, we ran the entire sample assuming every individual belonged to the same population. A single run was performed (as our EM is not stochastic and does not require random initial conditions).
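As an illustration of the single-region stage outlined in Fig. 2, the following minimal sketch runs a toy EM over haplotype frequencies. It is not the GRIMM implementation: the graph database and imputation call are replaced by a precomputed list of candidate haplotype pairs per subject, and the function name `single_region_em`, its parameters, and the convergence tolerance are assumptions made for this example only. Following the Fig. 2 caption, a pair is weighted by the product of its two haplotype probabilities.

```python
from collections import defaultdict


def single_region_em(subjects, seed_freqs, max_iter=20, tol=1e-6):
    """Toy single-region EM (hypothetical helper, not the GRIMM code).

    subjects:   list with one entry per subject; each entry is a list of
                candidate (hap_a, hap_b) pairs consistent with the typing.
    seed_freqs: dict mapping haplotype -> seeding frequency.
    """
    freqs = dict(seed_freqs)
    for _ in range(max_iter):
        counts = defaultdict(float)
        for pairs in subjects:
            # E step: weight each candidate pair by the product of its
            # haplotype frequencies, then normalize per subject to 1.
            weights = [freqs.get(a, 0.0) * freqs.get(b, 0.0) for a, b in pairs]
            total = sum(weights)
            if total == 0.0:
                continue  # no candidate pair is explained by the current list
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        norm = sum(counts.values())
        if norm == 0.0:
            break  # nothing could be imputed; keep the previous frequencies
        # M step: accumulate across subjects and normalize all haplotypes to 1.
        new_freqs = {h: c / norm for h, c in counts.items()}
        haps = set(freqs) | set(new_freqs)
        change = sum(abs(new_freqs.get(h, 0.0) - freqs.get(h, 0.0)) for h in haps)
        freqs = new_freqs
        if change < tol:
            break  # frequencies stopped changing between iterations
    return freqs
```

In this sketch the summed absolute change in frequencies between iterations plays the role of the convergence measure shown in Fig. 3B.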
Fig. 3. Single-region EM performance measures. (A-D) SR GRIMME convergence of 3 simulations: AAFA-CARB, AAFA-NAMER-3000, High res CT-10000. (A) Normalized Log-
likelihood per donor across iterations. (B) Change in haplotype frequencies between iteration t and iteration t + 1. (C) The ratio between the number of true sampled
haplotypes for simulation and estimated haplotypes. (D) The KL divergence from the computed haplotype frequency to the real one. (E-F) comparison between frequencies
estimated via SR GRIMME, Hapl-o-Mat and NMDP EM, on all 13 simulations (E) shows the difference between the negative log-likelihood. (F) Shows the difference between
the KL.
We used four measures to illustrate convergence of the single-region EM:

1. Average Log-likelihood per individual at each iteration (Fig. 3A).
2. Change in the sum of haplotype frequencies between iterations (Fig. 3B).
3. The ratio between the true sampled haplotypes and estimated haplotypes from EM at each iteration (Fig. 3C).
4. The KL divergence between the true sampled haplotype frequency distribution and the estimated haplotype frequency distribution from EM (Fig. 3D).

In most simulations, the single-region EM converged within less than 20 iterations, with little change after the first 4–5 iterations for most of the above measures. Less ambiguous data converges faster. Since most of the data in all the simulations is not very ambiguous, all the simulations converge within a low number of iterations. The log-likelihood, the change in frequencies, and recovered haplotype ratio all converge. The KL divergence slightly increases with additional iterations. The number of estimated haplotypes decreases over time, indicating that some of the true sampled haplotypes are lost from the estimated frequencies from EM. This leads to an increase in the KL divergence given the penalty to non-existing haplotypes in the estimated haplotype frequencies. Many of the components of GRIMM EM (GRIMME) aim at reducing the total number of candidate haplotypes. This leads to higher likelihoods, but slightly worse KL divergence.

3.4. Single-region EM performance on complete high-resolution typed dataset

We first compared the performance of our single-region graph-based EM to several published EM algorithms: Haplo-stats [13], NMDP EM [16], and Hapl-o-Mat [12] on a dataset of complete high-resolution typing. Here the task for EM was only to resolve phase ambiguity. Haplo-stats cannot handle allelic ambiguity, thus its performance was compared only for this "Freq_CAU" simulation, the only dataset it could run. The performance of Haplo-
Table 3
Simulation  Complexity level  Algorithm  Time (s)  KL  I_f  Log-likelihood  Percent of data processed (%)
Freq_CAU 1 GRIMME 284 0.0248 0.907 14.894 100

Hapl-o-Mat 1 0.0263 0.907 14.894 100
NMDP EM 25 0.02 0.906 14.894 100
Haplo.stats 25 0.0285 0.907 14.893 100
AAFA-CARB 2 GRIMME 13,017 0.2785 0.641 14.608 100
Hapl-o-Mat 306 0.3136 0.482 19.152 40
NMDP EM 1837 0.107 0.682 14.796 100
AAFA-CARB-3000 2 GRIMME 4083 0.273 0.492 13.999 100
Hapl-o-Mat 79 0.38 0.359 26.191 39.4
NMDP EM 1581 0.206 0.557 13.949 100
AAFA-NAMER 4 GRIMME 84,044 0.322 0.402 13.67 100
Hapl-o-Mat 7425 1.83 0.024 40.74 6.05
NMDP EM 79,583 0.074 0.487 13.41 100
AAFA-NAMER-3000 4 GRIMME 20,351 0.59 0.17 12.791 100
Hapl-o-Mat 1122 1.62 0.003 27.205 6.26
NMDP EM 16,776 0.116 0.367 12.97 100
FILII-NAMER 4 GRIMME 115,092 0.229 0.476 12.561 100
Hapl-o-Mat 7847 1.646 0.137 35.532 6.24
NMDP EM 33,007 0.055 0.534 17.167 100
FILII-NAMER-3000 4 GRIMME 53,833 0.603 0.28 11.745 100
Hapl-o-Mat 977 1.79 0.09 25.821 6.63
NMDP EM 30,576 0.062 0.427 12.11 100
MENAFC-NAMER 4 GRIMME 69,409 0.258 0.47 12.982 100
Hapl-o-Mat 6978 1.74 0.08 38.257 6.1
NMDP EM 28,194 0.063 0.558 12.656 100
MENAFC-NAMER-3000 4 GRIMME 18,399 0.549 0.256 12.274 100
Hapl-o-Mat 1334 1.85 0.007 34.611 5.73
NMDP EM 20,536 0.079 0.428 12.377 100
High res CT-1000 3 GRIMME 1862 0.629 0.296 11.313 100
Hapl-o-Mat 511 0.848 0.268 15.223 95.6
NMDP EM 1298 0.735 0.239 12.772 100
High res CT-3000 3 GRIMME 3210 0.638 0.463 12.556 100
Hapl-o-Mat 1152 0.77 0.343 15.398 95.67
NMDP EM 1393 0.934 0.225 14.42 100
High res CT-10000 3 GRIMME 18,572 0.58 0.43 13.048 100
Hapl-o-Mat 8534 0.844 0.351 20.257 95.46
NMDP EM 3034 1.08 0.19 15.672 100
The complete High res CT 3 GRIMME 80,278 0.367 0.498 15.972 100
Hapl-o-Mat XXXX XXXX XXXX XXXX XXXX
NMDP EM 9277 1.427 0.119 19.69 100
Single-region graph-based EM performance comparison among Hapl-o-Mat, NMDP EM, and Haplo.stats for each dataset for the measures of runtime, log-likelihood, KL divergence, percentage of data processed according to the limitations of the algorithm, and $I_f$ [18]; $I_f = \sum_i \min(p_i, q_i)$, where P is the true sampled frequency and Q is the estimated frequency. The complexity level (as defined in section 2.8) is mentioned for each dataset. One can clearly see that the value of $I_f$ decreases with the complexity and ambiguity of the simulation.
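The two frequency-comparison measures reported in Table 3 can be sketched as follows. This is a minimal illustration, assuming haplotype frequencies are held in plain dictionaries; the function names `smoothed_kl` and `overlap_index` are hypothetical. The first follows Eq. (6), adding the 1/(4n) prior to every haplotype before normalizing, and the second follows the definition of $I_f$ given in the caption.

```python
import math


def smoothed_kl(p_sim, p_em, n):
    """KL divergence of Eq. (6) between simulated and EM frequencies.

    p_sim, p_em: dicts mapping haplotype -> frequency.
    n: number of different haplotypes in the simulated population.
    """
    prior = 1.0 / (4 * n)
    haps = set(p_sim) | set(p_em)  # both lists end up with the same M haplotypes
    P = {h: p_sim.get(h, 0.0) + prior for h in haps}
    Q = {h: p_em.get(h, 0.0) + prior for h in haps}
    sum_p, sum_q = sum(P.values()), sum(Q.values())
    return sum(
        (P[h] / sum_p) * math.log((P[h] / sum_p) / (Q[h] / sum_q)) for h in haps
    )


def overlap_index(p_sim, p_em):
    """I_f = sum_i min(p_i, q_i); equals 1 for identical normalized distributions."""
    haps = set(p_sim) | set(p_em)
    return sum(min(p_sim.get(h, 0.0), p_em.get(h, 0.0)) for h in haps)
```

Because the prior keeps every $Q_i$ strictly positive, the logarithm stays finite even when a haplotype is missing from one of the two lists.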
stats and all the algorithms that we tested were very close for the Freq_CAU case, where only phase ambiguity must be resolved. In terms of runtime performance, our graph-based EM was much slower than existing algorithms (Table 3). All EM implementations also had similar log-likelihoods for this dataset.

3.5. Graph-based outperforms existing algorithms on more ambiguous heterogeneous datasets of HLA typing

Next, we tested SR GRIMM EM on more complex datasets, where we see a significant advantage over existing algorithms. SR GRIMME and NMDP EM could manage all tested population sizes and any degree of typing ambiguity, in contrast with Haplo-stats and Hapl-o-Mat. In Hapl-o-Mat, typings that split into more than 1,000,000 haplotypes are completely discarded from the analysis, so only a fraction of individuals are used to estimate the haplotype frequencies. Then all samples were imputed, and the likelihood was computed for all samples. In the heterogeneous multi-region simulated datasets (AAFA-NAMER, FILII-NAMER, MENAFC-NAMER), around 94% of the simulated donors could be expanded to more than a million possible HLA genotypes; therefore, Hapl-o-Mat can perform the EM on just 6% of the donors, and this is reflected in the results. For simulations involving less typing ambiguity, such as AAFA-CARB, about 60% of the typings were discarded in Hapl-o-Mat. Finally, in the real CT datasets of NMDP recruitment typings, about 5% of the subjects were discarded (Table 3).
In all cases, the GRIMME and NMDP-EM estimations were closer than Hapl-o-Mat to the true haplotype distributions, as measured by KL divergence. In general, for the more homogenous population datasets (Freq_CAU, AAFA_CARB, AAFA_CARB_3000, AAFA-NAMER, AAFA-NAMER-3000, FILII-NAMER, FILII-NAMER-3000, MENAFC-NAMER, and MENAFC-NAMER-3000), the KL divergence was lower for NMDP EM than SR GRIMME. However, for the more heterogeneous datasets (High res CT-1000, High res CT-3000, High res CT-10000 and the complete High res CT), the SR GRIMME was better able to recover the true haplotypes and had a lower KL divergence. For the heterogeneous population simulations, the log-likelihoods of SR GRIMME were either similar to the NMDP EM or much higher, and always higher than the Hapl-o-Mat algorithm (Fig. 3E, F, Table 3).
The run-time differences between algorithms vary according to the ambiguity of the input HLA typing. In the less ambiguous simulations, the Hapl-o-Mat runtime was significantly lower than GRIMME. In the other simulations, the Hapl-o-Mat runtime per donor was significantly slower than GRIMME per donor, but the total run time was lower since far fewer donors were processed. The GRIMME and NMDP-EM had a similar order of run times.
To summarize, in very simple setups, all algorithms can be used, and Hapl-o-Mat is by far the fastest. In slightly more complex setups, where all loci are typed and ambiguity is limited, NMDP EM and GRIMM perform similarly. As soon as either not all loci are typed or the ambiguity is very large, GRIMME has a higher log-likelihood than all other algorithms.

3.6. Multi-region EM has slightly higher log-likelihood than single-region EM

We next tested the performance of the 2nd stage of the multi-region graph-based EM, where region identification contributed to the calculation of priors that allowed for separate frequency distributions to be calculated for each region label. For all simulations, the multi-region EM allowed the frequency distributions to reach a higher log-likelihood; however, the difference was relatively small (Fig. 4A, B, Table 4). This indicates that recovery of rare and frequent haplotypes was largely achievable with a single-region EM. Many of the NMDP population combinations chosen [5] have some shared ancestry and contain overlapping sets of haplotypes. We also note that reporting of region self-identification at the finer scale is not always consistent with true ancestry [6].
Fig. 4. Multi-region EM performance measures. (A-B) Comparison between the log-likelihood of SR-GRIMME via MR GRIMME. (A) Log-likelihood across iterations for single-
region stage EM followed by multi-region EM stage for simulations: High res CT-10000, AAFA-NAMER-3000, and AAFA-CARB. Discontinuity indicates earlier convergence for
single-region EM before Iteration 13. (B) Bar plot of SR versus MR negative log-likelihood of all 12 simulations. (C) KL Divergence between subpopulation haplotype frequency
distributions in multi-region EM stage. The KL of AAFA population vs NAMER population in simulations: High res CT-10000 and AAFA-NAMER-3000, and The KL of AAFA
population vs CARB population in simulation AAFA-CARB. (D) KL Divergence of subpopulation haplotype frequencies from results of the overall sample from the single-region
EM stage. (E) Percentage of individuals whose true region labels could be recovered by assignment by multi-region EM. (F) The relative proportion of each region from the
general population vs the real part of each region from the general population in each iteration, in simulations AAFA-CARB and AAFA-NAMER-3000. (G) The relative
proportion of each region from the general population vs the real part of each region from the general population in the last iteration of the High res CT-10000 simulation.
Table 4
Simulation SR Log-likelihood MR Log-Likelihood SR time (sec) MR time (sec)

AAFA-CARB 14.608 14.029 13,017 27,540
AAFA-CARB-3000 13.999 13.184 4083 1185
AAFA-NAMER 13.67 12.493 84,044 77,808
AAFA-NAMER-3000 12.791 12.039 20,351 6275
FILII-NAMER 12.561 11.03 115,092 31,524
FILII-NAMER-3000 11.745 10.882 53,833 4827
MENAFC-NAMER 12.982 12.021 69,409 40,997
MENAFC-NAMER-3000 12.274 11.704 18,399 4508
High res CT-1000 11.313 9.01 1862 4943
High res CT-3000 12.556 10.599 3210 11,186
High res CT-10000 13.048 11.176 18,572 52,379
The complete High res CT 15.972 11.028 80,278 330,000
Multi-region graph-based EM performance metrics of log-likelihood and runtime for the single-region and multi-region stages.
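The log-likelihood measure of Eq. (7) can be sketched in a few lines. This is a minimal illustration with a hypothetical function name and input layout: each individual is represented by the list of probabilities of the genotypes consistent with that individual's typing. The result is at most 0 for normalized probabilities; since the captions of Figs. 3 and 4 refer to the negative log-likelihood, the positive tabulated values may correspond to the negated average, which is an assumption here.

```python
import math


def average_log_likelihood(genotype_probs_per_person):
    """Average log-likelihood per individual, following Eq. (7).

    genotype_probs_per_person: list with one entry per individual; each
    entry is a list of probabilities P(G_i) of the genotypes consistent
    with that individual's typing. Individuals whose probabilities sum
    to 0 must be filtered out beforehand, since log(0) is undefined.
    """
    logs = [math.log(sum(probs)) for probs in genotype_probs_per_person]
    return sum(logs) / len(logs)
```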
3.7. Divergence of multi-region EM sub-population estimates from single-region estimate

In the multi-region EM stage, separate frequency distributions are estimated for each subpopulation. We measured the frequency difference between the 2 region categories for the multi-region simulations and also the divergence between each population and the haplotype frequencies that resulted from the single-region stage. Because each subpopulation frequency is the same at initialization, the KL divergence starts at zero. For the simulations where there are exactly two populations, the distance steadily increases until convergence. In the more heterogeneous CT dataset, the distance decreases slightly in the second iteration and then stabilizes (Fig. 4C). When comparing subpopulation frequencies to the results of the single-region stage, frequency distributions for all populations moved away from the initialization frequencies until convergence (Fig. 4D).

3.8. Recovery of true sample region labels from heterogeneous datasets

Divergence in haplotype frequency distributions between subpopulations is entirely driven by the different contributions of region identification to the priors. Because the region labels are not deterministic, the region label assigned by EM may differ from the region label of the sampled haplotype used to simulate the multi-region HLA typings. We measured the fraction of donors for which the most probable estimated region combination precisely matches the SIRE categories (Fig. 4E). We found that for the vast majority of donors (over 97%) the most probable region combination is the SIRE.
To further test the difference between the SIRE and the estimated region, we computed the fraction of haplotype frequency associated with this population compared with the fraction assigned by the SIRE (the normalized $F_i$ from the EM results). The total probability assigned by EM to a population is compared with the fraction of haplotypes that had this population as a SIRE in the true sampled haplotypes. All ratios start at 1 in all simulations. In the last iteration of the High res CT-10000 simulation, the ratios were slightly above 1 for 8 of the populations, of which 5 were among the largest populations. The ratios of 10 populations were below 0.95, and 6 of them were below 0.8. In the other simulations, the ratio deviated only slightly from 1 and remained stable after the first few iterations (Fig. 4F, G).

4. Discussion

Accurate HLA haplotype frequencies are essential for donor HLA imputation in stem cell registries and finding matches for the patients [17]. The historical exclusion of mixed-region and unknown region individuals from samples that generate reference population HLA frequency datasets has been a known limitation of current HLA imputation implementations for the NMDP registry [2,16]. In this study, we investigated a strategy where we used a multi-stage expectation-maximization (EM) algorithm to provide haplotype frequency estimates based on all HLA-typed individuals regardless of region label.
We here present a Multi-Region EM algorithm as an extension of the GRIMM imputation algorithm to compute haplotype frequency estimates for a mixture of genotypes from both single-region and multi-region individuals. This is the first algorithm that can automatically estimate haplotype frequencies for each subpopulation when individuals in the sample have membership in multiple populations.
Overall, the multi-region EM algorithm is composed of two separate stages of haplotype frequency estimation: A) a single-region EM based on all HLA typings, including those with missing loci, using as initial conditions the haplotype frequencies of every individual in the sample with HLA genotypes where all loci were typed and the ambiguity is limited, and B) a multi-region EM using as initial conditions the results of the previous stage. After the EM computes an estimate of haplotype frequencies of all individuals, it can estimate the frequencies of each sub-population. Self-identified region is used as a prior to compute the probability of each population group combination. We find that the multi-region EM algorithm can work with noisy priors, taking into account that the single population estimates from the previous stage may not be accurate. In that way, it can handle many small sub-populations.
The single-region EM algorithm assumes populations are in Hardy-Weinberg equilibrium (HWE); we assume HWE between the haplotypes, as well as between sub-haplotypes. Note that sub-haplotypes are only used when we have no appropriate haplotypes in the frequency list. However, such haplotypes are then created. Thus, in the final iterations of the EM, we only use the solution with full haplotypes, where HWE is indeed assumed. However, this multi-region EM algorithm does not require the overall dataset to be in HWE because in the final stage each haplotype receives an independent Bayesian prior for its population label. HWE testing is typically one of the first steps of HLA population genetic analysis but is not a prerequisite for this approach. The real datasets we tested here are known to be heterogeneous and out of HWE. Accurate haplotype frequency estimates for heterogeneous samples with multi-region individuals are achievable as long as the source populations are captured with adequate depth.
We have implemented the imputation using GRIMM. In GRIMM, one uses a parameter epsilon, and only when the Cartesian product of a pair has probability higher than epsilon do we take it into
account. If no pair was found, we reduce epsilon. Therefore, if the prior for pairs that are not exactly the region or the single region is smaller than epsilon, those pairs will not be taken into account in the first iteration. In the next iteration, if no solution was found, we reduce epsilon and allow for less probable pairs. Therefore, huge differences are obtained in the priors, not because the combinations really differ that much in probability but because GRIMM ranks the combinations that way. Due to the region combination priors, the algorithm can utilize all individuals from the registries, while in previous algorithms SIREs of declines/multi/other/unknown have been excluded. The US sample of 10 million had 37 K declines, 868 K multi, 198 K other, and 30 K unknown; thus >10% of the US donors have thus far been excluded and can now be included in the GRIMM-based analysis. At the end of the EM, each haplotype will be assigned with different probabilities to the existing regions. No new regions are created in the process.
The EM algorithm performance was measured in two different ways, log-likelihood and the Kullback-Leibler divergence. All EM algorithms are designed to maximize log-likelihood, which would explain all individuals in the sample using as few haplotypes as possible, yet remaining unbiased in which haplotypes are selected. In the expectation, the haplotype pairs were detected by one of the two solutions. EHS reduces haplotypic diversity, and de facto when one haplotype is found we apply MHS just to one haplotype. We limit the haplotypic diversity to the one essential to explain the observed haplotypes to ensure convergence. This improves the likelihood and convergence rate. Our single-region EM provided similar or higher log-likelihoods than all other EM algorithms we tested. The addition of multi-region estimation in the final stage provided even higher log-likelihoods, but the MR stage has a very limited impact on the likelihood. Thus, while the division of populations improves imputation, the improvement is often marginal. This suggests that the optimal number of sub-populations should be lower than the number of regions. Kullback-Leibler (KL) divergence can measure EM algorithm performance in recovering sample-generating haplotype distributions for simulated datasets. We found that the KL divergence was also lower for our algorithm compared to the Hapl-o-Mat implementation. For the simulated datasets, we selected pairs of populations with a range of genetic distances. The more distinctive population pairs were AAFA-NAMER and FILII-NAMER. The AAFA-CARB and MENAFC-NAMER populations were far less distant from one another [2]. We find that parallel estimation gives better estimates because Hardy-Weinberg equilibrium applies within the assigned population subsamples at Stage 2 better than it does to the overall sample. The assignment happens at the end of stage 1 and individuals are never reassigned. The joint estimation assumes the overall sample is in HWE. Our results indicate the other two EM models perform relatively poorly in situations of heterogeneous samples because they assume HWE applies to the whole sample.
An important aspect of the current algorithm is the implementation of multiple components to limit the total number of haplotypes, including seeding with only unambiguous typings, limiting the total number of allowed results per typing, and not creating novel haplotypes when a solution with existing haplotypes can be found. All these steps reduce the total number of haplotypes and thereby lead to missing haplotypes. On the other hand, they maximize the likelihood of observed typings and hasten convergence. We propose that limiting haplotypic diversity is advantageous, and enlarging the haplotype list to a large number of putative haplotypes may have limited utility.
We previously described a graph-based imputation and matching algorithm called GRIMM that allows for the rapid estimate of haplotypes best explaining a new input typing using all possible allele combinations consistent with the typing. This approach solves each possible phase of unphased genotypes separately, which extends here to an efficient multi-region implementation. GRIMM can impute highly ambiguous HLA typing data including missing loci. We have extended GRIMM such that it can also impute subjects whose haplotypes do not appear in the reference data. This feature is typically not needed for single-stage EM algorithms, but it is crucial for the ultimate inclusion of all individuals in our multi-stage EM. This EM implementation based on GRIMM also has no practical limit on the number of possible haplotypes or size of the populations.
The main limitation of the current implementation is the computational cost, which is higher than Hapl-o-Mat and similar to the NMDP-EM. The computational cost can be mitigated by rewriting parts of the algorithm in a more efficient programming language. Until then we recommend other EM algorithms continue to be used for datasets without allelic ambiguity since they are faster. This EM algorithm still utilizes existing region labels captured by NMDP, which do not align perfectly with genetic ancestry. In the future, unsupervised clustering based only on the HLA data without region labels could potentially eliminate such limitations.
We anticipate that the improved HLA frequency estimates that will result from this novel EM approach, when coupled with a new imputation algorithm that can assign region population labels to each HLA haplotype of an individual, will improve the ability to identify HLA-matched donors in stem cell registries. Continued improvements in donor-recipient HLA matching will increase access to transplantation for multi-region patients and provide better outcomes for those who receive transplants.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.humimm.2021.07.001.

References

[1] T.M. Williams, Human leukocyte antigen gene polymorphism and the histocompatibility laboratory, J. Mol. Diagn. 3 (3) (2001) 98–104, https://doi.org/10.1016/S1525-1578(10)60658-7.
[2] L. Gragert, A. Madbouly, J. Freeman, M. Maiers, Six-locus high resolution HLA haplotype frequencies derived from mixed-resolution DNA typing for the entire US donor registry, Hum. Immunol. 74 (10) (2013) 1313–1320, https://doi.org/10.1016/j.humimm.2013.06.025.
[3] C. Kollman, M. Maiers, L. Gragert, C. Müller, M. Setterholm, M. Oudshoorn, C.K. Hurley, Estimation of HLA-A, -B, -DRB1 haplotype frequencies using mixed resolution data from a national registry with selective retyping of volunteers, Hum. Immunol. 68 (12) (2007) 950–958, https://doi.org/10.1016/j.humimm.2007.10.009.
[4] M. Maiers et al., GRIMM: GRaph IMputation and matching for HLA genotypes, Bioinformatics 35 (18) (2019) 3520–3523, https://doi.org/10.1093/bioinformatics/btz050.
[5] M. Maiers, L. Gragert, W. Klitz, High-resolution HLA alleles and haplotypes in the United States population, Hum. Immunol. 68 (9) (2007) 779–788, https://doi.org/10.1016/j.humimm.2007.04.005.
[6] N. Risch, E. Burchard, E. Ziv, H. Tang, Categorization of humans in biomedical research: Genes, race and disease, Genome Biol. 3 (7) (2002) 1–12, https://doi.org/10.1186/gb-2002-3-7-comment2007.
[7] J.A. Hollenbach et al., Race, ethnicity and ancestry in unrelated transplant matching for the national marrow donor program: A comparison of multiple forms of self-identification with genetics, PLoS One 10 (8) (2015) e0135960, https://doi.org/10.1371/journal.pone.0135960.
[8] V. Damotte et al., Multiple measures reveal the value of both race and geographic ancestry for self-identification, bioRxiv (2019), https://doi.org/10.1101/701698.
[9] W.R. Brubaker, Immigration, citizenship, and the nation-state in France and Germany: A comparative historical analysis, Int. Sociol. 5 (4) (1990) 379–407, https://doi.org/10.1177/026858090005004003.
[10] W. Bochtler et al., A comparative reference study for the validation of HLA-matching algorithms in the search for allogeneic hematopoietic stem cell donors and cord blood units, HLA 87 (6) (2016), https://doi.org/10.1111/tan.12817.
[11] R.P. Milius et al., Genotype list string: A grammar for describing HLA and KIR genotyping results in a text string, Tissue Antigens 82 (2) (2013) 106–112, https://doi.org/10.1111/tan.12150.
[12] C. Schäfer, A.H. Schmidt, J. Sauter, Hapl-o-Mat: Open-source software for HLA haplotype frequency estimation from ambiguous and heterogeneous data, BMC Bioinformatics 18 (1) (2017) 1–10, https://doi.org/10.1186/s12859-017-1692-y.
[13] J.P. Sinnwell, D.J. Schaid, Haplo Stats statistical methods for haplotypes when linkage phase is ambiguous, Sci. York, 2011.
[14] A.K. Lancaster, R.M. Single, O.D. Solberg, M.P. Nelson, G. Thomson, PyPop update - A software pipeline for large-scale multilocus population genomics, Tissue Antigens 69 (Suppl. 1) (2007) 192–197, https://doi.org/10.1111/j.1399-0039.2006.00769.x.
[15] M. Stephens, N.J. Smith, P. Donnelly, A new statistical method for haplotype reconstruction from population data, Am. J. Hum. Genet. 68 (4) (2001) 978–989, https://doi.org/10.1086/319501.
[16] J. Dehn, M. Setterholm, K. Buck, J. Kempenich, B. Beduhn, L. Gragert, A. Madbouly, S. Fingerson, M. Maiers, HapLogic: A predictive human leukocyte antigen-matching algorithm to enhance rapid identification of the optimal unrelated hematopoietic stem cell sources for transplantation, Biol. Blood Marrow Transplant. 22 (11) (2016) 2038–2046, https://doi.org/10.1016/j.bbmt.2016.07.022.
[17] J. Sauter, U.V. Solloch, A.S. Giani, J.A. Hofmann, A.H. Schmidt, Simulation shows that HLA-matched stem cell donors can remain unidentified in donor searches, Sci. Rep. 6 (2016) 1–9, https://doi.org/10.1038/srep21149.
[18] L. Excoffier, M. Slatkin, Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population, Mol. Biol. Evol. 12 (5) (1995) 921–927, https://doi.org/10.1093/oxfordjournals.molbev.a040269.
[19] WMDA Total Number of Donors and Cord blood units. https://statistics.wmda.info/ (accessed 10 December 2020).