PG2607 PJT11009 COL28204 Imputation Aware Design Whitepaper D1 CG - SMJ - AM

White paper | Axiom genotyping arrays
Microarrays
Bigger isn’t always better:
How imputation-aware design is solving the need to “go big”
In this white paper, we:
• Define imputation and describe how it facilitates genome-wide association studies (GWAS)
• Introduce imputation-aware microarray design
• Explore how imputation allows for smaller, highly accurate arrays
Introduction
The analysis of the human genome continues to bring benefits to scientists and
clinicians. One of the keys for success in these efforts is the ability to accurately
and cost-effectively analyze variations among humans. Microarray technologies are
important contributors to these successes, as demonstrated by efforts such as the
Million Veteran Program, UK Biobank studies, and the Taiwan Precision Medicine
Initiative. These predictive genomics programs are critical studies that aim to evaluate
the correlation between genetic variations and common diseases, disorders, and
drug interactions.
Microarrays facilitate the analysis of genetic variation by analyzing hundreds of
thousands to millions of variants in a single assay. Microarrays allow highly accurate
genotyping of variants directly assayed on the arrays. This includes genotyping of ultra-
rare variants and variants within challenging regions characterized by complex structural
rearrangements and high sequence homology [1]. Notably, microarrays have proven
highly accurate in predictive genomics applications such as pharmacogenomics [2]
and blood group antigen typing [3,4]. Moreover, microarrays offer high genome-wide
imputation accuracy across diverse populations for applying polygenic risk scores (PRS)
in predicting disease risk across individuals with different ethnic backgrounds.
Briefly, immobilized oligonucleotides of known sequence hybridize to labeled genomic
DNA. The presence or absence of known sequences is subsequently detected by
imaging. Thus, specific genotypes can be quickly determined. At the simplest level,
the oligonucleotide sequences on the chip are chosen to match the variants that are
meant to be queried on a microarray. To query all possible variants, a very large number
of sequences must be designed and immobilized on the array. This requirement for
large array capacity can introduce practical limitations on the number of sequences
that can be analyzed. Choices made due to these limitations can result in incomplete
data, reducing the statistical power and limiting the detection of significant genetic
associations. These missed associations have the potential to greatly impact study
results, ultimately impeding researchers and clinicians alike.
However, much more genetic information can be gathered indirectly from these types of
experiments. Although recombination shuffles entire genomes with every meiotic event,
it is accepted that variants that are close to each other are less likely to be separated
by recombination. Segments containing such variants are therefore usually inherited
together and are in linkage disequilibrium (LD); these segments are referred to as
haplotypes. A linear oligonucleotide sequence that is immobilized on a chip will
therefore detect a single, specific haplotype containing any variants within that
sequence. Having information on a single variant in an oligo can provide information
about all variants in that haplotype, even if the haplotype extends across several
oligonucleotides.
This “filling in” of missing genetic information using known haplotypes is known as
imputation. More formally, imputation is the statistical inference of genotypes not
directly queried in an experiment. Information on known haplotypes in a population
is usually obtained from external databases, such as the databases from phase 3 of
the 1000 Genomes Project and the Trans-Omics for Precision Medicine (TOPMed)
program. These known haplotypes are then used to infer which variants are likely
to be present in adjacent sequences. Imputation increases the practical SNP density
of an experiment, reducing the distance between known SNPs for which genetic
information can be obtained. Imputation and the resulting increased SNP density
help tremendously in narrowing down the location of probable causal variants in
genome-wide association studies (GWAS) that test for association between a trait of
interest (e.g., a disease) and experimentally untyped genetic variants [5].
Importance of choosing haplotypes that facilitate imputation
In principle, imputation can be used to infer genotypic information in any large set of
oligonucleotide sequences. By judiciously choosing sequences with imputation in mind,
the number of oligos required for whole-genome coverage can be minimized, reducing
costs and increasing efficiency of the analysis. For example, choosing oligonucleotide
sequences that are haplotypes commonly found in African populations increases the
utility of such arrays for use with African populations as a whole. These
imputation-aware design strategies for microarrays have been shown to improve power
when used for large multi-ethnic GWAS [6].
Applied Biosystems™ Axiom™ arrays used for genome-wide research studies are designed,
using a proprietary imputation-based SNP selection method, to optimally select an
efficient set of markers that cover common and low-frequency variations in the
appropriate populations. These algorithms have been implemented to design arrays that
have been used in population-wide variation studies, including the UK Biobank [7],
Taiwan Precision Medicine project [8], Korea Biobank Project [9], and the U.S.
Department of Veterans Affairs’ Million Veteran Program [10]. Similarly,
imputation-aware Axiom arrays have been designed for the investigation of prostate and
other cancers in African men [11] and for investigating macular degeneration [12]. The
successes of these projects have been facilitated by the ease with which low-frequency
variants can be imputed for the specific populations for which the arrays have been
designed.
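The linkage disequilibrium and imputation ideas described above can be made concrete with a small sketch. This is a toy illustration only, not the statistical model used by production imputation software: the reference panel `REFERENCE`, the function names `ld_r2` and `impute_allele`, and the majority-vote rule are all assumptions introduced here. It computes pairwise r² between two sites from phased haplotypes, then guesses an untyped allele from reference haplotypes that carry the same typed allele.

```python
from collections import Counter


def ld_r2(haplotypes, i, j):
    """Pairwise LD (r^2) between sites i and j over phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n          # allele frequency at site i
    p_j = sum(h[j] for h in haplotypes) / n          # allele frequency at site j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic site carries no LD information
    d = p_ij - p_i * p_j
    return d * d / denom


def impute_allele(reference, typed_site, typed_allele, target_site):
    """Toy imputation: majority vote among reference haplotypes that
    match the observed allele at the typed site (assumed rule)."""
    matches = [h[target_site] for h in reference if h[typed_site] == typed_allele]
    if not matches:
        return None  # nothing in the panel matches, so no inference is made
    return Counter(matches).most_common(1)[0][0]


# Hypothetical reference panel: 8 phased haplotypes over 3 sites.
# Sites 0 and 1 always travel together; site 2 is independent of both.
REFERENCE = [
    (0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1),
    (1, 1, 0), (1, 1, 1), (1, 1, 0), (1, 1, 1),
]

if __name__ == "__main__":
    print("r2(site0, site1) =", ld_r2(REFERENCE, 0, 1))
    print("r2(site0, site2) =", ld_r2(REFERENCE, 0, 2))
    print("imputed site1 given site0=1:", impute_allele(REFERENCE, 0, 1, 1))
```

In this toy panel, typing site 0 is enough to recover site 1, while site 2 would still need its own probe or a different tag.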
Basics of imputation-aware designs used in Axiom arrays
Thermo Fisher Scientific has developed a proprietary imputation-aware design pipeline.
This automated pipeline has five steps (Figure 1):
A. Selection of markers is initiated from a pool that holds experimentally verified,
high-performing probe sets (Figure 1A).
B. A reference panel is used to capture genomic variation in the target populations.
For example, for a European array, the 1000 Genomes Project reference panel would be
used as a reference and supplemented with other appropriate panels as needed
(Figure 1B).
C. A “greedy” marker selection occurs as illustrated in Figure 1C. The triangles
represent the haplotype structure. The algorithm simultaneously considers the haplotype
structures of multiple populations. After considering markers already selected for
the array, each verified marker is evaluated for its marginal contribution to coverage
of the genome. The design process ends up picking more than one marker per haplotype
block to cover the different frequency ranges. Because the process considers multiple
populations together, it will preferentially select markers that contribute to all
populations. The process iteratively selects ~10% of the markers and then evaluates
imputation coverage to optimize the next selection.
D. The imputation-aware aspect of GWAS backbone selection is illustrated in Figure 1D.
The pairwise part of the algorithm considers only pairwise LD. Additional target
markers are then identified as covered by imputation from the markers that have already
been selected. If a target marker is covered by imputation in some population, that
marker does not need to be covered again in that population. This allows the next round
of marker selection to target a new set of markers that are not yet covered.
E. After identifying markers covered by imputation, step C is repeated and another set
of markers is selected using pairwise LD, continuing until the array is filled or the
target marker set is exhausted.
[Figure 1: four panels, A to D. Legend: Verified markers; Selected markers; Covered (r² ≥ 0.8); Covered pairwise; Covered by imputation.]
Figure 1. Steps for imputation-aware microarray probe design. (A) Use all known SNPs and indels. (B) ~11M unique markers from Axiom
database. (C) Use a “greedy” algorithm to iteratively select verified markers to tag additional markers with pairwise r². (D) Additional target markers will
be identified to cover by imputation.
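The greedy selection loop in steps C to E can be sketched roughly as follows. The actual Axiom pipeline is proprietary and is not reproduced here; this is a generic greedy tag-marker selection under stated assumptions. The names `tagged_by`, `greedy_select`, `R2_BY_POP`, the marker IDs `m1` to `m6`, and the optional `imputed` callback (a stand-in for the imputation-coverage check in step D) are all hypothetical. The sketch also re-evaluates coverage after every pick, whereas the text describes selecting about 10% of the markers per round.

```python
R2_THRESHOLD = 0.8  # pairwise coverage threshold, as in the Figure 1 legend


def tagged_by(marker, r2_by_pop):
    """Return the (population, marker) pairs covered pairwise by `marker`."""
    covered = set()
    for pop, r2 in r2_by_pop.items():
        covered.add((pop, marker))  # a directly typed marker covers itself
        for other, value in r2.get(marker, {}).items():
            if value >= R2_THRESHOLD:
                covered.add((pop, other))
    return covered


def greedy_select(candidates, targets, r2_by_pop, capacity, imputed=None):
    """Pick markers by largest marginal coverage summed over all populations.

    `imputed` is a hypothetical hook: given the markers selected so far, it
    returns extra (population, marker) pairs covered by imputation.
    """
    selected = []
    uncovered = set(targets)
    while uncovered and len(selected) < capacity:
        best, best_gain = None, 0
        for marker in candidates:
            if marker in selected:
                continue
            gain = len(tagged_by(marker, r2_by_pop) & uncovered)
            if gain > best_gain:
                best, best_gain = marker, gain
        if best is None:
            break  # no remaining candidate covers anything new
        selected.append(best)
        uncovered -= tagged_by(best, r2_by_pop)
        if imputed is not None:
            uncovered -= imputed(selected)
    return selected, uncovered


# Hypothetical pairwise r^2 values for two populations.
R2_BY_POP = {
    "EUR": {
        "m1": {"m2": 0.90},
        "m2": {"m1": 0.90, "m3": 0.85},
        "m3": {"m2": 0.85},
        "m4": {"m5": 0.95, "m6": 0.90},
        "m5": {"m4": 0.95},
        "m6": {"m4": 0.90},
    },
    "AFR": {
        "m1": {"m2": 0.60},
        "m2": {"m1": 0.60, "m3": 0.90},
        "m3": {"m2": 0.90},
        "m4": {"m5": 0.82, "m6": 0.88},
        "m5": {"m4": 0.82},
        "m6": {"m4": 0.88},
    },
}
MARKERS = ["m1", "m2", "m3", "m4", "m5", "m6"]
TARGETS = {(pop, m) for pop in R2_BY_POP for m in MARKERS}

if __name__ == "__main__":
    print(greedy_select(MARKERS, TARGETS, R2_BY_POP, capacity=6))
```

Because gains are summed over populations, `m4` is picked first in this toy data: it tags its neighbors in both populations, which mirrors the preference for markers that contribute to all populations.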
Relationship between SNP density and imputation accuracy
Microarrays can be designed to have SNPs that are chosen to query functional variants
or convey other structural features such as CNVs. While increasing numbers of SNPs can
provide more coverage for these variants on the chip, higher SNP density does not
necessarily mean increasing imputation accuracy. This is because the sequences chosen
to query the SNP may not include important haplotype information that can be used to
impute the surrounding sequences. For example, microarrays with fewer than 800K
unoptimized SNP probes can be used to impute genotypes from different populations with
a predictable accuracy (Figure 2, blue). Increasing the number of unoptimized SNP
probes to 2M results in an increase in imputation accuracy (Figure 2, yellow). However,
an 800K probe Axiom array that has been optimized for imputation using the algorithms
described above can produce the same accuracy as a 2M probe array (Figure 2, red). In a
similar study, Nguyen et al. compared the imputation accuracy of a wide variety of
commercially available microarrays. This study found that having more SNPs on a chip
does not necessarily translate to higher accuracy. The authors note that “in order to
obtain high imputation performances, the choice of an array is not necessarily about
getting higher density, but small- to moderately-sized arrays, accompanied by well
optimization (sic) for the targeted population, could also produce high imputation
and PGS performance” [13]. These results imply that by using designs based on
imputation-aware algorithms, genome-wide information can be extracted from arrays that
are manufactured, used, and analyzed more efficiently.
[Figure 2: one panel per population (AFR, AMR, EAS, EUR, SAS). x-axis: Number of markers (800K, 1M, 2M); y-axis: Mean imputation r² (0.6 to 1.0). Legend: Axiom prototype OFH; Illumina GDA; Illumina GSA; Hypothetical 1.2M.]
Figure 2. Imputation accuracy and SNP density. The Axiom prototype array was designed with imputation-aware algorithms; the other microarrays
were not. Data are shown for both common (minor allele frequency (MAF) >5%) and rare (MAF 1–5%) alleles. The light blue and black curves
correspond to maximum achievable imputation accuracy using the 1000 Genomes Phase 3 reference panel for MAF >5% and MAF 1–5% markers,
respectively, as a function of marker density. Note that the performance of the Axiom array conforms to the expected hypothetical performance and is
equal to that of the high-density array across each of the population groups.
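For readers who want to see how a "mean imputation r²" of the kind plotted in Figure 2 might be computed, here is a minimal sketch. It assumes the common definition of imputation r² as the squared Pearson correlation between true genotypes (0/1/2 alternate-allele counts) and imputed dosages, averaged within the two MAF bins named in the caption. The white paper does not state its exact formula, so this definition, the function names, and the bin labels are assumptions.

```python
from statistics import mean


def pearson_r2(x, y):
    """Squared Pearson correlation; None when either vector is constant."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def maf(genotypes):
    """Minor allele frequency from 0/1/2 alternate-allele counts."""
    af = sum(genotypes) / (2 * len(genotypes))
    return min(af, 1 - af)


def mean_r2_by_bin(variants):
    """variants: iterable of (true_genotypes, imputed_dosages) per variant."""
    bins = {"MAF 1-5%": [], "MAF >5%": []}
    for truth, dosage in variants:
        r2 = pearson_r2(truth, dosage)
        if r2 is None:
            continue  # skip variants where r^2 is undefined
        freq = maf(truth)
        if freq > 0.05:
            bins["MAF >5%"].append(r2)
        elif freq >= 0.01:
            bins["MAF 1-5%"].append(r2)
    return {name: (mean(values) if values else None) for name, values in bins.items()}


# Hypothetical data: one common variant and one rare variant.
COMMON = ([0, 1, 2, 1, 0, 2, 1, 0, 1, 2], [0, 1, 2, 1, 0, 2, 1, 0, 1, 2])
RARE = ([1] + [0] * 19, [0.8] + [0.1] * 19)

if __name__ == "__main__":
    print(mean_r2_by_bin([COMMON, RARE]))
```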
Comparison with next-generation sequencing (NGS)
Another method commonly used for GWAS and population studies is whole-genome or
whole-exome sequencing. This method produces a tremendous amount of data and can
uncover novel information. However, this power also creates the need for complex
bioinformatics support for interpretation of results. In addition, the large amounts
of data generated necessitate large amounts of storage and facile means to retrieve
the data. Therefore, there is a need for a solution that can extract genome-wide
variant data with high throughput and cost-efficiency without compromising accuracy.
In some cases, low-pass sequencing (LPS) has been suggested as a cost-effective method
for obtaining genomic information. However, imputation-aware microarray designs allow
much higher accuracy than 0.1x LPS when genotyping rare variants and variants in
difficult regions, and their accuracy still exceeds that of the more expensive 0.5x
and 1x sequencing coverage (Figure 3). For these reasons, imputation-aware microarrays
have become the platform of choice for large, population-scale genotyping efforts.
[Figure 3: x-axis: Non-reference allele frequency (<1%, 1-2%, 2-3%, 3-4%, 4-5%); y-axis: Average non-reference allele concordance (0.00 to 1.00). Legend: 0.1x LPS; 0.5x LPS; 1x LPS; Axiom UK Biobank array (typed).]
Figure 3. Non-reference allele concordance of low-frequency variants. The imputation
concordance was compared between an imputation-aware Axiom array and different low-pass
coverages using NGS. For this study, European population–enriched samples were used.
Note that the accuracy of the Axiom array exceeds that of low-pass NGS.
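The concordance metric on the Figure 3 axis can be illustrated with a short sketch. The white paper does not define how non-reference allele concordance was calculated, so the definition below is an assumption: genotype pairs where both the truth and the call are homozygous reference are excluded, and concordance is the fraction of the remaining pairs that agree exactly. The function name and the use of `None` for missing calls are hypothetical.

```python
def non_reference_concordance(truth, calls):
    """Fraction of agreeing genotypes among pairs where at least one of
    truth/call carries a non-reference allele (assumed definition).

    Genotypes are 0/1/2 alternate-allele counts; None marks a missing call.
    """
    considered = 0
    agree = 0
    for t, c in zip(truth, calls):
        if t is None or c is None:
            continue  # missing data is not scored
        if t == 0 and c == 0:
            continue  # both homozygous reference: excluded from the metric
        considered += 1
        if t == c:
            agree += 1
    return agree / considered if considered else None


if __name__ == "__main__":
    truth = [0, 0, 1, 2, 1, 0]
    calls = [0, 1, 1, 2, 0, 0]
    print(non_reference_concordance(truth, calls))
```

Excluding the homozygous-reference pairs matters for rare variants: almost every sample is homozygous reference, so overall concordance would look near perfect even when most carriers are called incorrectly.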
Conclusions
Modern microarrays provide cost-effective and accurate solutions for collecting genomic
information for population studies. When an array is designed specifically for
optimizing imputation, the power of the microarray increases without reducing accuracy,
eliminating the need for arrays of increasing size. The Applied Biosystems™ Axiom™
PangenomiX™ Array was designed for whole-genome imputation with diverse and global
population coverage. More than 800,000 markers were selected for high genome coverage
using the database from phase 3 of the 1000 Genomes Project, yielding imputable
coverage for European, African, admixed American, East Asian, and South Asian
populations. Variants prevalent in these populations are present with probe design
that facilitates imputation. In addition, the Axiom PangenomiX Array has been designed
for use in CNV analysis, HLA typing, disease-related variant detection, blood
phenotyping, and pharmacogenomic analysis. This array brings ethnic diversity to
researchers, enabling them to identify population-specific associations for better
understanding of complex diseases. For more information, please contact your
representative at thermofisher.com/pangenomixcontact.
Find out more about microarrays for population-scale testing at
thermofisher.com/predictivegenomics
References
1. Mizrahi-Man O, Woehrmann MH, Webster TA, et al. Novel genotyping algorithms
for rare variants significantly improve the accuracy of Applied Biosystems™ Axiom™
array genotyping calls: Retrospective evaluation of UK Biobank array data. PLoS One.
2022;17(11):e0277680. Published 2022 Nov 17. doi:10.1371/journal.pone.0277680
2. Lemieux Perreault LP, Zaïd N, Cameron M, Mongrain I, Dubé MP. Pharmacogenetic
content of commercial genome-wide genotyping arrays. Pharmacogenomics.
2018;19(15):1159-1167. doi:10.2217/pgs-2017-0129
3. Guo Y, Busch MP, Seielstad M, et al. Development and evaluation of a transfusion
medicine genome wide genotyping array. Transfusion. 2019;59(1):101-111.
doi:10.1111/trf.15012
4. Lane W. Recent advances in blood group genotyping. Annals of Blood. 2021;6.
doi:10.21037/aob-21-8
5. Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat
Rev Genet. 2010;11(7):499-511. doi:10.1038/nrg2796
6. Wojcik GL, Fuchsberger C, Taliun D, et al. Imputation-Aware Tag SNP Selection To
Improve Power for Large-Scale, Multi-ethnic Association Studies. G3 (Bethesda).
2018;8(10):3255-3267. Published 2018 Oct 3. doi:10.1534/g3.118.200502
7. Bycroft C, Freeman C, Petkova D, et al. The UK Biobank resource with deep
phenotyping and genomic data. Nature. 2018;562(7726):203-209.
doi:10.1038/s41586-018-0579-z
8. Wei CY, Yang JH, Yeh EC, et al. Genetic profiles of 103,106 individuals in the Taiwan
Biobank provide insights into the health and history of Han Chinese. NPJ Genom Med.
2021;6(1):10. Published 2021 Feb 11. doi:10.1038/s41525-021-00178-9
9. Moon S, Kim YJ, Han S, et al. The Korea Biobank Array: Design and Identification of
Coding Variants Associated with Blood Biochemical Traits. Sci Rep. 2019;9(1):1382.
Published 2019 Feb 4. doi:10.1038/s41598-018-37832-9
10. Hunter-Zinck H, Shi Y, Li M, et al. Genotyping Array Design and Data Quality Control in
the Million Veteran Program. Am J Hum Genet. 2020;106(4):535-548.
doi:10.1016/j.ajhg.2020.03.004
11. Harlemon M, Ajayi O, Kachambwa P, et al. A Custom Genotyping Array Reveals
Population-Level Heterogeneity for the Genetic Risks of Prostate Cancer and Other
Cancers in Africa. Cancer Res. 2020;80(13):2956-2966.
doi:10.1158/0008-5472.CAN-19-2165
12. Robman LD, Phuong Thao LT, Guymer RH, et al. Baseline characteristics and
age-related macular degeneration in participants of the “ASPirin in Reducing Events
in the Elderly” (ASPREE)-AMD trial. Contemp Clin Trials Commun. 2020;20:100667.
Published 2020 Oct 11. doi:10.1016/j.conctc.2020.100667
13. Nguyen DT, Tran TTH, Tran MH, et al. A comprehensive evaluation of polygenic
score and genotype imputation performances of human SNP arrays in diverse
populations. Sci Rep. 2022;12(1):17556. Published 2022 Oct 20.
doi:10.1038/s41598-022-22215-y
14. Scheet P, Stephens M. A fast and flexible statistical model for large-scale population
genotype data: applications to inferring missing genotypes and haplotypic phase. Am J
Hum Genet. 2006;78(4):629-644. doi:10.1086/502802
Learn more at thermofisher.com/microarrays

For Research Use Only. Not for use in diagnostic procedures. © 2024 Thermo Fisher Scientific Inc. All rights reserved. All
trademarks are the property of Thermo Fisher Scientific and its subsidiaries unless otherwise specified. NIMBUS is a registered trademark
of Hamilton Co. Tecan is a trademark of Tecan Group Ltd. Beckman Coulter is a trademark of Beckman Coulter Inc. COL28204 0124
