Microarrays
Introduction
In this white paper, we:
• Define imputation and describe how it facilitates genome-wide association studies (GWAS)
• Introduce imputation-aware microarray design
• Explore how imputation allows for smaller, highly accurate arrays

The analysis of the human genome continues to bring benefits to scientists and clinicians. One of the keys for success in these efforts is the ability to accurately and cost-effectively analyze variations among humans. Microarray technologies are important contributors to these successes, as demonstrated by efforts such as the Million Veteran Program, UK Biobank studies, and the Taiwan Precision Medicine Initiative. These predictive genomics programs are critical studies that aim to evaluate the correlation between genetic variations and common diseases, disorders, and drug interactions.
This “filling in” of missing genetic information using known haplotypes is known as imputation. More formally, imputation is the statistical inference of genotypes not directly queried in an experiment. Information on known haplotypes in a population is usually obtained from external databases, such as the databases from phase 3 of the 1000 Genomes Project and the Trans-Omics for Precision Medicine (TOPMed) program. These known haplotypes are then used to infer which variants are likely to be present in adjacent sequences. Imputation increases the practical SNP density of an experiment, reducing the distance between known SNPs for which genetic information can be obtained. Imputation and the resulting increased SNP density help tremendously in narrowing down the location of probable causal variants in genome-wide association studies (GWAS) that test for association between a trait of interest (e.g., a disease) and experimentally untyped genetic variants [5].

Applied Biosystems™ Axiom™ arrays used for genome-wide research studies are designed using a proprietary imputation-based SNP selection method that selects an efficient set of markers covering common and low-frequency variations in the appropriate populations. These algorithms have been used to design arrays for population-wide variation studies, including the UK Biobank [7], Taiwan Precision Medicine project [8], Korea Biobank Project [9], and the U.S. Department of Veterans Affairs’ Million Veteran Program [10]. Similarly, imputation-aware Axiom arrays have been designed for the investigation of prostate and other cancers in African men [11] and for investigating macular degeneration [12]. The successes of these projects have been facilitated by the ease with which low-frequency variants can be imputed for the specific populations for which the arrays have been designed.
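To make the idea of "filling in" concrete, the following is a minimal sketch of imputation by haplotype matching. It is a toy illustration only: the reference haplotypes, the `impute` function, and the majority-vote rule are assumptions made for this example. Production imputation tools rely on statistical models of haplotype structure rather than exact matching.

```python
from collections import Counter

# Hypothetical reference panel: each haplotype is a string of alleles (0/1)
# at five neighboring SNP positions. Real panels hold thousands of haplotypes.
REFERENCE_HAPLOTYPES = [
    "01101",
    "01101",
    "10010",
    "10010",
    "10110",
]


def impute(observed, reference=REFERENCE_HAPLOTYPES):
    """Fill '?' positions in `observed` from the closest reference haplotypes.

    `observed` is a string of '0', '1', or '?' (a position the array did
    not query). Returns the completed haplotype as a string.
    """
    typed = [i for i, allele in enumerate(observed) if allele != "?"]

    def mismatches(hap):
        # Count disagreements at the positions that were actually typed.
        return sum(hap[i] != observed[i] for i in typed)

    best = min(mismatches(h) for h in reference)
    closest = [h for h in reference if mismatches(h) == best]

    result = []
    for i, allele in enumerate(observed):
        if allele != "?":
            result.append(allele)
        else:
            # Majority vote among the closest haplotypes at the untyped site.
            votes = Counter(h[i] for h in closest)
            result.append(votes.most_common(1)[0][0])
    return "".join(result)


if __name__ == "__main__":
    print(impute("0?1?1"))
```

In this toy panel, typing only three of the five positions is enough to recover the other two, which is the sense in which imputation raises the practical SNP density of an experiment.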
2 Microarrays thermofisher.com/microarrays
Basics of imputation-aware designs used in Axiom arrays
Thermo Fisher Scientific has developed a proprietary imputation-aware design pipeline.
This automated pipeline has five steps (Figure 1):
A. Selection of markers is initiated from a pool that holds experimentally verified, high-performing probe sets (Figure 1A).

B. A reference panel is used to capture genomic variation in the target populations. For example, for a European array, the 1000 Genomes Project reference panel would be used as a reference and supplemented with other appropriate panels as needed (Figure 1B).

C. A “greedy” marker selection occurs as illustrated in Figure 1C. The triangles represent the haplotype structure. The algorithm simultaneously considers the haplotype structures of multiple populations. After considering markers already selected for the array, each verified marker is evaluated for its marginal contribution to coverage of the genome. The design process ends up picking more than one marker per haplotype block to cover the different frequency ranges. Because the process considers multiple populations together, it will preferentially select markers that contribute to all populations. The process iteratively selects ~10% of the markers and then evaluates imputation coverage to optimize the next selection.

D. The imputation-aware aspect of GWAS backbone selection is illustrated in Figure 1D. The pairwise part of the algorithm considers only pairwise linkage disequilibrium (LD). Additional target markers are then identified as covered by imputation from the markers that have already been selected. If a target marker is covered by imputation in some population, that marker does not need to be covered again in that population. Marking these targets as covered allows the next round of marker selection to focus on a new set of markers that are not yet covered.

E. After identifying markers covered by imputation, step C is repeated and another set of markers is selected using pairwise LD, continuing until the array is filled or the target marker set is exhausted.
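The loop formed by steps C through E can be sketched as follows. This is not the proprietary pipeline, only an illustrative greedy set-cover under stated assumptions: the function name `greedy_select`, the r² threshold of 0.8, the `imputed_by` callback standing in for an imputation-coverage evaluation, and the `batch_fraction` of 0.1 (mirroring the "~10% of the markers" per round) are all hypothetical.

```python
def greedy_select(candidates, targets, r2, capacity, threshold=0.8,
                  imputed_by=None, batch_fraction=0.1):
    """Greedy, imputation-aware tag selection (illustrative only).

    candidates: verified marker IDs that may be placed on the array
    targets:    marker IDs that should be covered
    r2:         dict mapping (candidate, target) -> pairwise r^2
    capacity:   maximum number of markers on the array
    imputed_by: optional callable(selected_list) -> set of targets assumed
                to be recoverable by imputation from the selected markers
    Returns (selected markers in pick order, targets still uncovered).
    """
    uncovered = set(targets)
    remaining = list(candidates)
    selected = []
    batch_size = max(1, int(capacity * batch_fraction))

    def tagged(marker):
        # Targets this marker would newly cover through pairwise LD.
        # A marker covers itself if it is also a target.
        return {t for t in uncovered
                if t == marker or r2.get((marker, t), 0.0) >= threshold}

    while uncovered and remaining and len(selected) < capacity:
        # Step C: pick a batch of markers by marginal pairwise-LD coverage.
        for _ in range(batch_size):
            if not (uncovered and remaining and len(selected) < capacity):
                break
            best = max(remaining, key=lambda m: len(tagged(m)))
            gain = tagged(best)
            if not gain:
                # No remaining candidate tags anything new.
                return selected, uncovered
            selected.append(best)
            remaining.remove(best)
            uncovered -= gain
        # Step D: drop targets that imputation is assumed to cover already.
        if imputed_by is not None:
            uncovered -= imputed_by(selected)
    # Step E is the outer loop: repeat until the array is full or the
    # target set is exhausted.
    return selected, uncovered


if __name__ == "__main__":
    ld = {("a", "t1"): 0.9, ("a", "t2"): 0.85,
          ("b", "t3"): 0.9, ("c", "t4"): 0.95}
    targets = ["t1", "t2", "t3", "t4"]
    # Pairwise LD only: all three candidates are needed.
    print(greedy_select(["a", "b", "c"], targets, ld, capacity=3))
    # Pretend t3 becomes imputable once marker "a" is on the array.
    print(greedy_select(["a", "b", "c"], targets, ld, capacity=3,
                        imputed_by=lambda sel: {"t3"} if "a" in sel else set()))
```

In the second call, the hypothetical imputation check removes `t3` from the uncovered set after the first round, so marker `b` is never needed. That saving is the reason for evaluating imputation coverage between selection rounds.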
Figure 1. Steps for imputation-aware microarray probe design. (A) Use all known SNPs and indels. (B) ~11M unique markers from Axiom database. (C) Use a “greedy” algorithm to iteratively select verified markers to tag additional markers with pairwise r². (D) Additional target markers will be identified to cover by imputation.
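For readers unfamiliar with the pairwise r² statistic used for tagging, a minimal sketch of its standard definition is given below. It assumes phased, biallelic haplotype data coded as 0/1; the function name `pairwise_r2` is hypothetical and not part of any Axiom software.

```python
def pairwise_r2(alleles_a, alleles_b):
    """Squared correlation (r^2) between two biallelic sites.

    Each argument is a list of 0/1 alleles, one entry per phased haplotype,
    given in the same haplotype order for both sites.
    """
    if len(alleles_a) != len(alleles_b) or not alleles_a:
        raise ValueError("need two equally sized, non-empty allele lists")
    n = len(alleles_a)
    p_a = sum(alleles_a) / n                  # frequency of allele 1 at site A
    p_b = sum(alleles_b) / n                  # frequency of allele 1 at site B
    p_ab = sum(a & b for a, b in zip(alleles_a, alleles_b)) / n
    d = p_ab - p_a * p_b                      # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                            # a monomorphic site carries no LD signal
    return d * d / denom


if __name__ == "__main__":
    site_a = [0, 0, 1, 1]
    print(pairwise_r2(site_a, [0, 0, 1, 1]))  # alleles always travel together
    print(pairwise_r2(site_a, [0, 1, 0, 1]))  # alleles are independent
```

A value near 1 means a typed marker is a good proxy for the other site, while a value near 0 means it carries little information about it.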
Relationship between SNP density and imputation accuracy
Microarrays can be designed to have SNPs that are chosen to query functional variants or convey other structural features such as CNVs. While increasing numbers of SNPs can provide more coverage for these variants on the chip, higher SNP density does not necessarily mean higher imputation accuracy. This is because the sequences chosen to query the SNP may not include important haplotype information that can be used to impute the surrounding sequences. For example, microarrays with fewer than 800K unoptimized SNP probes can be used to impute genotypes from different populations with a predictable accuracy (Figure 2, blue). Increasing the number of unoptimized SNP probes to 2M results in an increase in imputation accuracy (Figure 2, yellow). However, an 800K probe Axiom array that has been optimized for imputation using the algorithms described above can produce the same accuracy as a 2M probe array (Figure 2, red). In a similar study, Nguyen et al. compared the imputation accuracy of a wide variety of commercially available microarrays. This study found that having more SNPs on a chip does not necessarily translate to higher accuracy. The authors note that “in order to obtain high imputation performances, the choice of an array is not necessarily about getting higher density, but small- to moderately-sized arrays, accompanied by well optimization (sic) for the targeted population, could also produce high imputation and PGS performance” [13]. These results imply that, with designs based on imputation-aware algorithms, genome-wide information can be extracted from arrays that are manufactured, used, and analyzed more efficiently.
[Figure 2 plot: imputation accuracy (y-axis, ticks 0.6 to 0.9) against number of markers (x-axis: 800K, 1M, 2M, repeated for each panel). Legend: Axiom prototype OFH, Illumina GDA, Illumina GSA, Hypothetical 1.2M.]
Figure 2. Imputation accuracy and SNP density. The Axiom prototype array was designed with imputation-aware algorithms; the other microarrays were not. Data are shown for both common (minor allele frequency (MAF) >5%) and rare (MAF 1–5%) alleles. The light blue and black curves correspond to maximum achievable imputation accuracy using the 1000 Genomes Phase 3 reference panel for MAF >5% and MAF 1–5% markers, respectively, as a function of marker density. Note that the performance of the Axiom array conforms to the expected hypothetical performance and is equal to that of the high-density array across each of the population groups.
method produces a tremendous amount of data and can uncover novel information. However, this power also creates the need for complex bioinformatics support for interpretation of results. In addition, the large amounts of data generated necessitate
wide variant data with high throughput and cost-efficiency
[Figure plot: y-axis ticks 0.25, 0.50, 0.75; x-axis: Non-reference allele frequency. Legend: 0.1x LPS, 0.5x LPS, 1x LPS, Axiom UK Biobank array (typed).]
Conclusions
Modern microarrays provide cost-effective and accurate solutions for collecting genomic information for population studies. When an array is designed specifically for optimizing imputation, the power of the microarray increases without reducing accuracy, eliminating the need for arrays of increasing size. The Applied Biosystems™ Axiom™ PangenomiX™ Array was designed for whole-genome imputation with diverse and global population coverage. More than 800,000 markers were selected for high genome coverage using the database from phase 3 of the 1000 Genomes Project, yielding imputable coverage for European, African, admixed American, East Asian, and South Asian populations. Variants prevalent in these populations are present with probe design that facilitates imputation. In addition, the Axiom PangenomiX Array has been designed for use in CNV analysis, HLA typing, disease-related variant detection, blood phenotyping, and pharmacogenomic analysis. This array brings ethnic diversity to researchers, enabling them to identify population-specific associations for better understanding of complex diseases. For more information, please contact your representative at thermofisher.com/pangenomixcontact.

Find out more about microarrays for population-scale testing at thermofisher.com/predictivegenomics

References
1. Mizrahi-Man O, Woehrmann MH, Webster TA, et al. Novel genotyping algorithms for rare variants significantly improve the accuracy of Applied Biosystems™ Axiom™ array genotyping calls: Retrospective evaluation of UK Biobank array data. PLoS One. 2022;17(11):e0277680. Published 2022 Nov 17. doi:10.1371/journal.pone.0277680
2. Lemieux Perreault LP, Zaïd N, Cameron M, Mongrain I, Dubé MP. Pharmacogenetic content of commercial genome-wide genotyping arrays. Pharmacogenomics. 2018;19(15):1159-1167. doi:10.2217/pgs-2017-0129
3. Guo Y, Busch MP, Seielstad M, et al. Development and evaluation of a transfusion medicine genome wide genotyping array. Transfusion. 2019;59(1):101-111. doi:10.1111/trf.15012
4. Lane W. Recent advances in blood group genotyping. Annals of Blood. 2021;6. doi:10.21037/aob-21-8
5. Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11(7):499-511. doi:10.1038/nrg2796
6. Wojcik GL, Fuchsberger C, Taliun D, et al. Imputation-Aware Tag SNP Selection To Improve Power for Large-Scale, Multi-ethnic Association Studies. G3 (Bethesda). 2018;8(10):3255-3267. Published 2018 Oct 3. doi:10.1534/g3.118.200502
7. Bycroft C, Freeman C, Petkova D, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203-209. doi:10.1038/s41586-018-0579-z
8. Wei CY, Yang JH, Yeh EC, et al. Genetic profiles of 103,106 individuals in the Taiwan Biobank provide insights into the health and history of Han Chinese. NPJ Genom Med. 2021;6(1):10. Published 2021 Feb 11. doi:10.1038/s41525-021-00178-9
9. Moon S, Kim YJ, Han S, et al. The Korea Biobank Array: Design and Identification of Coding Variants Associated with Blood Biochemical Traits. Sci Rep. 2019;9(1):1382. Published 2019 Feb 4. doi:10.1038/s41598-018-37832-9
10. Hunter-Zinck H, Shi Y, Li M, et al. Genotyping Array Design and Data Quality Control in the Million Veteran Program. Am J Hum Genet. 2020;106(4):535-548. doi:10.1016/j.ajhg.2020.03.004
11. Harlemon M, Ajayi O, Kachambwa P, et al. A Custom Genotyping Array Reveals Population-Level Heterogeneity for the Genetic Risks of Prostate Cancer and Other Cancers in Africa. Cancer Res. 2020;80(13):2956-2966. doi:10.1158/0008-5472.CAN-19-2165
12. Robman LD, Phuong Thao LT, Guymer RH, et al. Baseline characteristics and age-related macular degeneration in participants of the “ASPirin in Reducing Events in the Elderly” (ASPREE)-AMD trial. Contemp Clin Trials Commun. 2020;20:100667. Published 2020 Oct 11. doi:10.1016/j.conctc.2020.100667
13. Nguyen DT, Tran TTH, Tran MH, et al. A comprehensive evaluation of polygenic score and genotype imputation performances of human SNP arrays in diverse populations. Sci Rep. 2022;12(1):17556. Published 2022 Oct 20. doi:10.1038/s41598-022-22215-y
14. Scheet P, Stephens M. A fast and flexible statistical model for large-scale population
genotype data: applications to inferring missing genotypes and haplotypic phase. Am J
Hum Genet. 2006;78(4):629-644. doi:10.1086/502802