
Reviews

Opportunities and challenges for the use of common controls in sequencing studies
Genevieve L. Wojcik1, Jessica Murphy2,3, Jacob L. Edelson4, Christopher R. Gignoux2,5,6,
Alexander G. Ioannidis7,8, Alisa Manning9,10, Manuel A. Rivas4, Steven Buyske11
and Audrey E. Hendricks2,3,5,6 ✉
Abstract | Genome-wide association studies using large-scale genome and exome sequencing
data have become increasingly valuable in identifying associations between genetic variants and
disease, transforming basic research and translational medicine. However, this progress has not
been equally shared across all people and conditions, in part due to limited resources. Leveraging
publicly available sequencing data as external common controls, rather than sequencing new
controls for every study, can better allocate resources by augmenting control sample sizes or
providing controls where none existed. However, common control studies must be carefully
planned and executed as even small differences in sample ascertainment and processing can
result in substantial bias. Here, we discuss challenges and opportunities for the robust use
of common controls in high-throughput sequencing studies, including study design, quality
control and statistical approaches. Thoughtful generation and use of large and valuable genetic
sequencing data sets will enable investigation of a broader and more representative set
of conditions, environments and genetic ancestries than otherwise possible.

✉e-mail: audrey.hendricks@ucdenver.edu
https://doi.org/10.1038/s41576-022-00487-4

Monogenic: A condition influenced by one genetic locus.
Oligogenic: A condition influenced by a few genetic loci.
Polygenic: A condition influenced by a large number of genetic loci.
Allele frequencies: The rates of genetic variant types in a specified population.
Common controls: Controls used for multiple studies.

High-throughput genome and exome sequencing has been foundational to advancing the understanding of disease aetiology, precision medicine and drug development1–7. From beginnings in rare diseases with monogenic or oligogenic architecture, which studied relatively small numbers of individuals or even single families8–10, rare variant discovery research has more recently encompassed common complex diseases, that is, those with polygenic architectures, by using substantial resources to recruit and analyse genomes at scale4,11–13. Genome-wide association studies (GWAS) for dichotomous outcomes use array-based or high-throughput sequencing technology to scan and compare the genomes of cases, recruited to have a certain condition, and controls, gathered for direct comparison. Specifically, by comparing the allele frequencies of a variant or the frequencies of rare alleles in a genetic region of interest between the ascertained cases and controls, GWAS aim to identify genetic variants or regions that are associated with the phenotype of interest. However, advances have not been equitably shared across ancestries, environments and conditions, exacerbating inequities in research, healthcare and drug development and leading to poorer understanding of the complete genomic landscape for all14–16. A multifaceted approach is urgently required to address this gap, including advances in sequencing technology to lower costs and enable the generation of large, high-quality studies of diverse genetic ancestries and environments. In the meantime, to extend the utility of existing and future large-scale sequencing studies, data can be leveraged to serve as a community resource for comparison with sequenced cases, an approach generally known as common controls.

Using common controls, rather than sequencing new controls for every study, can boost power to detect genotype–phenotype associations by increasing the sample size or providing a control set where none existed. One can distinguish between internal common controls, which were ascertained and processed as part of a single study that contains multiple case sets of different phenotypes17, and the more frequent external common controls, which were recruited as part of an unrelated study. Although using external common controls as the sole control group is more susceptible to bias and confounding compared with using internal controls, as described in more detail below, the practice is remarkably frequent18–36. Uses of external common controls as the sole control group include data analysis of cases recruited in a case-only20,22,31,33,37 or family design for rare diseases32,38, and to study germline associations in

Nature Reviews | Genetics, volume 23 | November 2022 | 665


cancer for cases originally recruited to identify somatic mutations19,29,39–42. Large external common control data have also been used for more common conditions such as coronary artery disease, obesity, osteoporosis and schizophrenia18,34–36,43 and can be especially successful for the detection of rare, strong-acting alleles. For example, studies using common controls have led to the detection of rare loss-of-function variants in SETD1A associated with schizophrenia35 and of missense and protein-truncating variants in ATG4C associated with Crohn's disease36, among other findings34. Interest in common controls has increased in fields outside genomic research as well, for example, for clinical trials44.

The use of common controls can free resources for functional work, careful phenotyping of cases and additional recruitment of cases from diverse ancestries and environments, helping to address a persistent lack of representation in genomic research45–48. Lack of representation in sequencing studies is compounded in conditions that are neither rare nor common. Given their polygenic and complex architecture, these uncommon conditions require large sample sizes but their relatively low prevalence (for example, 0.2–1.8% for vitiligo)49 makes mobilizing the necessary resources difficult (Fig. 1). In these situations, even large-scale biobanks and general population cohorts may not be adequately powered for discovery, which can be worsened by challenges in deriving precise phenotypes50–52 and stigma for certain phenotypes53. For example, in the 500,000 UK Biobank population cohort, only 2,000 cases of Crohn's disease are available for genetic study, with disease phenotyping derived from a mixture of questionnaire and hospital inpatient record data, whereas focused disease case–control sequencing studies can gather more than 30,000 cases and 80,000 shared controls36. Even when feasible, gathering and sequencing controls for all conditions is not an optimal use of resources due to redundancy in control sets and limited recruitment from under-represented populations.

[Figure 1 labels: y axis, genetic architecture (Mendelian to polygenic); x axis, condition prevalence (rare, uncommon, common). Regions: small family studies and case groups (e.g. Kabuki syndrome, cystic fibrosis); common controls (e.g. alopecia, vitiligo, medical or surgical outcomes, adverse drug effects); very large case–control studies (e.g. T2DM, autism, CHD, Alzheimer disease).]
Fig. 1 | Where to use common controls? Optimal study design for a particular condition is determined, in part, by genetic architecture (y axis) and prevalence (x axis) of the condition. Small family studies and case groups are useful for rare, Mendelian conditions, whereas very large case–control studies can be used for common, polygenic conditions where sufficient resources can be garnered. Uncommon conditions with low to moderate prevalence are likely to be polygenic, necessitating large sample sizes. However, low prevalence of the condition may make recruiting sufficient sample sizes and resources difficult. Whereas common controls are useful across the genetic architecture and prevalence spectrum, common controls are particularly valuable for polygenic, uncommon conditions where sufficient resources may be lacking. CHD, coronary heart disease; T2DM, type 2 diabetes mellitus.

Whereas common controls hold great potential in both array-based GWAS17 and sequencing studies34–36,54, their robust use can pose challenges, with missteps caused by inadequate harmonization of batch effects30,31,55, mismatched ancestries19,37,42,56, incorrect filtering19,21,22,26,28,29,37,57–59 and insufficient documentation of data and methods for reproducible results20,24,28,32,33,37,40,60. These missteps can result in biased and incorrect results, especially when using external common controls as the sole control set. Careful harmonization, analyses and quality control are essential when employing external common controls, given the great potential for bias and confounding. Here, we discuss the data, study design, infrastructure and methods required to incorporate common controls in rigorous rare variant analyses to advance human genetics research. Throughout, we highlight two studies as exemplars of how to incorporate external common controls in a robust manner34,36 (Box 1). Although the focus here is on high-throughput sequencing studies, much of the content is also applicable to array-based GWAS. Ultimately, careful use of common control data can be an important tool for improving statistical power, addressing the lack of data for understudied people or conditions and providing a more complete understanding of the entire human genetics landscape.

Bias: Systematic error (as opposed to error due to chance processes), whether caused by statistical methods, differences between sampled individuals and the population they nominally represent, differences between cases and controls in ascertainment or sample processing, or other issues.

Author addresses
1 Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA.
2 Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, USA.
3 Mathematical and Statistical Sciences, University of Colorado Denver, Denver, CO, USA.
4 Department of Biomedical Data Science, Stanford Medical School, Stanford, CA, USA.
5 Human Medical Genetics and Genomics Program, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
6 Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
7 Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, USA.
8 Clinical and Translational Epidemiology Unit, Massachusetts General Hospital, Boston, MA, USA.
9 Metabolism Program, Broad Institute, Cambridge, MA, USA.
10 Department of Medicine, Harvard Medical School, Boston, MA, USA.
11 Department of Statistics, Rutgers University, Piscataway, NJ, USA.

Data and study design
The use of common controls in genetic studies is unique in that the researcher has already defined a specific research question and ascertained cases. Selection of a common control set is therefore driven by case set characteristics, such as condition prevalence, genetic ancestry, sequencing technology and sample processing. Selecting controls according to these considerations reduces the potential for confounding, both measured and unmeasured. Beyond study characteristics, practical

666 | November 2022 | volume 23 www.nature.com/nrg


considerations such as accessibility, size and type of data (individual level or summary level) may constrain the utility of a common control data set. Below, we discuss aspects of the genetic data, participant demographics and relevant confounders to support appropriate selection of common controls such as those provided in Table 1.

Confounding: A spurious association or lack of association caused by a third variable that is related to both the predictor variable (for example, allele frequency) and the outcome (for example, case status).
Internal controls: Controls that were ascertained, sequenced and processed together with the case sample. By contrast, external common controls were recruited, sequenced and processed separately, often using different technology from the case sample.
Biobanks: Collections of both biological samples (particularly DNA) and health information from individuals generally assembled from a region or a health system.
Harmonization: The formation of a single cohesive data set from two or more separate data sets by standardizing scales, definitions, quality control and other processing.
Batch effects: Differences between groups induced by processing over different times, places or technologies unrelated to biological causes.
Quality control: A process where low-quality data or observations are identified and improved or removed from further analysis.
Statistical power: The probability of rejecting the null hypothesis when it is false.

Ascertainment of common controls
Common control data sets typically fall into one of three study designs: ascertained, convenience and population (Fig. 2). The relevant major difference among these designs is the expected prevalence of the outcome of interest or related conditions and its influence on findings, which is expanded upon below.

Ascertained controls. Ascertained controls are gathered to exclude individuals with specific condition(s); the identification of a condition of interest occurs using, for example, hospital records, questionnaire responses, disease diagnosis or family history of disease. Additionally, controls should have had the opportunity to develop the disease but did not, especially with respect to age (for late-onset outcomes), sex (for outcomes disproportionately affecting a specific sex)43 and location (such as for geographic region-specific outcomes). For example, in a genetic association study of heart disease, participants ascertained to be 'controls' at age 55 years may very well become 'cases' by age 65 years. However, in this instance it is understood that differences in the age of onset do not imply differences in genetic exposure but may result in a decrease in power to detect association. A more extreme example is in studying host genetic responses to infectious disease, where it is optimal for cases and controls to have similar exposure to the infectious agent61. Although ascertained controls are the ideal comparison group due to the designed absence of the condition of interest, they are also the least available, especially for uncommon conditions for which deliberate exclusion was not the priority.

Convenience sampling. Controls gathered through convenience sampling, such as from a biobank or a sample ascertained for a different disease, are much more widely available, but may contain individuals with the outcome of interest or relevant potential confounders. An interesting example of this is the UK Biobank, which contains healthier and wealthier individuals than the general UK population, and thus may differ in related features from case samples not drawn from the UK Biobank62,63. Depending on the ascertained phenotype of the convenience sample, both the cases and the controls from the original study may be suitable as common controls in a new study. The result of using a convenience sample ascertained for a different condition from the cases is often not straightforward, with effect sizes inflated or attenuated depending on allele frequency deviations63,64. As such, identifying a reasonable and unbiased model can be difficult.

Population controls. Population controls, namely individuals from the general population, are not ascertained with respect to any particular outcome; they are easier and less expensive to recruit compared with ascertained controls but will often contain identified or unidentified cases. The potential for an unidentified case to be misclassified as a control will depend on the prevalence of the condition. At one end of the spectrum, conditions of low prevalence will have a small proportion of unidentified cases in the control set, which is unlikely to greatly affect the results of the study. For example, the Marenne et al. study of severe childhood obesity used INTERVAL, a population-based cohort, as external common controls34 (Box 1). Although it is likely that the INTERVAL sample contained individuals who, when they were children, met the criteria to be cases, the number is likely to be quite small given an estimated prevalence of severe childhood-onset obesity of 0.15%65. As the prevalence of conditions increases towards common conditions (for example, prevalence >20%)66, the misclassification of cases as population controls will increasingly result in lower power and attenuated effect estimates67. In rare variant association studies, the reduction of power is likely to be more pronounced. Of course, the increase in control sample size could make up for the inefficiency, but not the bias of the effect estimates towards the null. To reduce bias and increase power, using a control sample with fewer unascertained cases or case-related traits is preferable, particularly for common conditions.
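The attenuation described above can be made concrete with a small calculation. The following Python sketch is illustrative only and is not taken from the studies cited here: it assumes a single variant, a hypothetical allele frequency of 1% in true non-cases, a hypothetical true allelic odds ratio of 2, and that the fraction of unidentified cases among population controls equals the condition prevalence. The function names are invented for this example.

```python
"""Sketch: attenuation of an allelic odds ratio when population controls
contain unidentified cases (illustrative assumptions, not a published method)."""


def case_allele_freq(control_freq: float, true_or: float) -> float:
    """Allele frequency in cases implied by the frequency in true non-cases
    and the true allelic odds ratio."""
    case_odds = true_or * control_freq / (1.0 - control_freq)
    return case_odds / (1.0 + case_odds)


def observed_odds_ratio(control_freq: float, true_or: float,
                        hidden_case_fraction: float) -> float:
    """Allelic odds ratio expected when a fraction of the 'controls'
    are in fact unidentified cases."""
    p_case = case_allele_freq(control_freq, true_or)
    # Contaminated control set: a mixture of true non-cases and hidden cases.
    p_mixed = ((1.0 - hidden_case_fraction) * control_freq
               + hidden_case_fraction * p_case)
    return (p_case / (1.0 - p_case)) / (p_mixed / (1.0 - p_mixed))


if __name__ == "__main__":
    # Hidden-case fractions: none, 0.15% (rare condition), 20% (common condition).
    for fraction in (0.0, 0.0015, 0.20):
        print(fraction, round(observed_odds_ratio(0.01, 2.0, fraction), 3))
```

Under these assumptions the expected odds ratio is essentially unchanged at a hidden-case fraction of 0.15% but is pulled noticeably towards 1 at 20%. Adding more contaminated controls narrows the confidence interval around the attenuated value; it does not remove the attenuation.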
Box 1 | Case studies for the exemplary use of common controls
Throughout this Review, we provide two case studies as examples of thoughtful and robust use of common controls. Of note, both are samples of European ancestries. As discussed comprehensively elsewhere48,120,160, broader representation of ancestries is needed in genetic studies including in the creation of common control data sets.
Marenne et al.34 used the INTERVAL study and gnomAD data (Table 1) as common controls for severe childhood-onset obesity project cases from the UK10K project. The UK10K project6 is an interesting example of a study where exome sequencing was performed for cases (that is, obesity, neurodevelopment and rare diseases) without internal controls, thus necessitating the use of external common controls34,35. The INTERVAL sample was used as the primary common control set and gnomAD was used as a second common control data set to complete partial replication of the top gene regions identified.
Sazonovs et al.36 used two case–control discovery samples and three case–control follow-up samples to identify rare variant genetic associations with Crohn's disease. External common control data from the Centers for Common Disease Genomics (CCDG)151 were identified to match the sequencing technology and computational pipeline of the discovery case samples, and INTERVAL83 and the UK Biobank85 were used as external common controls for two of the follow-up case samples.

Comparability of cases and common controls
Differences that are unaccounted for in sample characteristics (for example, demographics such as age and sex, genotyping or sequencing, genetic ancestries) between cases and controls can result in substantial confounding and bias. For example, the Exome Sequencing Project (ESP) used two sequencing centres and found that differences in reagents and local analysis pipelines resulted in unexpected batch effects68. An additional classic example is a study of exceptional longevity, which originally failed to fully correct for differences in genotyping platforms and found spurious associations55,69. Whereas many association methods can adjust for some differences, case attributes must be represented in the common control data to be able to use these methods. This is especially important for genetic ancestries.
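As a hedged illustration of why ancestry must be represented in both cases and controls, the Python sketch below builds expected allele counts for two hypothetical ancestry groups in which the variant has no effect on disease. All group sizes and allele frequencies are invented for the example. The Mantel–Haenszel estimator is used here only as a familiar stand-in for stratified adjustment; it is not the method of any study cited in this Review.

```python
"""Sketch: a spurious pooled association created by unmatched ancestry,
and its disappearance under stratified (Mantel-Haenszel) analysis.
All numbers are hypothetical."""


def stratum_table(allele_freq, n_cases, n_controls):
    """Expected allele counts (case_alt, case_ref, ctrl_alt, ctrl_ref) for one
    ancestry group in which cases and controls share the same frequency."""
    case_alt = 2 * n_cases * allele_freq
    case_ref = 2 * n_cases * (1 - allele_freq)
    ctrl_alt = 2 * n_controls * allele_freq
    ctrl_ref = 2 * n_controls * (1 - allele_freq)
    return case_alt, case_ref, ctrl_alt, ctrl_ref


def pooled_or(tables):
    """Allelic odds ratio after collapsing all ancestry groups together."""
    a, b, c, d = (sum(column) for column in zip(*tables))
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel odds ratio, which compares cases and controls
    within each ancestry group before combining."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Cases come mostly from group A (frequency 5%), controls mostly from
# group B (frequency 1%); within each group there is no association.
TABLES = [
    stratum_table(0.05, n_cases=900, n_controls=100),
    stratum_table(0.01, n_cases=100, n_controls=900),
]

if __name__ == "__main__":
    print("pooled OR:", round(pooled_or(TABLES), 2))
    print("stratified OR:", round(mantel_haenszel_or(TABLES), 2))
```

The stratified estimate is only computable because both groups contain some cases and some controls. If the common controls contained no individuals from group A, no adjustment could separate the ancestry difference from a disease effect.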


Table 1 | Resources for high-throughput sequencing data sets that could be used as external common controls

Resource^a | Size^b | Ancestries | Data^c | Permissions | Description

Individual-level data^d
1000 Genomes Project114 (AnVIL) | WGS/WES: 2,504 | 5 continental ancestries (African, East Asian, European, South Asian and admixed American) and 26 populations | – | None | Catalogue of human variation and genotype or sequencing data from self-identified 'healthy' individuals
CCDG151 (AnVIL) | WES: 198,831; WGS: 135,853 | ~25% each of the African, admixed American and European continental ancestries | – | Permissions obtained via dbGaP | Collection of case–control studies of cardiovascular, neuropsychiatric and immune-mediated diseases
Estonian Biobank81 | WES: 3,000; WGS: 2,500 | Estonian (83%), Russian (14%) and other (3%) | Phenotypes | Access form required | Longitudinal, population-based cohort study of subjects recruited by general practitioners and hospital physicians
H3Africa121 | WGS: 581; WES: 314 | African (50 ethnolinguistic groups from 13 African countries) | Phenotypes | Access form required | Population-based studies of common diseases in Africa, such as trypanosomiasis and paediatric HIV
HGDP115 | WGS: 929 (828 online) | 54 populations | – | None | WGS library of diverse indigenous populations
INTERVAL83 | WES: 4,502; WGS: 5,592 | ~91% white British individuals | Phenotypes | Permissions obtained via EGA | Study of blood donation frequency across England
SGDP152 | WGS: 300 | 142 populations (127 in publicly available data) | – | Access form required for 21 genomes | Deep genome sequences from smaller diverse populations
TOPMed84 (AnVIL, BioData Catalyst) | WGS: ~155,000 | 41% European, 31% African, 15% Hispanic/Latino, 9% Asian and 4% other/unknown | Phenotypes, 'omics | Permissions obtained via dbGaP | More than 80 parent studies (prospective, case–control, family and case-only) focusing on heart, lung, blood and sleep disorders
UK Biobank85 | WES: ~200,000; WGS: ~300,000 | ~84% white British individuals | Phenotypes | Registration, data application and fee required | Prospective, population-based cohort study collected from volunteers across the UK

Summary-level data
dbGaP ALFA153 | WES: 29,931; WGS: 25,478 | 12 populations | – | None | Allele frequencies for variants in dbGaP across approved unrestricted studies
FinnGen87 | WGS: 3,775 | Finnish | Phenotypes | Access form required | A large public–private partnership aiming to collect and analyse genome and health data from 500,000 Finnish biobank participants (10% of population)
gnomAD v2.1 (ref. 89) | WES: 125,748; WGS: 15,708 | 7 global and 7 subcontinental ancestries | – | None | Exome and genome sequencing data primarily from case–control studies of common adult-onset diseases
gnomAD v3.1 (ref. 89) | WGS: 76,156 | 9 ancestries | – | None | Genome sequencing data primarily from case–control studies of common adult-onset diseases
jMorp154 | WGS: 8,380 | Japanese | 'Omics | Access form required for genotype frequency panel | Japanese reference panel of two prospective cohort studies (a population-based adult cohort and a birth and three-generation cohort)
SISu v4.1 (ref. 155) | WES: 10,490 | Finnish | – | None | Finnish database of 13 cohorts with population controls and cases/controls from disease-specific studies
Taiwan Biobank88 | WGS: 1,517 | Han Chinese | Phenotypes | None | Taiwanese database of healthy controls from cohort and case–control studies of local diseases
TOPMed BRAVO84 | WGS: 132,345 | Multiple | – | Google login required | Variant browser for 705 million variants in TOPMed Freeze 8

Summary and individual-level data^d
All of Us103 (Researcher Workbench) | WGS: 98,640 | 49% European, 23% African, 15% Latino/Admixed American, 9% Other, 2% East Asian, 1% South Asian | Phenotypes | Individual: registration required through the All of Us Research Program | Effort to collect and study health data from a diverse and representative group of people in the USA


Table 1 (cont.) | Resources for high-throughput sequencing data sets that could be used as external common controls

Resource^a | Size^b | Ancestries | Data^c | Permissions | Description

Summary and individual-level data (cont.)^d
CSVS90 | WGS/WES: 2,027 (individual: 267) | Spanish | Phenotypes | Summary level: access form required; individual level: permissions obtained via EGA | Crowdsourcing repository of Spanish genetic variability
GenomeAsia 100K116 | WGS: 1,739 | 7 global regions (64 countries, 219 population groups) with ~80% Asian individuals | – | Individual level: access form required | WGS reference data set

Data repositories of individual-level data
dbGaP91 (AnVIL, BioData Catalyst) | WES: ~500 studies; WGS: ~300 studies | Multiple | Phenotypes | NIH DAC application required | Database of studies that investigate the interaction between genotypes and phenotypes in humans
EGA92 | WES: ~1,200 studies; WGS: ~1,050 studies | Multiple | Phenotypes | DAC application required | Archive of genetic and phenotypic data, mostly of various types of cancer

DAC, data access committee; dbGaP, database of Genotypes and Phenotypes; EGA, European Genome–Phenome Archive; NIH, National Institutes of Health; WES, whole-exome sequencing; WGS, whole-genome sequencing.
^a Table limited to resources that can be widely accessed by researchers globally. Future resources156–158 or other potentially useful data sets that require consortia membership or collaboration159 are not included. AnVIL currently contains genomic data from eight consortia and more than 280,000 participants, and BioData Catalyst from some 155,000 participants in TOPMed along with the parent studies. Resources currently available in AnVIL, BioData Catalyst or Researcher Workbench are noted.
^b The number of individuals with WGS and/or WES is provided unless otherwise noted.
^c Additional data provided including phenotypes or other 'omics data. All resources provide sex and ancestry information.
^d All individual-level data resources, except for GenomeAsia100K and the Collaborative Spanish Variability Server (CSVS), have FASTQ or BAM/CRAM files available for joint recalling. H3Africa and the UK Biobank also have gVCF files available.

For instance, although ancestry inference methods can be used to adjust for differences among African American admixed samples, there are limits to their applicability; an African American sample cannot be robustly compared with a sample with only European ancestries. A failure to adequately match and adjust for genetic ancestry can result in population stratification with an increase in both false positives and false negatives due to genetic differences unrelated to disease risk. Classic examples of spurious associations as a result of population stratification include height with European ancestry70, type 2 diabetes mellitus (T2DM) with Native American ancestry71,72 and asthma with Indigenous ancestry in Latinx groups72,73, among others70,72,74,75. Matching by fine-scale ancestry in addition to continental ancestry is especially important for rare variants, which are more likely to segregate on a geographically smaller scale76–78 (Fig. 3a–d). Similarly, although rare variant methods have been developed to use common control data sequenced and generated with different technology and computational pipelines (Table 2), matching as closely as possible will reduce bias and increase coverage of the genome (Fig. 3e,f). Both Marenne et al. and Sazonovs et al. limit common controls to those with European ancestries and match closely by sequencing technology (Box 1). Indeed, Sazonovs et al. identify and use several common control data sets, including INTERVAL, the Centers for Common Disease Genomics (CCDG) and the UK Biobank, to match the technology and sequencing centres for four Crohn's disease case sets.

Ascertained cases: Participants of a study who are recruited to have a known disease, outcome or condition of interest.
Ascertained controls: Participants of a study who are recruited to not have a known disease, outcome or condition of interest.
Convenience sample: A sample drawn from an easily accessible, but often not representative, cohort.
Population controls: A control group sampled from a population but possibly lacking information regarding the condition of interest, with the result that some of the population controls will likely have the condition of interest.
Admixed: A term to denote the mixture of genetic ancestries from two or more divergent groups.
Population stratification: The presence of subpopulations with differing allele frequencies in a study; a source of confounding if phenotypes also vary by subpopulation.

Curation of phenotypic data
Well-curated and standardized phenotypic information, which can enable optimal choice of common control data, removal of cases and adjustment or assessment of environmental factors, is key to supporting interoperability and usefulness of common control data. Electronic health record data, such as from biobanks, have standards for the translation of International Classification of Diseases (ICD) codes into standard codes, such as with the Observational Medical Outcomes Partnership (OMOP)79 and the ICD-9/ICD-10-compatible phecodes80. However, many data harmonization efforts are focused on one or a handful of conditions, whereas harmonization for common control data must be broadly focused to enable interoperability with various conditions. To enable broad utility of common control data, information relating to age, sex, ancestry and chronic conditions should be standard inclusions. In general, there are two main categories of common control data: summary level (such as allele frequencies) and individual level. For individual-level data, information should be included at the subject level, whereas descriptive statistics (for example, five-number summary, mean, standard deviation) can be provided for summary-level data. Importantly, reporting of chronic conditions enables cases or individuals with case-related traits to be removed from the common control data set.

A high degree of phenotypic metadata and study detail is essential to ensure widespread interoperability of common control data, not to mention an appreciation (for example, funding and incentives) for the efforts needed to tailor and maintain common control data sets for broad use (Box 2). Of note, the Estonian Biobank81, H3Africa82, INTERVAL83, TOPMed84 and the UK Biobank85 are examples of individual-level common control data sets that have well-curated phenotypic data to enable broad use as common controls. To enable easier identification of large-scale sequencing data sets with deep phenotype information, Gutierrez-Sacristan et al. provide a dynamic catalogue from which users can identify data with desired characteristics86. FinnGen87,


the Taiwan Biobank88, gnomAD89 and the Collaborative Spanish Variability Server (CSVS)90 are summary-level common control data sets that provide summary statistics or grouping of age, sex, ancestry and common conditions. For instance, gnomAD v2.1 provides a ‘control’ subset where individuals recruited as cases are excluded, and the CSVS enables filtering to provide allele frequency excluding selected conditions.

Accessibility and usability of common control data
Individual-level data enable better harmonization and analysis between cases and common controls. However, individual-level data often have more barriers to access, in order to maintain participant privacy according to existing consents, and require significantly more resources to use than summary-level data. Indeed, although individual genetic data are often accessible from a central database (for instance, the database of Genotypes and Phenotypes (dbGaP)91 or the European Genome–Phenome Archive (EGA)92), they can often be missing important metadata, be in a less processed state or be divided into sub-data sets that can be difficult to combine. These aspects, combined with harmonization and processing of individual-level data, necessitate a large amount of resources (for example, person-time, computing and data storage/transfer cost) for appropriate use. Numerous common control data sets are hosted on cloud computing environments to minimize these hurdles (see Infrastructure). For example, the UK Biobank Research Analysis Platform enables researchers to access data from a centralized location and although there are costs, they are tiered depending on computational and financial needs, with discounted rates for student researchers or those from low and middle-income countries93.

Conversely, summary-level data are often readily available for download, have undergone extensive quality control and processing, have few to no barriers to access and require fewer resources for storage, transfer and analysis. However, summary statistics, such as allele frequencies, can mask heterogeneity and stratification within and between samples, including population structure, sample recruitment and processing. These differences can cause severe batch effects between cases and external common controls, resulting in biased results. Additionally, adjusting for covariates is either more difficult or impossible. Although some methods, such as iECAT-O94 and ProxECAT95 (Table 2), have been developed to incorporate allele frequency data while reducing the potential for bias94–98, more thorough validation and replication of results are necessary than with individual-level data. There are proposed intermediate frameworks where case data are uploaded to a central location and allele frequencies from matched individual-level common control data by ancestry and other covariates are returned; however, these frameworks are not yet widely available99. The CSVS90 is an especially interesting example of how to collect and distribute sequencing data. The CSVS crowdsourcing initiative encourages genomic projects and consortia across Spain to submit whole-exome sequencing (WES) and whole-genome sequencing (WGS) data, resulting in a database of allele frequency information with the ability to interactively exclude chronic disease subgroups90.

Lastly, when choosing a common control data set, close attention should be paid to the consent type of each contributing study. Only data labelled as no restrictions (NRES), general research use (GRU) or health, medical biomedicine use (HMB) can be used as common controls for various phenotypes. (Studies with disease-specific consents can, of course, be used as common controls for the matching disease.) Because of this, studies that obtain broad consents are far more useful sources of common controls than those with narrower consents (Box 2).

False positives
Test results that are statistically significant even though there is no real association. By contrast, a false negative is a test result that is not statistically significant even though there is a real association.

[Figure 2 panels: Cases compared with a, Ascertained controls (without condition); b, Convenience sample; c, Population controls (20% prevalence); d, Population controls (1% prevalence). Legend: without condition, with condition, other condition.]

Fig. 2 | Types of control samples. Three regularly used types of controls include ascertained, convenience and population controls. a | Ascertained controls are specifically collected to exclude the condition and conditions related to case status. b | Relative ease of ascertainment or use is the defining factor of convenience samples. In human genetics, convenience samples may have been ascertained for another condition related to case status (blue people) and may also include unidentified cases (orange person). c,d | Population controls are a random sample of controls from the general population that often contain unidentified cases based on prevalence of the condition. Choosing which controls to use as common controls should be based on the study design and research question. For instance, population controls are appropriate for rare conditions (for example, prevalence <1%) (part d) but not for common conditions (for example, prevalence >20%) (part c), as the high proportion of unidentified cases for a common condition would affect the study results including power. Ascertained controls are ideal, but more difficult and expensive to collect compared with convenience sampling and population controls.
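The effect of unidentified cases among population controls (Fig. 2c,d) can be made concrete with a small calculation. The following is a minimal sketch, assuming a simple two-group mixture in which a fraction of the control sample equal to the condition prevalence consists of unidentified cases. The function names and the example allele frequencies are hypothetical and chosen only for illustration.

```python
def observed_control_frequency(p_case, p_noncase, prevalence):
    """Allele frequency seen in population controls when a fraction
    `prevalence` of them are unidentified cases (assumed mixture model)."""
    return (1.0 - prevalence) * p_noncase + prevalence * p_case


def allelic_odds_ratio(p_case, p_control):
    """Allelic odds ratio comparing case and control allele frequencies."""
    return (p_case / (1.0 - p_case)) / (p_control / (1.0 - p_control))


def observed_odds_ratio(p_case, p_noncase, prevalence):
    """Odds ratio that would be estimated using population controls
    that contain unidentified cases."""
    p_obs = observed_control_frequency(p_case, p_noncase, prevalence)
    return allelic_odds_ratio(p_case, p_obs)


if __name__ == "__main__":
    # Hypothetical allele frequencies for a rare risk allele.
    p_case, p_noncase = 0.02, 0.01
    print("true OR:", round(allelic_odds_ratio(p_case, p_noncase), 3))
    for prevalence in (0.01, 0.20):
        attenuated = observed_odds_ratio(p_case, p_noncase, prevalence)
        print(f"prevalence {prevalence:.0%}: observed OR {attenuated:.3f}")
```

Under these assumed frequencies, the observed odds ratio at 1% prevalence stays close to the true value, whereas at 20% prevalence it is noticeably attenuated towards 1, which is consistent with the loss of power described for common conditions.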


[Figure 3 panels: Ancestry (a–d), Coverage (e) and Processing (f). a, allele frequency (maf) map; b, principal component plot (PC1 versus PC2); c,d, regional plots with axes EUR, AFR and AME, with printed labels Mexico: 62.6–87.3%; Colombia: 55.8%; Peru: 68.0%; Trinidad and Tobago: 25.0%; Barbados: 10.9–20.9%; Puerto Rico: 36.9%; South Korea: 68.9%; Japan: 77.6–82.1%; China: 77.5–85.9%; Pakistan: 54.55%; India: 54.2–61.8%; Bangladesh: 65.28%. e, whole genome (high depth), whole genome (low depth) and whole exome (high depth), showing exons, introns and variants; f, pipelines A, B and C.]

Fig. 3 | Types of bias that could affect common control analyses. Differences between cases and controls not due to case status, such as differences in ancestry, coverage or processing, could result in confounding and lead to inaccurate conclusions. a | Allele frequencies (here for rs17578381) can differ greatly both between continental-level regions and within more fine-scale regions, requiring careful matching or adjustment of ancestry. b | Some of this matching can be conducted using principal components (PCs). However, additional attention must be made beyond the first PCs to ensure fine-scale substructure is accounted for. c | This substructure can occur within continental-level groupings or self-identified racial categories, such as within Asia and even within East Asia. d | Differences can also occur within a region due to admixture proportion differences between groups, whether two-way or three-way admixture. e | Coverage can differ in which part of the genome is sequenced (for example, genome versus exome) and in number of sequencing reads (for example, high depth or low depth). Type of coverage determines how many and which variants are detected. f | Processing computational pipelines can differ in number and type of steps such as the variant calling algorithm, which can lead to differences in the variants detected in the processed samples. To reduce ancestry, coverage or processing biases, cases and external common controls should be harmonized prior to analysis. AFR, Africa; AME, Americas; EUR, Europe.

Infrastructure
To ensure the broad, robust and equitable use of common controls, infrastructure must be secure, easy to use and widely accessible. In addition to traditional infrastructure, such as data storage, transfer and computing, infrastructure to support educational training to use common control resources is needed.

Common controls necessitate two primary areas of computational infrastructure: storage and maintenance of common control data sets in broadly accessible locations; and environments and workflows to bring together and analyse common control and case data (Fig. 4). Depending on data permissions, size and workflow, researchers often use a combination of local and cloud computing. Cloud computing is an increasingly widespread alternative to local computing, ranging from a predefined, limited workspace where a specific task is performed (for example, the TOPMed Imputation Server)84,100 to flexible user environments that store data and modifiable analysis pipelines, such as the National Human Genome Research Institute’s (NHGRI’s) AnVIL101, which includes data from the CCDG, the National Heart, Lung, and Blood Institute’s (NHLBI’s) BioData Catalyst102, which includes data from TOPMed, and the National Institutes of Health’s (NIH’s) Researcher Workbench103.

Fine-scale ancestry
Genetic differentiation at a regional level (such as subcontinental), as opposed to continental-level ancestry.

Metadata
A high-level description of a data set, often including details of the cohort and of data generation.


Table 2 | Case–control association methods or frameworks that incorporate external common controls

Method | Method type | Internal controls | Data type | Covariates
Individual-level data
Chen and Lin128,a | Single variant | No | Sequencing | Yes
iECAT Score test127 | Single variant | Yes | Array | Yes
Summary-level data
iECAT-O94,b | Optimal combination of burden and variance component | Yes | Array or sequencing | No
ProxECAT95,c | Burden | No | Sequencing | No
RV-EXCALIBER98 | Harmonization framework for burden test | No | Sequencing | No
TRAPD96 | Filtering/harmonization framework for burden test | No | Sequencing | No

a Requires variant depth and quality information. b Optimal in the presence of moderate to large single-variant confounding and requires the internal sample minor allele count to be greater than two. c Optimal for very rare variants and can use variants with a minor allele count greater than zero.

Local ancestry
The genetic ancestry of a particular chromosomal region on a haplotype level.

The advantage of the cloud in ‘bringing the user to the data’ can be especially useful for common control studies where large control data sets can be indexed and stored in a central location that is available to authorized users, avoiding redundancy as well as improving accessibility for groups lacking in-house computational resources101,102,104,105. Users can upload their own data to integrate and analyse with common control data using automated analysis workflows optimized for cloud environments and made available in cloud repositories106. An alternative workflow uses the cloud environment for selecting common control data and local computing for analysis.

Harmonization, quality control and association analysis
Following the selection of an appropriate common control data set, the next step is to implement and iteratively calibrate a harmonization, quality control and analysis pipeline (Fig. 4) to the specific case and common control data sets. Although iterative pipeline optimization is useful for any large-scale genomic analysis with batches, it is especially necessary when using external common controls. Without careful harmonization, analysis and quality control, systematic differences (Fig. 3) can result in bias and an increased rate of both false positives and false negatives.

Harmonization and quality control
To ensure that cases and controls are comparable with one another, they must be harmonized prior to analysis with respect to both sample-level and variant-level features. Case sample quality control can be performed before or during harmonization using well-established steps for genetic data64,107 and standard software such as PLINK108,109. The level of harmonization needed to reduce bias will depend on the extent of the differences in sequencing, such as scope (whole exome versus whole genome) or depth of coverage, processing and recruitment between cases and external common controls. If external common control data have been preprocessed, the entire quality control process should be well documented, including relevant variant-level and individual-level quality control metadata, so that users can assess and match the quality control performed (Box 2). Here, we detail considerations for both sample-level and variant-level quality control, with current best practices for reducing bias.

Sample-level harmonization and quality control. Identical quality control filters should be applied to common controls and cases to ensure consistency between data sets. This includes standard quality control procedures such as removing individuals with poor sequencing quality, a low proportion of individuals with a genotype call (that is, a low call rate) or high contamination, among other well-documented filters64,107. Covariates, and outcome definitions, where available, should be defined and harmonized to ensure comparability. In addition to choosing a common control data set that contains the genetic ancestries of the case data, special consideration must be made to closely match genetic ancestry at both a continental-specific and region-specific level to address fine-scale substructure (Fig. 3a–d).

For individual-level data, alignment of genome-wide (global) ancestry for cases and controls can be done with methods such as ADMIXTURE110, which estimates the proportions of genetic ancestry in each individual, or through projection methods111,112, such as principal component (PC) analysis113, with or without diverse ancestry reference data sets such as the 1000 Genomes Project114, the Human Genome Diversity Project (HGDP)115 and GenomeAsia116. As ADMIXTURE estimates the proportion of the genome from a specific discretized ancestry, it is not able to provide insights beyond this often continental-level classification, such as subcontinental or subregional fine-scale structure. Therefore, it may be ideal to closely match cases and controls using classification methods that leverage continuous measures of ancestry, such as the PCs that explain a substantial amount of variation. This strategy can also be used for admixed populations as long as the PCs used for matching are informative for ancestry proportions. One classic example of the importance of matching by ancestry was demonstrated in a study of asthma in Latino participants, in which participants needed to be matched on both a subcontinental level (Mexican and Puerto Rican) and also by ancestry proportions within groups to adequately control for population stratification and false positives73. For analyses that focus on a specific genomic region in admixed populations, local ancestry can be estimated using methods such as RFMix117 or Gnomix118. Estimation of ancestry in summary-level data is less established and often limited to incomplete matching of cases and common controls by reported race, ethnicity and/or genetic ancestry, which can result in removal of non-homogeneous groups from analysis or residual population stratification. More recently, Summix119 enables genetic ancestry estimation and adjustment of allele frequencies requiring only summary-level reference data, which is beginning to address this limitation.
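The idea of estimating ancestry from summary-level data alone can be illustrated with a two-group mixture. The sketch below is not the Summix implementation. It assumes only two reference groups, treats the observed allele frequencies as a weighted average of the reference allele frequencies and solves the resulting least-squares problem in closed form. All function names and allele frequencies are hypothetical.

```python
def estimate_mixture_proportion(observed, ref_a, ref_b):
    """Least-squares estimate of the proportion of reference group A in a
    two-way mixture, using only per-variant allele frequencies.

    Model (an assumption of this sketch):
        observed[i] ~ pi * ref_a[i] + (1 - pi) * ref_b[i]
    Minimizing the squared error over pi gives
        pi = sum((obs - b) * (a - b)) / sum((a - b)^2),
    which is then clipped to the interval [0, 1].
    """
    if not (len(observed) == len(ref_a) == len(ref_b)):
        raise ValueError("all allele frequency lists must have equal length")
    numerator = sum((o - b) * (a - b) for o, a, b in zip(observed, ref_a, ref_b))
    denominator = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    if denominator == 0.0:
        raise ValueError("reference groups have identical allele frequencies")
    return min(1.0, max(0.0, numerator / denominator))


if __name__ == "__main__":
    # Hypothetical reference allele frequencies at four variants.
    ref_a = [0.10, 0.50, 0.80, 0.30]
    ref_b = [0.40, 0.20, 0.10, 0.60]
    # Simulated summary-level data: 70% group A, 30% group B.
    observed = [0.7 * a + 0.3 * b for a, b in zip(ref_a, ref_b)]
    print(estimate_mixture_proportion(observed, ref_a, ref_b))
```

In practice many more variants and more than two reference groups would be needed, and sampling noise in the observed frequencies would make the estimate approximate rather than exact.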


Minor allele frequency
(MAF). For a genetic variant with two alleles, the frequency, in a specified population, of the less frequent allele.

Box 2 | Considerations for the generation of new common control resources
Creation of broadly useful common control data sets and infrastructure to support their use requires careful consideration. Here, we outline some guiding principles to consider when establishing new common control resources.
• Metadata. There are resource and time costs in working with a new data set. Detailed descriptions are valuable both in helping researchers decide whether to use a particular data set and when harmonizing the chosen data. Helpful information includes methods and procedures used in creating the data, sequencing technology, coverage, variant calling algorithm, recruitment details, case definition, inclusion criteria and ancestry description, among others.
• Variant-level quality control metrics. Variant-level quality control metrics enable consistent quality control decisions across case data sets prior to harmonization with the common control data set.
• Individual-level quality control metrics. Similar to variant-level quality control metrics, individual-level quality control metrics also enable consistent quality control decisions between the common control data set and case data sets. For summary data, where individual-level metrics are not available, summary-level information pertaining to individual filters and distributions of quality control metrics should be provided.
• Rich phenotype and covariate data. Incorporating available phenotypes and covariate information beyond just a few demographics can enable identification and removal of cases or related conditions from the common control data set as well as controlling for or evaluating the role of other conditions or environments. Age, sex, ancestry and known chronic conditions should be standard variables in common control data sets of individual-level data. For summary data, descriptive statistics of these variables should be provided and data should be grouped when possible (for example, by ancestry or condition) to reduce heterogeneity in allele frequencies.
• Broad consent and sharing standards. Studies should seek broad consent, such as no restrictions (NRES), general research use (GRU) or health, medical biomedicine use (HMB), to allow their data to be readily used as common controls. Consents that specify disease-specific research, while not impeding the original investigators, make the subsequent data of little use for common controls. Additional conditions, such as letters of collaboration, add to research friction, particularly when investigators need to combine several studies. Restrictions on who can use the data (for example, only not-for-profit organizations) also reduce the value of the resource161.
• FAIR principles. FAIR principles state that data should be Findable, Accessible, Interoperable and Reusable for both humans and machines162. These principles are central to the utility of common control data, which must be useful for other researchers and multiple case data sets.
• Intermediately processed data. When storage costs allow for it, inclusion of intermediately processed data, such as gVCF files105, will allow for more efficient joint recalling of cases and controls together, which can improve variant calls and harmonization.
• Representation. Common control data sets and the researchers working to create new data should be representative of the worldwide population. Such diversity helps ensure that challenges relating to a broad set of conditions, environments and ancestries are considered early on.
• Funding. Many of the recommendations listed above require additional time and resources, often above and beyond a study’s primary mandate. As such, funding and incentives are essential to support the creation and maintenance of high-quality common control data.

Using individual-level data, both Marenne et al. and Sazonovs et al. (Box 1) used PCs to assess concordance of genetic ancestry between cases and common controls34,36. Sazonovs et al. further used random forest on PCs generated from 1000 Genomes Phase 3 reference data to classify UK Biobank samples into broad genetic ancestry groups (Europe, Africa, South Asia, East Asia, admixed). Samples classified as European ancestry were retained.

Variant-level harmonization and quality control. Large differences in sequencing (or genotyping) technology and processing (Fig. 3e,f) require a greater degree of harmonization and likely the use of analysis methods developed explicitly for use with external common controls (Table 2). Genetic variants or regions with poor or differential sequencing quality between cases and controls64,96 identified using variant quality metrics, such as genotype quality, Hardy–Weinberg equilibrium (such as P_HWE < 10^−4)120, variant quality score log-odds, depth of coverage and others64, should be removed. Removal of genetic variants or regions can be performed with both individual-level and summary-level data if adequate variant quality control metrics are detailed.

A benefit of individual-level data is the possibility to improve variant calls through recalling cases and common controls together34,35. Both Marenne et al.34 and Sazonovs et al.36 performed individual-level intermediate variant calling to produce gVCF files and then performed joint variant calling across case and common control samples (Box 1). The feasibility of jointly calling cases and common controls together depends on the computational resources required and available. Larger sample sizes and whole-genome sequenced samples will require more resources, as will less processed files. For instance, BAM files require more computational resources for storage and processing compared with gVCF files. All individual-level data resources provided in Table 1, except for GenomeAsia100K116 and the CSVS90, have FASTQ or BAM/CRAM files available for joint recalling. H3Africa121 and the UK Biobank85 also have gVCF files available. For common, low-frequency and, increasingly, rare variants, variant calls can be improved further with imputation of individual-level data using resources such as the TOPMed Imputation Server84,100.

Filtering to rare alleles. Rare variant association tests use filters or weights by minor allele frequency (MAF). Although there are several appropriate filtering designs, it is essential that the same criteria be applied to both cases and controls64. For instance, filtering using only the common control data set, but not the case data set, will remove all variants above the MAF threshold in controls, but not in cases, resulting in spurious associations64. The ideal method for filtering to rare variants is to utilize an independent, well-matched (that is, by ancestry and sequence technology and processing) external data set. Importantly, some common control data sets contain data that researchers might use to filter. For instance, gnomAD contains 1000 Genomes Project114 and NHLBI ESP122 data. As such, ideally, neither the 1000 Genomes Project nor the ESP should be used for filtering when gnomAD is used as the common control data set. When a separate, well-matched, external data set is not available, filtering can be applied using both case and control data by keeping only variants that are rare in both data sets95. Marenne et al.34 used external data sets (1000 Genomes Project and UK10K cohort reference panel) as well as the case and common control data sets for filtering, whereas Sazonovs et al.36 used an external, ancestry-matched data set (gnomAD Non-Finnish European) (Box 1). Studies of rare variant associations are also commonly limited to variants with predicted functional effects as a way to improve power34,35, with multiple computational


prediction tools (for example, FAVOR123, CADD124, PolyPhen-2 (ref.125) and SIFT126) available that must be identically applied to cases and controls, as any differences will introduce bias.

[Figure 4 workflow, with example infrastructure alongside each stage. a, Pre-analysis: research question(s); internal case sample (processing pipeline); external common controls (individual versus summary, control type, case prevalence, ancestry, processing pipeline, subject consent); bring together. Example infrastructure: account set-up and billing; data access (e.g. AnVIL Dataset Catalog). b, Harmonization and analysis: harmonization (depth of coverage, genotype quality, genome build, variant annotation, MAF filtering); analysis (plot of −log10(P value) by chromosome); post-analysis quality control (QQ plots of −log10(observed) versus −log10(expected), estimates of polygenicity); assess, update and iterate. Example infrastructure: workflows, best practices, methods repository, interactive notebooks. c, Results: verification (in silico validation, e.g. depth, genotype quality; partial replication, e.g. allele frequencies in other common control data sets); interpretation (contextualize findings including limitations; consider generalizability to different ancestries, ethnicities and environments); reproducibility and transparency (open access, controlled access).]

Fig. 4 | Common control analysis workflow and example infrastructure with AnVIL. a | Case–control analysis begins by identifying a research question and associated condition of interest. Collection and processing of cases and potentially internal controls are then conducted. Processing steps include sequencing, variant calling, imputation and quality control. External common controls are chosen to match cases as closely as possible on potential confounders. Internal and external data are then brought together in a computing environment. If utilizing infrastructure such as AnVIL, accounts need to be created for the Terra cloud computing platform and Gen3 data platform. Terra Billing Project is created and linked to Google Billing Project or access to a specific billing project is requested. Resources such as AnVIL Dataset Catalog or Gen3’s Data Explorer can be used to search for common control data. Additionally, open (for example, 1000 Genomes), controlled (for example, database of Genotypes and Phenotypes (dbGaP)) or consortium access data can be requested and used. b | Cases and common controls should be harmonized prior to analysis to reduce bias. Terra library or Dockstore, which contains Broad Methods Repository and GATK Best Practices Toolkit, can be used to find workflows. Analysis method, either single-variant or region-based, can be implemented and post quality control performed. Jupyter Notebook can be created for interactive and collaborative analysis using Python3 or R/Bioconductor, or a cloud environment can be created for Galaxy. Harmonization, analysis and post-quality control steps should be iterated and updated until batch effects are no longer evident. c | Results are verified and contextualized within limitations of a common control study. Reproducibility and transparency are supported by making the code and harmonization pipeline publicly available (for example, on GitHub) and by citing methods and processing steps. Furthermore, data and results (for example, test statistics) should be provided to the research community through a publicly available portal such as the genome-wide association studies (GWAS) catalogue for open access and dbGaP or a consortia website for controlled access. For more detailed tutorial on AnVIL infrastructure, see https://anvilproject.org/learn. MAF, minor allele frequency; QQ, quantile–quantile. Logos reprinted with permission from AnVIL (https://anvilproject.org/data); Dockstore (https://cancercollaboratory.org/services-dockstore); GATK (https://gatk.broadinstitute.org/hc/en-us/categories/360002310591); Terra (Geraldine Van der Auwera at the Broad Institute); Broad Institute (https://www.broadinstitute.org/journalists/logos-graphics); dbGaP (https://www.ncbi.nlm.nih.gov/sra/docs/submitdbgap/); Bioconductor (https://www.bioconductor.org/about/logo/); GWAS Catalogue (https://www.ebi.ac.uk/gwas/). Jupyter is reprinted with permission from Jupyter (https://commons.wikimedia.org/wiki/File:Jupyter_logo.svg), CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).

In silico validation
Secondary quality control analysis of genotype calls, often of top association results, that passed the initial harmonization process to ensure that differences in processing do not drive important association signals.

Partial replication
Repeating association analysis reusing some data from the discovery analysis (for example, discovery cases and new external common controls).

Association methods
Association methods have been explicitly developed to robustly incorporate common controls while controlling for technology and processing differences between cases and controls (Table 2). The most computationally efficient methods incorporate summary-level common control data. Methods such as iECAT can increase power to detect rare variant genetic associations in an existing case–control study by augmenting with common controls94, whereas ProxECAT95, TRAPD96 and RV-EXCALIBER98 enable the use of common controls as the sole control set. Some common control data, such as gnomAD, contain both exome and genome data. These allele frequencies should be kept separate when using the preceding methods, otherwise the methods will not be able to adequately estimate and adjust for differences in sequencing technology and processing. Other methods incorporate individual common control data using only variant calls127, using variant calls with other data such as read depth128 or using different data, such as raw reads, directly129. These methods can help alleviate residual differences due to technology or computational pipelines especially when jointly recalling cases and controls is not feasible or does not fully alleviate bias. Importantly, these methods do not inherently correct for differences in ancestry and should be used along with ancestry adjustment and matching discussed above.

Post-analysis quality control
Post-analysis quality control helps identify any remaining systematic bias and issues in the harmonization and analysis pipeline, which can then inform a recalibration and improvement of the pipeline. After iteration of this process until there is no longer any evidence of batch effects, the study can progress to assessment and interpretation of results. If there are unremovable batch effects, however, a different common control data set may be needed. Starting with a subset of the data (for example, one chromosome) and performing soundness checks (for example, comparing distributions of quality control metrics between cases and controls) at each stage of the process enable facile assessment and updating of the pipeline. The most common technique in a genome-wide study to assess a pipeline’s effectiveness is to use quantile–quantile (QQ) plots to compare the observed distribution of test statistics with the expected distribution under the null hypothesis of no association. Systematic bias will manifest in a QQ plot with an inflated distribution of test statistics at the median (that is, λGC > 1)96,107. Although QQ plots can identify inflation of test statistics, they cannot determine whether the inflation is due to population structure or polygenicity130 using the single λGC value. For WES, assessing inflation in rare variant test statistics derived from synonymous variants can start to disentangle bias from signal, as we expect less association signal with synonymous variants96. For WGS, synonymous or a random subset of non-coding variants may be useful to identify residual bias. Additionally, the degree to which a genetic marker tags local variability as estimated using linkage disequilibrium score regression can help distinguish polygenicity from population structure effects131. Of note, careful attention to quality control, filtering and association analysis is especially important for candidate gene studies where bias cannot be assessed through genome-wide metrics, such as QQ plots.

Verification, follow-up and reproducibility of results
When common controls are used, validation and replication of results are at the same time more important and more difficult to implement than for traditional case–control studies. The necessity of common controls often follows from a scenario in which resources are scarce; thus, gathering another independent and sufficiently large replication sample of cases, let alone cases and controls, may be difficult. In such a scenario, in silico validation and partial replication can be performed. In silico validation includes an in-depth confirmation that top hits are not driven by subtle variant quality issues. Partial replication, using other external common control data sets or public databases, can be used to cross-reference genotype or allele frequencies in the original common control data set or to complete another association test with the discovery case sample. Although these strategies do not comprise a traditional external validation, they do provide additional support for or against results from the primary analyses.

Sazonovs et al. and Marenne et al. both performed replication using additional independent case and external common control data34,36 (Box 1). Sazonovs et al. used three follow-up case–control studies for replication including whole-genome sequenced INTERVAL83 and whole-exome sequenced UK Biobank85 external common controls. Marenne et al. performed replication of nine genes using targeted sequencing of independent


severe childhood-onset obesity cases and external common controls from the Fenland Study132. Additionally, Marenne et al. performed partial replication of the original severe childhood-onset obesity samples with gnomAD non-Finnish European controls using the association method ProxECAT95.

To ensure reproducibility of research with common controls, code used for harmonization, quality control and analysis should be publicly available, and common control data used should be clearly identified including version, access location and date accessed. Infrastructure such as Dockstore133 for workflows and GitHub134 for code are useful in supporting reproducibility.

Sharing of summary-level results with sufficient information for follow-up is especially necessary for common control studies. Standard recommendations exist for GWAS of common or low-frequency variants107,135 and are being developed for rare variant analyses47. In addition to these standards, information resulting from the harmonization process, such as allele counts, frequencies before and after harmonization, and study-specific allele frequencies when multiple control sets are used, should be provided for common control summary statistics. As summary statistics from analyses using common controls become more prevalent, the question of how to complete follow-up analyses from studies that utilize the same control set needs to be addressed. Methods such as the Bayesian Multiple Rare Variant and Phenotype framework harness correlation, scale and/or direction of genetic effects to assess summary statistics from a broad range of rare variant association study designs including multiple diseases and shared controls136.

Concluding perspectives
Foundational to the continued generation and use of common control data are the urgency to better represent global populations in the data sets, the researchers charged with generating new data sets and infrastructure, and users of the resources48,137,138. Deliberate attention to representation at every stage in the process ensures that challenges are addressed prospectively rather than as an afterthought or overlooked entirely139. Importantly, the use of common control data should not preclude the design and execution of large, high-quality studies of diverse genetic ancestries and environments. Indeed, such large, diverse data sets are urgently needed to increase representation of publicly available resources, which may also serve as common controls.

Any development and maintenance of infrastructure for the use of common controls, and data sharing in general, must prioritize accessibility in terms of cost and ease of use to close the resource gap, or else inequities regarding who is able to conduct high-quality research with even ‘publicly available’107,140 data will persist. Structured cloud computing and levelled cost structures, such as in the UK Biobank, can help address disparities in access.

The need for broad representation and equity of access for data resources does not necessarily mean that all data should be publicly available. There is a balance between open data and data sovereignty, especially for historically marginalized groups, cultures and countries141. Although broad consent from research participants is preferred for common control data, as it allows the data to be more widely used, consent must be obtained with respect, accompanied by authentic community engagement and trust building142. Efforts to ensure an equitable inclusion of and partnership with people across environments, ancestries and conditions must become the standard. As such, structures for funding, publishing and rewarding science must include expectations for representation, equity and community engagement.

The generation and maintenance of rich phenotypic data sets as well as ensuring representativeness require funding and incentives. Current funding focuses on the creation and analysis of individual data sets rather than on broad interoperability among shared data sets. Although there have been recent efforts by funding agencies143 and organizations such as the Global Alliance for Genomics and Health (GA4GH)144–147 to support phenotype harmonization and data sharing standards, such as common data elements and passports to enable authentication and access, a commitment to continued long-term funding of these and other resources is needed.

The robust use of common controls requires careful consideration and specialized training in how to access data and complete robust analyses. Training programmes currently exist for access and use of large-scale genetic data sets including the BioData Catalyst Fellowship Program148, AnVIL’s Massive Genome Informatics in the Cloud (MaGIC) Jamboree149 and the GA4GH starter kit150. Further integration of training into existing quantitative human genetics programmes through institutions and professional societies would be beneficial. Funding for time and travel is necessary to support equity in training, especially for researchers who depend on accessible public resources.

Genomic research requires an enormous amount of resources, which can exacerbate disparities in research and, ultimately, health outcomes for under-represented groups. With the increasing availability of genome sequencing and low-cost genotyping arrays, common control data sets are an invaluable resource to the wider research community in addressing these disparities. Although common controls are and will likely continue to be widely used in the future, there exist no comprehensive guidelines and summary of the current resources to support robust use of common controls. As such, many studies have used common controls incorrectly (for example, owing to improper filtering or mismatch of ancestry), and quality control and analysis modifications would have resulted in more robust studies. Importantly, the need for appropriate calibration of test statistics when using common control data sets is not eliminated by increasing sample size; although this approach leads to reduced variance of estimates and greater power for discoveries, it can also open the door to additional batch effects and other technical confounders. Thus, regardless of the size, leveraging external data requires careful consideration. Used rigorously, common controls hold the potential to expand the impact of human genetics research across a breadth of environments, conditions and populations.

Published online 17 May 2022
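
The point that a larger sample does not remove the need for calibration can be illustrated numerically. The following is a minimal sketch, not a method taken from any of the studies cited here. It assumes a single variant with the same true allele frequency p in cases and controls, a hypothetical processing artefact that shifts the observed control allele frequency by delta, and a simple two-sample z-test on allele frequencies with a pooled variance approximation. The function name expected_z and all parameter values are illustrative assumptions.

```python
import math


def expected_z(p, delta, n_cases, n_controls):
    """Approximate expected z-statistic when there is no true association
    but the observed control allele frequency is shifted by `delta`
    (hypothetical technical bias)."""
    # Each diploid individual contributes two alleles.
    variance = p * (1 - p) * (1 / (2 * n_cases) + 1 / (2 * n_controls))
    # Expected difference in observed frequencies (cases minus controls) is -delta.
    return -delta / math.sqrt(variance)


# Fixed number of cases, increasing number of external common controls.
for n_controls in (1_000, 10_000, 100_000):
    z = expected_z(p=0.01, delta=0.002, n_cases=1_000, n_controls=n_controls)
    print(f"controls={n_controls:>7}  expected z={z:.3f}")
```

Under these assumptions the expected statistic moves further from zero as controls are added, because the standard error shrinks while the shift delta stays fixed. This is one way to see why calibration checks remain necessary at any sample size.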

1. McGuire, A. L. et al. The road ahead in genetics and genomics. Nat. Rev. Genet. 21, 581–596 (2020).
    Perspective from a panel of leading genetics experts across the world describing the current state of the field and where genetics should go to ensure that the insights gained by modern genomic research will benefit all.
2. Rehm, H. L. et al. ClinGen — the clinical genome resource. N. Engl. J. Med. 372, 2235–2242 (2015).
3. Wang, Q. et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 597, 527–532 (2021).
4. Szustakowski, J. D. et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet. 53, 942–948 (2021).
5. Gibbs, R. A. The Human Genome Project changed everything. Nat. Rev. Genet. 21, 575–576 (2020).
6. UK10K Consortium et al. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015).
7. Minikel, E. V. et al. Evaluating drug targets through human loss-of-function genetic variation. Nature 581, 459–464 (2020).
8. Banka, S. et al. How genetically heterogeneous is Kabuki syndrome?: MLL2 testing in 116 patients, review and analyses of mutation and phenotypic spectrum. Eur. J. Hum. Genet. 20, 381–388 (2012).
9. Biesecker, L. G. Exome sequencing makes medical genomics a reality. Nat. Genet. 42, 13–14 (2010).
10. Ng, S. B. et al. Exome sequencing identifies the cause of a Mendelian disorder. Nat. Genet. 42, 30–35 (2010).
11. Akbari, P. et al. Sequencing of 640,000 exomes identifies GPR75 variants associated with protection from obesity. Science 373, eabf8683 (2021).
12. Flannick, J. et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature 570, 71–76 (2019).
13. Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021).
    Initial description of the data and potential provided by exomes for medical and genomic applications across the UK Biobank.
14. Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 100, 635–649 (2017).
15. Petrovski, S. & Goldstein, D. B. Unequal representation of genetic variation across ancestry groups creates healthcare inequality in the application of precision medicine. Genome Biol. 17, 157 (2016).
16. Manrai, A. K. et al. Genetic misdiagnoses and the potential for health disparities. N. Engl. J. Med. 375, 655–665 (2016).
17. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
    Foundational early genome-wide association study leveraging a common set of controls to enhance discovery possibility across seven diseases. The paper includes stringent QC now common to ensure homogeneity across a common control data set.
18. Corredor-Orlandelli, D. et al. Association between paraoxonase-1 p.Q192R polymorphism and coronary artery disease susceptibility in the Colombian population. Vasc. Health Risk Manag. 17, 689–699 (2021).
19. Tan, M. et al. Whole genome sequencing identifies rare germline variants enriched in cancer related genes in first degree relatives of familial pancreatic cancer patients. Clin. Genet. 100, 551–562 (2021).
20. Taroc, E. Z. M. et al. Gli3 regulates vomeronasal neurogenesis, olfactory ensheathing cell formation, and GnRH-1 neuronal migration. J. Neurosci. 40, 311–326 (2020).
21. Muskens, I. S. et al. Germline cancer predisposition variants and pediatric glioma: a population-based study in California. Neuro. Oncol. 22, 864–874 (2020).
22. Lorenzo-Salazar, J. M. et al. Novel idiopathic pulmonary fibrosis susceptibility variants revealed by deep sequencing. ERJ Open Res. 5, 00071 (2019).
23. Georges, A. et al. Rare loss-of-function mutations of PTGIR are enriched in fibromuscular dysplasia. Cardiovasc. Res. 117, 1154–1165 (2021).
24. Li, C. et al. Mutation analysis of DNAJC family for early-onset Parkinson’s disease in a Chinese cohort. Mov. Disord. 35, 2068–2076 (2020).
25. Hillman, P. et al. Identification of novel candidate risk genes for myelomeningocele within the glucose homeostasis/oxidative stress and folate/one-carbon metabolism networks. Mol. Genet. Genom. Med. 8, e1495 (2020).
26. Hebert, L. et al. Burden of rare deleterious variants in WNT signaling genes among 511 myelomeningocele patients. PLoS ONE 15, e0239083 (2020).
27. Yuan, J.-H. et al. Genomic analysis of 21 patients with corneal neuralgia after refractive surgery. Pain Rep. 5, e826 (2020).
28. Rojas, R. A. et al. Phenotypic continuum between Waardenburg syndrome and idiopathic hypogonadotropic hypogonadism in humans with SOX10 variants. Genet. Med. 23, 629–636 (2021).
29. Terradas, M. et al. TP53, a gene for colorectal cancer predisposition in the absence of Li–Fraumeni-associated phenotypes. Gut 70, 1139–1146 (2021).
30. Li, C. et al. Mutation analysis of LRP10 in a large Chinese familial Parkinson disease cohort. Neurobiol. Aging 99, 99.e1–99.e6 (2021).
31. Gunadi et al. Effect of semaphorin 3C gene variants in multifactorial Hirschsprung disease. J. Int. Med. Res. 49, 300060520987789 (2021).
32. Messina, A. et al. Neuron-derived neurotrophic factor is mutated in congenital hypogonadotropic hypogonadism. Am. J. Hum. Genet. 106, 58–70 (2020).
33. Trimarchi, M. et al. Gene expression analysis in patients with cocaine-induced midline destructive lesions. Medicina 57, 861 (2021).
34. Marenne, G. et al. Exome sequencing identifies genes and gene sets contributing to severe childhood obesity, linking PHIP variants to repressed POMC transcription. Cell Metab. 31, 1107–1119.e12 (2020).
35. Singh, T. et al. Rare loss-of-function variants in SETD1A are associated with schizophrenia and developmental disorders. Nat. Neurosci. 19, 571–577 (2016).
36. Sazonovs, A. et al. Sequencing of over 100,000 individuals identifies multiple genes and rare variants associated with Crohns disease susceptibility. Preprint at bioRxiv https://doi.org/10.1101/2021.06.15.21258641 (2021).
37. Malki, L. et al. Variant PADI3 in central centrifugal cicatricial alopecia. N. Engl. J. Med. 380, 833–841 (2019).
38. Ulirsch, J. C. et al. The genetic landscape of Diamond–Blackfan anemia. Am. J. Hum. Genet. 103, 930–947 (2018).
39. Hubert, J.-N. et al. The PI3K/mTOR pathway is targeted by rare germline variants in patients with both melanoma and renal cell carcinoma. Cancers 13, 2243 (2021).
40. Rashid, M. et al. ALPK1 hotspot mutation as a driver of human spiradenoma and spiradenocarcinoma. Nat. Commun. 10, 2213 (2019).
41. Belhadj, S. et al. Candidate genes for hereditary colorectal cancer: mutational screening and systematic review. Hum. Mutat. 41, 1563–1576 (2020).
42. Mosquera Orgueira, A. et al. Detection of rare germline variants in the genomes of patients with B-cell neoplasms. Cancers 13, 1340 (2021).
43. Li, C. et al. Targeted next generation sequencing of nine osteoporosis-related genes in the Wnt signaling pathway among Chinese postmenopausal women. Endocrine 68, 669–678 (2020).
44. Thorlund, K., Dron, L., Park, J. J. H. & Mills, E. J. Synthetic and external controls in clinical trials — a primer for researchers. Clin. Epidemiol. 12, 457–467 (2020).
45. Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature 538, 161–164 (2016).
46. Ben-Eghan, C. et al. Don’t ignore genetic data from minority populations. Nature 585, 184–186 (2020).
47. McMahon, A. et al. Sequencing-based genome-wide association studies reporting standards. Cell Genomics 1, 100005 (2021).
48. Gurdasani, D., Barroso, I., Zeggini, E. & Sandhu, M. S. Genomics of disease risk in globally diverse populations. Nat. Rev. Genet. 20, 520–535 (2019).
    This paper provides a summary of the current state of genomic diversity in research and how diversity is key to discovery and translation in genomics.
49. Zhang, Y. et al. The prevalence of vitiligo: a meta-analysis. PLoS ONE 11, e0163806 (2016).
50. Conway, M. et al. Analyzing the heterogeneity and complexity of electronic health record oriented phenotyping algorithms. AMIA Annu. Symp. Proc. 2011, 274–283 (2011).
51. Newton, K. M. et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J. Am. Med. Inform. Assoc. 20, e147–e154 (2013).
52. Shang, N. et al. Making work visible for electronic phenotype implementation: lessons learned from the eMERGE network. J. Biomed. Inform. 99, 103293 (2019).
53. Davis, K. A. S. et al. Indicators of mental disorders in UK Biobank — a comparison of approaches. Int. J. Methods Psychiatr. Res. 28, e1796 (2019).
54. Singh, T. et al. Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature 604, 509–516 (2022).
55. Ledford, H. Paper on genetics of longevity retracted. Nature https://doi.org/10.1038/news.2011.429 (2011).
56. Viering, D. H. H. M. et al. Genetics of renovascular hypertension in children. J. Hypertens. 38, 1964–1970 (2020).
57. Mazzarotto, F. et al. Reevaluating the genetic contribution of monogenic dilated cardiomyopathy. Circulation 141, 387–398 (2020).
58. Steel, D. et al. Loss-of-function variants in HOPS complex genes VPS16 and VPS41 cause early onset dystonia associated with lysosomal abnormalities. Ann. Neurol. 88, 867–877 (2020).
59. Johnson, J. O. et al. Association of variants in the SPTLC1 gene with juvenile amyotrophic lateral sclerosis. JAMA Neurol. 78, 1236–1248 (2021).
60. Gallego-Martinez, A., Requena, T., Roman-Naranjo, P., May, P. & Lopez-Escamez, J. A. Enrichment of damaging missense variants in genes related with axonal guidance signalling in sporadic Meniere’s disease. J. Med. Genet. 57, 82–88 (2020).
61. Kwok, A. J., Mentzer, A. & Knight, J. C. Host genetics and infectious disease: new tools, insights and translational opportunities. Nat. Rev. Genet. 22, 137–153 (2021).
62. Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).
63. Wright, C. F. et al. Assessing the pathogenicity, penetrance, and expressivity of putative disease-causing variants in a population setting. Am. J. Hum. Genet. 104, 275 (2019).
64. Povysil, G. et al. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat. Rev. Genet. 20, 747–759 (2019).
    Review describing rare variant aggregation testing, a common method for association in sequencing studies. Beyond describing techniques, the review covers specific filtering and quality control needed to ensure appropriate statistical calibration.
65. Riveros-McKay, F. et al. Genetic architecture of human thinness compared to severe obesity. PLoS Genet. 15, e1007603 (2019).
66. Moskvina, V., Holmans, P., Schmidt, K. M. & Craddock, N. Design of case–controls studies with unscreened controls. Ann. Hum. Genet. 69, 566–576 (2005).
67. Sham, P. C. & Purcell, S. M. Statistical power and significance testing in large-scale genetic studies. Nat. Rev. Genet. 15, 335–346 (2014).
68. Auer, P. L. et al. Guidelines for large-scale sequence-based complex trait association studies: lessons learned from the NHLBI Exome Sequencing Project. Am. J. Hum. Genet. 99, 791–801 (2016).
69. Alberts, B. Editorial expression of concern. Science 330, 912 (2010).
70. Campbell, C. D. et al. Demonstrating stratification in a European American population. Nat. Genet. 37, 868–872 (2005).
71. Knowler, W. C., Williams, R. C., Pettitt, D. J. & Steinberg, A. G. Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am. J. Hum. Genet. 43, 520–526 (1988).
72. Hellwege, J. N. et al. Population stratification in genetic association studies. Curr. Protoc. Hum. Genet. 95, 1.22.1–1.22.23 (2017).
73. Choudhry, S. et al. Population stratification confounds genetic association studies among Latinos. Hum. Genet. 118, 652–664 (2006).
74. Helgason, A., Yngvadóttir, B., Hrafnkelsson, B., Gulcher, J. & Stefánsson, K. An Icelandic example of the impact of population structure on association studies. Nat. Genet. 37, 90–95 (2005).

75. Panarella, M. & Burkett, K. M. A cautionary note on the effects of population stratification under an extreme phenotype sampling design. Front. Genet. 10, 398 (2019).
76. Gravel, S. et al. Demographic history and rare allele sharing among human populations. Proc. Natl Acad. Sci. USA 108, 11983–11988 (2011).
77. Mathieson, I. & McVean, G. Differential confounding of rare and common variants in spatially structured populations. Nat. Genet. 44, 243–246 (2012).
78. O’Connor, T. D. et al. Fine-scale patterns of population stratification confound rare variant association tests. PLoS ONE 8, e65834 (2013).
79. Klann, J. G., Joss, M. A. H., Embree, K. & Murphy, S. N. Data model harmonization for the All Of Us Research Program: transforming i2b2 data into the OMOP common data model. PLoS ONE 14, e0212463 (2019).
80. Wei, W.-Q. et al. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLoS ONE 12, e0175508 (2017).
81. Leitsalu, L. et al. Cohort profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int. J. Epidemiol. 44, 1137–1147 (2015).
82. Choudhury, A. et al. Author correction: High-depth African genomes inform human migration and health. Nature 592, E26 (2021).
83. Di Angelantonio, E. et al. Efficiency and safety of varying the frequency of whole blood donation (INTERVAL): a randomised trial of 45 000 donors. Lancet 390, 2360–2371 (2017).
84. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
85. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
86. Gutierrez-Sacristan, A. et al. GenoPheno: cataloging large-scale phenotypic and next-generation sequencing data within human datasets. Brief Bioinform. 22, 55–65 (2021).
87. FinnGen. FinnGen documentation of R5 release. FinnGen https://finngen.gitbook.io/documentation/ (2021).
88. Wei, C.-Y. et al. Genetic profiles of 103,106 individuals in the Taiwan Biobank provide insights into the health and history of Han Chinese. NPJ Genom. Med. 6, 10 (2021).
89. Karczewski, K. J., Francioli, L. C. & MacArthur, D. G. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
90. Peña-Chilet, M. et al. CSVS, a crowdsourcing database of the Spanish population genetic variability. Nucleic Acids Res. 49, D1130–D1137 (2021).
91. Mailman, M. D. et al. The NCBI dbGaP Database of Genotypes and Phenotypes. Nat. Genet. 39, 1181–1186 (2007).
92. Lappalainen, I. et al. The European Genome–Phenome Archive of human data consented for biomedical research. Nat. Genet. 47, 692–695 (2015).
93. UK Biobank. New costs for 2021. UK Biobank https://www.ukbiobank.ac.uk/enable-your-research/costs (2021).
94. Lee, S., Kim, S. & Fuchsberger, C. Improving power for rare-variant tests by integrating external controls. Genet. Epidemiol. 41, 610–619 (2017).
95. Hendricks, A. E. et al. ProxECAT: Proxy External Controls Association Test. A new case–control gene region association test using allele frequencies from public controls. PLoS Genet. 14, e1007591 (2018).
96. Guo, M. H., Plummer, L., Chan, Y.-M., Hirschhorn, J. N. & Lippincott, M. F. Burden testing of rare variants identified through exome sequencing via publicly available control data. Am. J. Hum. Genet. 103, 522–534 (2018).
97. Jiang, L. et al. Deviation from baseline mutation burden provides powerful and robust rare-variants association test for complex diseases. Nucleic Acids Res. 50, e34 (2022).
98. Lali, R. et al. Calibrated rare variant genetic risk scores for complex disease prediction using large exome sequence repositories. Nat. Commun. 12, 5852 (2021).
99. Bodea, C. A. et al. A method to exploit the structure of genetic ancestry space to enhance case–control studies. Am. J. Hum. Genet. 98, 857–868 (2016).
100. Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
101. Schatz, M. C. et al. Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genom. 2, 100085 (2022).
102. National Heart, Lung, and Blood Institute, National Institutes of Health, US Department of Health and Human Services. The NHLBI BioData catalyst. Zenodo https://doi.org/10.5281/zenodo.3822858 (2020).
103. All of Us Research Program Investigators et al. The “All of Us” Research Program. N. Engl. J. Med. 381, 668–676 (2019).
104. Langmead, B. & Nellore, A. Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet. 19, 208–219 (2018).
    This paper reviews how the current and future state of cloud computing will be fundamental for large-scale genomics research including for collaboration and reproducibility.
105. Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
106. Yuen, D. et al. The Dockstore: enhancing a community platform for sharing reproducible and accessible computational protocols. Nucleic Acids Res. 49, W624–W632 (2021).
107. Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Primers 1, 60 (2021).
108. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
109. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
110. Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011).
111. Reich, D., Price, A. L. & Patterson, N. Principal component analysis of genetic data. Nat. Genet. 40, 491–492 (2008).
112. Wang, C., Zhan, X., Liang, L., Abecasis, G. R. & Lin, X. Improved ancestry estimation for both genotyping and sequencing data using projection procrustes analysis and genotype imputation. Am. J. Hum. Genet. 96, 926–937 (2015).
113. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
114. 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
115. Bergström, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).
116. GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
117. Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288 (2013).
118. Hilmarsson, H. et al. High resolution ancestry deconvolution for next generation genomic data. Preprint at bioRxiv https://doi.org/10.1101/2021.09.19.460980 (2021).
119. Arriaga-MacKenzie, I. S. et al. Summix: a method for detecting and adjusting for population structure in genetic summary data. Am. J. Hum. Genet. 108, 1270–1282 (2021).
120. Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).
    A large, multi-ethnic, multi-trait genome-wide association study paper from the Population Architecture using Genomics and Epidemiology (PAGE) study describing best practices for handling heterogeneous population data, including imputation, filtering and QC steps. The paper also describes the critical importance of genomic diversity in genetic association studies.
121. Choudhury, A. et al. High-depth African genomes inform human migration and health. Nature 586, 741–748 (2020).
122. Exome Variant Server. NHLBI Exome Sequencing Project (ESP). EVS http://evs.gs.washington.edu/EVS/ (2013).
123. Li, X. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 52, 969–983 (2020).
124. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
125. Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
126. Sim, N.-L. et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40, W452–W457 (2012).
127. Li, Y. & Lee, S. Novel score test to increase power in association test by integrating external controls. Genet. Epidemiol. 45, 293–304 (2021).
128. Chen, S. & Lin, X. Analysis in case–control sequencing association studies with different sequencing depths. Biostatistics 21, 577–593 (2020).
129. Hu, Y.-J., Liao, P., Johnston, H. R., Allen, A. S. & Satten, G. A. Testing rare-variant association without calling genotypes allows for systematic differences in sequencing between cases and controls. PLoS Genet. 12, e1006040 (2016).
130. Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
131. Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
132. Clifton, E. A. D. et al. Associations between body mass index-related genetic variants and adult body composition: the Fenland cohort study. Int. J. Obes. 41, 613–619 (2017).
133. O’Connor, B. D. et al. The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows. F1000Res. 6, 52 (2017).
134. Perkel, J. Democratic databases: science on GitHub. Nature 538, 127–128 (2016).
135. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
136. Venkataraman G.R. et al. Bayesian model comparison for rare-variant association studies. Am. J. Hum. Genet. 108, 2354–2367 (2021).
137. Thomas, S. P. et al. Cultivating diversity as an ethos with an anti-racism approach in the scientific enterprise. HGG Adv. 108, 100052 (2021).
138. Bonham, V. L. & Green, E. D. The genomics workforce must become more diverse: a strategic imperative. Am. J. Hum. Genet. 108, 3–7 (2021).
139. Bentley, A. R., Callier, S. L. & Rotimi, C. N. Evaluating the promise of inclusion of African ancestry populations in genomics. NPJ Genom. Med. 5, 5 (2020).
140. Bezuidenhout, L. & Chakauya, E. Hidden concerns of sharing research data by low/middle-income country scientists. Glob. Bioeth. 29, 39–54 (2018).
141. Tsosie, K. S., Yracheta, J. M. & Dickenson, D. Overvaluing individual consent ignores risks to tribal participants. Nat. Rev. Genet. 20, 497–498 (2019).
142. Tindana, P. & de Vries, J. Broad consent for genomic research and biobanking: perspectives from low- and middle-income countries. Annu. Rev. Genomics Hum. Genet. 17, 375–393 (2016).
    A review outlining the key elements to promote global health and equity when completing genomic research, such as through biobanks.
143. National Human Genome Research Institute. NOT-HG-21-022: notice announcing the National Human Genome Research Institute’s expectation for sharing quality metadata and phenotypic data. NIH https://grants.nih.gov/grants/guide/notice-files/NOT-HG-21-022.html (2021).
144. Fiume, M. et al. Federated discovery and sharing of genomic data using Beacons. Nat. Biotechnol. 37, 220–224 (2019).
145. Thorogood, A. et al. International federation of genomic medicine databases using GA4GH standards. Cell Genomics 1, 100032 (2021).
146. Rehm, H. L. et al. GA4GH: international policies and standards for data sharing across genomic research and healthcare. Cell Genom. 1, 100029 (2021).
147. Lawson, J. et al. The Data Use Ontology to streamline responsible access to human biomedical datasets. Cell Genom. 1, 100028 (2021).
148. National Heart, Lung, and Blood Institute. Catalyst Fellows Program. NHLBI https://biodatacatalyst.nhlbi.nih.gov/fellows/program/ (2021).
149. National Human Genome Research Institute. Massive Genome Informatics in the Cloud (MaGIC) Jamboree.
AnVIL https://anvilproject.org/events/magic2020 (2020).
150. Global Alliance for Genomics and Health. GA4GH starter kit. GA4GH https://starterkit.ga4gh.org/ (2021).
151. Abel, H. J. et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature 583, 83–89 (2020).
152. Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
153. Phan, L. et al. ALFA: Allele Frequency Aggregator. NCBI https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/ (2020).
154. Tadaka, S. et al. jMorp updates in 2020: large enhancement of multi-omics data resources on the general Japanese population. Nucleic Acids Res. 49, D536–D544 (2021).
155. Sequencing Initiative Suomi Project. Sequencing Initiative Suomi. SISu http://sisuproject.fi (2021).
156. Wam. Dubai to map genome of all its residents. Khaleej Times https://www.khaleejtimes.com/uae/dubai-to-map-genome-of-all-its-residents (2018).
157. Geis, C. A Chinese province is sequencing one million of its residents’ genomes. Futurism https://futurism.com/

sharing that accelerates research. Nat. Rev. Genet. 21, 615–629 (2020).
162. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
    This foundational manuscript is the first to present the FAIR principles (that is, findable, accessible, interoperable and reusable) for data sharing.

Acknowledgements
This work was supported by the Genome Sequencing Program (R35HG011293 to A.E.H. and C.R.G.; U01HG009080 to A.E.H., A.G.I., C.R.G. and M.A.R.; and U24HG008956 to S.B.). The Genome Sequencing Program is funded by the National Institute of Health (NIH) National Human Genome Research Institute (NHGRI), the National Heart, Lung, and Blood Institute (NHLBI) and the National Eye Institute (NEI). G.L.W. received support for this work from NHGRI (R35HG011944).

Author contributions
G.L.W., J.M., J.L.E. and A.E.H. researched the literature. G.L.W., J.M., A.G.I., S.B. and A.E.H. provided substantial contributions to discussion of the content. G.L.W., J.M., S.B. and A.E.H. wrote the article. All authors reviewed and/or edited

Related links
1000 Genomes Project: https://www.internationalgenome.org
All of Us: https://www.researchallofus.org/
AnVIL: https://anvilproject.org
BioData Catalyst: https://biodatacatalyst.nhlbi.nih.gov
CCDG: https://ccdg.rutgers.edu
CSVS: http://csvs.babelomics.org/
dbGaP: https://www.ncbi.nlm.nih.gov/gap/
dbGaP ALFA: https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/
EGA: https://ega-archive.org
Estonian Biobank: https://genomics.ut.ee/en/content/estonian-biobank
FinnGen: https://finngen.gitbook.io/documentation/data-download
GenomeAsia 100K: https://browser.genomeasia100k.org
gnomAD v.2.1: https://gnomad.broadinstitute.org/downloads
gnomAD v.3.1: https://gnomad.broadinstitute.org/downloads
H3Africa: https://catalog.h3africa.org
HGDP: https://www.internationalgenome.org/data-portal/data-collection/hgdp
INTERVAL: https://www.intervalstudy.org.uk
neoscope/chinese-​province-sequencing-1-million- the manuscript before submission. jMorp: https://jmorp.megabank.tohoku.ac.jp/202109/
residents-genomes (2017). downloads/
158. Health RI. European ‘1+Million Genomes’ Competing interests Researcher workbench: https://www.researchallofus.org/
initiative (1+MG). Health RI https://www.health-​ri.nl/ C.R.G. owns stock in 23and Me. M.A.R. is a scientific founder data-​tools/workbench/
initiatives/european-1million-​genomes-initiative-1mg of Broadwing Bio, a consultant for MazeTx, and is currently sGDP: https://cloud.google.com/life-sciences/docs/
(2020). on leave at HiBio. The other authors declare no competing resources/public-datasets/simons
159. Gaziano, J. M. et al. Million Veteran Program: interests. sisu v4.1: https://sisuproject.fi
a mega-biobank to study genetic influences on Taiwan Biobank: https://taiwanview.twbiobank.org.tw/
health and disease. J. Clin. Epidemiol. 70, 214–223 Peer review information browse38
(2016). Nature Reviews Genetics thanks the anonymous reviewer(s) TOPMed: https://topmed.nhlbi.nih.gov
160. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The for their contribution to the peer review of this work. TOPMed Bravo: https://bravo.sph.umich.edu/freeze8/hg38/
missing diversity in human genetic studies. Cell 177, UK Biobank: https://biobank.ctsu.ox.ac.uk/crystal/label.
1080 (2019). Publisher’s note cgi?id=263
161. Byrd, J. B., Greene, A. C., Prasad, D. V., Jiang, X. Springer Nature remains neutral with regard to jurisdictional
& Greene, C. S. Responsible, practical genomic data claims in published maps and institutional affiliations. © Springer Nature Limited 2022
