ANALYSIS

Human gene essentiality


István Bartha1, Julia di Iulio1, J. Craig Venter1,2 and Amalio Telenti1,2
Abstract | A gene can be defined as essential when loss of its function compromises viability of the
individual (for example, embryonic lethality) or results in profound loss of fitness. At the
population level, identification of essential genes is accomplished by observing intolerance to
loss‑of‑function variants. Several computational methods are available to score gene essentiality,
and recent progress has been made in defining essentiality in the non-coding genome.
Haploinsufficiency is emerging as a critical aspect of gene essentiality: approximately 3,000
human genes cannot tolerate loss of one of the two alleles. Genes identified as essential in human
cell lines or knockout mice may be distinct from those in living humans. Reconciling these
discrepancies in how we evaluate gene essentiality has applications in clinical genetics and may
offer insights for drug development.

1 Human Longevity Inc., San Diego, California 92121, USA.
2 J. Craig Venter Institute, Capricorn Lane, La Jolla, California 92037, USA.
Correspondence to J.C.V. and A.T.
jcventer@jcvi.org; atelenti@jcvi.org
doi:10.1038/nrg.2017.75
Published online 30 Oct 2017

Gene essentiality emerged as a concept associated with the notion of the minimal genome1–3. Systematic deletions of one or more genes in prokaryotic or simple eukaryotic experimental models have been generated, scored for viability and fitness, and assessed for mechanisms of robustness, redundancy and evolvability4. Progress in technologies for editing and silencing genes extended the work to more complex models, such as worms and mice. These studies prompted the identification of essential components that are shared across different forms of life5.

Human gene essentiality was first associated with the study of Mendelian diseases, which generally reflect the consequences of severe genetic lesions on human fitness. More recently, CRISPR–Cas9 genome-wide screens have targeted every human gene in tumour cell lines. These assays measure the consequences of gene disruption in cell viability assays. However, this information does not necessarily translate to gene essentiality in vivo, which may be assessed through the collection of genome sequencing data at the population level. Therefore, this Analysis article has a strong in vivo organismal focus supported by advances in genome sequencing technologies. In large-scale human genome sequencing projects, essential genes will be those that are rarely or never disrupted or truncated in the general population.

Here, we review new metrics applied to large exome and genome sequence data sets for the purpose of identifying essential genes. Insights from these metrics reveal the extent of predicted protein truncation and other loss-of-function variants in the human population. Importantly, most human loss-of-function variants are rare. Thus, although currently available data are too sparse for identifying combinations of rare alleles, there is considerable statistical power when analysing them in a heterozygous state. Rare loss-of-function variants may exert their influence through a mechanism of haploinsufficiency. Haploinsufficiency is defined as a dominant phenotype in diploid organisms that are heterozygous for a loss-of-function allele6. In the case of putative protein-truncating variants such as stop-gain variants, expression data convincingly demonstrate that loss of one allele is not remediated by dosage compensation by the intact allele7–12 and thus can result in haploinsufficiency. Human disease studies have identified several hundred haploinsufficient genes13. At the same time, sequencing projects with large populations have also identified common loss-of-function variants occurring in a homozygous null state in individuals who appear healthy. Thus, it is becoming increasingly possible to define both an ‘essential genome’ (which includes genes that do not tolerate loss of function) and a ‘dispensable genome’ (which includes genes that can be observed with biallelic inactivation in the general population). This paper reviews the phenotypic consequences of both rare (heterozygous) and common (homozygous) loss-of-function variants.

We also present similarities and differences among the genetic requirements for a cell to be viable, for viable progeny in the mouse knockout models, and for a human to reach adult life. Thus, a goal of this article is to analyse and make sense of these disparate data sets obtained using conceptually and technically different approaches. As a corollary, this article presents the extension of the concept of essentiality to the non-coding genome. The identification of the most essential elements of the genome in the human population provides new insights into human diversity and also has practical medical applications for disease genetics and the clinical interpretation of disease-associated genetic variants.

Minimal genome
A genome limited to the essential genes for life.

Robustness
The ability of a biological system to keep its behaviour unchanged under perturbation.

Redundancy
The possibility of having a function encoded by more than one gene.

Evolvability
The degree to which an organism can generate adaptive solutions to future environments through heritable phenotypic variation.

Exome
The subset of the genome that is part of mature RNAs and translated into proteins.

NATURE REVIEWS | GENETICS ADVANCE ONLINE PUBLICATION | 1


© 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

Metrics of gene essentiality
There are now enough sequenced human exomes and genomes to saturate certain functional elements (for example, exons) or mutational classes (for example, protein-truncating variants)14–16. Genomic regions or genes that tolerate variation may carry a high number of variants, whereas genes that are intolerant to variation will show a relative depletion. Individuals with loss-of-function variants in essential genes may not be represented in healthy adult cohorts.

Various scores are proposed to quantify the tolerance of a gene to loss-of-function variants using in vivo population-level data of human genetic variation (TABLE 1). We compare the scores and discuss their differences while providing guidance for their practical usage. It should be highlighted that the stated goal of some of the metrics is the identification of damaging variants and is not explicitly the characterization of gene essentiality. The available scores differ in the assumptions, implementation and underlying data used in building the essentiality metric. An outstanding challenge in all essentiality metrics is in the calibration of an appropriate baseline expectation for the number of variants in a gene, which will depend on the length of the coding region, the local nucleotide context and mutation rates. Due to the more complete understanding of the coding genome and the better availability of human exome sequencing data, most of the scores are focused on missense or protein-truncating variants, including frameshift variants, early stop gains and splice-site variants. Synonymous variants are generally considered biologically silent and thus provide a natural way to estimate the amount of neutral variation that is expected in each population for a given genomic region. However, it should be kept in mind that synonymous variants in the protein-coding region might be under slight evolutionary constraint17. Synonymous variants may affect the function of the translated protein through diverse cellular mechanisms, including control of transcription rate, RNA structure and protein folding efficiency18.

The main principle of all scores is to rank genes by the footprint of negative selection acting on the genes against protein-truncating variation. Petrovski’s ‘residual variation intolerance score’ (RVIS)19 and Rackham’s EvoTol20 relate the amount of common loss-of-function variation to that of the total gene variation. Other scores are based on the original work of Samocha et al. (Missense Z-score)21, which sets up a baseline expectation of mutation count per gene based on the sequence context, local mutation rate, sequencing depth and, most importantly, sample size. Fadista’s LoFtool22 combines the neutral mutation rate of Samocha et al.21 and the evolutionary information in EvoTol20. The baseline neutral expectation is compared with the observed counts of loss-of-function variants in the Missense Z-score21, in Bartha’s probability of haploinsufficiency (Phi)23 and in Lek’s probability of loss-of-function intolerance (pLI)15. Finally, recent work by Cassa et al.24 describes a metric (s_het) that provides Bayesian estimates of the selection coefficient against heterozygous loss-of-function variation. The various scores were developed or updated using the Exome Aggregation Consortium (ExAC) sample of 60,706 human exomes15. Therefore, the use of these metrics on the same data set allows the assessment of agreement. Indeed, these scores are highly correlated with one another (FIG. 1). The highest correlation arises between the two scores (Phi and pLI) that use Poisson mixture models in their approaches (Supplementary information S1 (box)). The key differential characteristics of the scores are presented in TABLE 1. All scores’ URLs and values are included in Supplementary information S2 (table).

The performance of the scores is assessed by predicting variants causing Mendelian disorders (TABLE 1) and by demonstrating their use for variant prioritization in the clinical genetic setting. It is, however, difficult to draw a uniform picture from the original publications, as each tool was benchmarked against different truth sets and trained on different genetic data. For example, s_het is capable of predicting the inheritance mode of a loss-of-function variant with >85% accuracy and sensitivity, and most methods have an area under the receiver operating characteristic curve (ROC curve) of 70–90% against Online Mendelian Inheritance in Man (OMIM) haploinsufficient genes. No matter how they are implemented, all these methods are inherently dependent on the gene length, as it is difficult to assert the depletion of observed variants against an already low expectation of the number of variants in a short gene. Short gene length will lead to false-negative calls (that is, genes erroneously being called as non-essential), but it will not lead to false-positive calls of haploinsufficiency.

Another important consideration is that these tools estimate essentiality as a continuum of values but are frequently used as dichotomized scores. TABLE 1 indicates the distribution of each of the scores and the cut-off (if any) that was used by their authors to evaluate pathogenicity. The broadly used cut-off of pLI >0.9, proposed by Lek et al.15, results in 3,230 genes that can be considered highly essential. Here, for the purpose of comparing the various scores and for comparisons between human in vivo data, in vitro tumour cell data and mouse knockout data, we display ranked continuous scores or use a cut-off of the 85th percentile, which approximates the commonly used pLI >0.9.

We highlight essentiality scores that are primarily grounded on human population genetic information explored using unsupervised analyses. There are also supervised predictors of haploinsufficiency that integrate genomic features, including functional annotation, metrics of network topology, evolutionary and intraspecies population-level data. Representative works are those of Dang et al.25, Khurana et al.26, Steinberg et al.27 and Shihab et al.28. However, it should be noted that Shihab et al.28 found that the most informative feature was the population-level data captured by the Missense Z-score21.

Protein truncation
A truncated, incomplete and usually nonfunctional protein product. Generally, the result of stop-gain, frameshift or splice-donor genetic variants.

Loss-of-function variants
Genetic variants that severely disrupt the function of a protein. These can be missense (a change of the codon resulting in a change in the amino acid) or nonsense and protein-truncating variants.

Haploinsufficiency
In a diploid organism, having only a single functional copy of a gene (with the other copy inactivated by mutation), which is insufficient to maintain proper gene function.

Stop-gain variants
Also known as nonsense variants, changes in the genetic material that result in premature termination of the translated protein.

Saturate
When referring to the generation of gene variants genome-wide, the sample size at which all positions in the genome are seen variant at least once.

Frameshift variants
Deletions or insertions in the protein-coding region, the lengths of which are not divisible by three, thus disrupting the reading frame of the gene.

Synonymous variants
A change of nucleotide that does not lead to changes in the amino-acid sequence of a protein.

Neutral variation
Genetic variants that are not subjects of natural selection.

ROC curve
(Receiver operating characteristic curve). A visual and quantitative method of evaluating the performance of binary classifiers. The true positive rate of a classifier is plotted against the false-positive rate.

Characteristics of essential genes
The use of human population data to identify genes that appear to be essential in vivo invites the comparison to essential genes described in other settings.
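Comparisons between essentiality scores, or between scores and screens, are commonly summarized as a rank correlation, as in FIG. 1. The following is a minimal sketch in Python of a Spearman rank correlation between two per-gene scores. The gene names and score values are invented for illustration and are not taken from any of the cited data sets, and the helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical. Because some scores are oriented as "lower = more intolerant" (for example, RVIS) and others as "higher = more intolerant" (for example, pLI), the sketch flips the sign of one score before ranking.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five made-up genes (illustration only).
pli = {"GENE_A": 0.99, "GENE_B": 0.95, "GENE_C": 0.40, "GENE_D": 0.10, "GENE_E": 0.01}
rvis = {"GENE_A": -2.1, "GENE_B": -1.0, "GENE_C": -1.3, "GENE_D": 0.2, "GENE_E": 1.5}

genes = sorted(set(pli) & set(rvis))   # compare only genes scored by both
x = [pli[g] for g in genes]
y = [-rvis[g] for g in genes]          # flip sign so that higher = more intolerant
rho = spearman(x, y)
print(f"Spearman rho = {rho:.2f}")
```

Working on ranks rather than raw values is what allows scores with very different distributions (standard normal, bimodal on [0,1], percentiles) to be compared on a common footing.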


Table 1 | Metrics of gene essentiality for human population sequence data

Score: residual variation intolerance score (RVIS); EvoTol; Missense Z-score; probability of haploinsufficiency (Phi); probability of loss-of-function intolerance (pLI); LoFtool; selective effects for heterozygous protein-truncating variants (s_het)

Sample size
RVIS: 6,503 exomes (ESP), updated to 60,706 on their website
EvoTol: 1 million variants from dbSNP
Missense Z-score: 1,000 Genomes Project data to calculate the baseline mutational rate of de novo mutation; ESP for the observed variant counts
Phi: 10,500 exomes (ESP + UK10K + CoLaus), autosomes only, updated for 60,706 exomes
pLI: 60,706 exomes from ExAC
LoFtool: 60,706 exomes from ExAC
s_het: 60,706 exomes from ExAC

Method
RVIS: Residuals derived from the linear regression of the number of common functional variants on the total number of variants
EvoTol: Residuals derived from the linear regression of the number of common functional variants on the total number of variants
Missense Z-score: Z-score to quantify the difference between the observed missense variants and the expectation based on a mutation model
Phi: Posterior probabilities from a two-state Poisson mixture model
pLI: Posterior probabilities from a three-state Poisson mixture model
LoFtool: Heuristic, builds on EvoTol20 and mutation model from the Missense Z-score21
s_het: Bayesian estimation of the selection coefficient on heterozygous loss of function

Distinctive feature
RVIS: Introduces the concept of applying human population genetics to assess function, pathogenicity and essentiality
EvoTol: Combines intraspecies and interspecies information in a similar framework to RVIS
Missense Z-score: Uses a neutral mutation model as a baseline
Phi: Uses a mixture model to estimate haploinsufficiency
pLI: Unprecedented sample size
LoFtool: Non-parametric combination of mutation rates, functional predictions and variation data
s_het: Direct estimation of the selection acting on the gene

AF filter
RVIS: >0.1%
EvoTol: NA
Missense Z-score: Singleton
Phi: <1%
pLI: <0.1%
LoFtool: NA
s_het: <0.1%

Variant class
RVIS: SNVs only; missense, stop-gain, stop-lost, prediction of splice-site effects as provided by ESP
EvoTol: SNVs only; considered to be functional based on fathmm prediction
Missense Z-score: Missense
Phi: Stop-gain or frameshift (SnpEff)
pLI: Stop-gain, splice, frameshift (VEP-LOFTEE)
LoFtool: Stop-gain, splice, frameshift (VEP-LOFTEE)
s_het: Stop-gain, splice (VEP-LOFTEE)

Reported precision against OMIM genes (ROC/AUC)
RVIS: 58% against OMIM disease genes, 78% against OMIM haploinsufficient genes or carrying de novo variants, 80% against de novo OMIM haploinsufficient genes
EvoTol: ROC shown, no AUC reported, outperforms RVIS
Missense Z-score: 87% against de novo OMIM haploinsufficient genes
Phi: 76% against OMIM haploinsufficient genes
pLI: Not reported
LoFtool: 86% against de novo OMIM haploinsufficient genes
s_het: OMIM not reported, 93% to identify inheritance mode on 459 genes involved in clinical disease

Functional and clinical correlates
RVIS: High fraction of developmental genes in the top quartile
EvoTol: Nuclear receptors and metabolic genes are intolerant to predicted damaging variants
Missense Z-score: Higher Z-score in genes with de novo loss-of-function mutations in autism spectrum disorders or intellectual disability cases
Phi: Fewer paralogues; more central in protein–protein interaction networks; enrichment in genes with lethal mouse knockout phenotype, more likely to be in protein complexes; enriched in genes for which loss of function leads to loss of cell viability; more conserved
pLI: Highly expressed genes; depleted in eQTL hits; enriched in GWAS hits; more protein–protein interactions; gene-set enrichment analysis results (spliceosome, ribosome and proteasome); 50% of assessed human orthologues of mouse genes with conditional lethal knockout phenotype are essential
LoFtool: Enriched in genes expressed preferentially in the brain
s_het: Developmental pathways and transcriptional regulators are enriched in high s_het values. Positive correlation with protein–protein interaction count


Table 1 (cont.) | Metrics of gene essentiality for human population sequence data

Score: residual variation intolerance score (RVIS); EvoTol; Missense Z-score; probability of haploinsufficiency (Phi); probability of loss-of-function intolerance (pLI); LoFtool; selective effects for heterozygous protein-truncating variants (s_het)

Interpretation
RVIS: Lower value = more intolerant
EvoTol: Lower value = more intolerant
Missense Z-score: Higher value = more intolerant
Phi: Higher value = more intolerant
pLI: Higher value = more intolerant
LoFtool: Lower value = more intolerant
s_het: Higher value = more intolerant

Thresholds for pathogenicity
RVIS: Standard normal; the authors evaluate the lowest quartile
EvoTol: Percentiles; the authors evaluate the lowest quartile
Missense Z-score: Standard normal
Phi: Bimodal on [0,1]; the authors evaluate the set >0.95
pLI: Bimodal on [0,1]; the authors evaluate the set >0.9
LoFtool: Percentile; the authors evaluate the lowest quartile
s_het: Inverse Gaussian

Refs
RVIS: 19
EvoTol: 20
Missense Z-score: 21
Phi: 23
pLI: 15
LoFtool: 22
s_het: 24
AUC, area under the curve; AF, allele frequency; CoLaus, Cohort Study of Lausanne; dbSNP, The Single Nucleotide Polymorphism database; eQTL, expression
quantitative trait locus; ESP, Exome Sequencing Project; ExAC, Exome Aggregation Consortium; fathmm, functional analysis through hidden Markov models;
GWAS, genome-wide association study; NA, not applicable; OMIM, Online Mendelian Inheritance in Man; ROC, receiver operating characteristic; SNV,
single-nucleotide variant; UK10K, the United Kingdom 10,000 Genomes Project; VEP-LOFTEE, Loss‑Of‑Function Transcript Effect Estimator plugin for the Ensembl
Variant Effect Predictor (VEP).
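Two of the method families listed in Table 1 can be illustrated with a short sketch: a regression residual (the idea behind RVIS and EvoTol) and a depletion statistic that compares observed with expected variant counts (the idea behind the Missense Z-score). The Python below is a simplified illustration under stated assumptions, not the published implementations: it uses raw ordinary least squares residuals, and a Poisson-style approximation `(expected - observed) / sqrt(expected)` chosen here only for clarity. All counts, gene names and function names are hypothetical.

```python
from math import sqrt


def ols_residuals(x, y):
    """Residuals of y regressed on x by ordinary least squares with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def depletion_z(observed, expected):
    """Illustrative depletion statistic: positive when fewer variants are
    observed than expected (assumed Poisson-style scaling, not a published formula)."""
    return (expected - observed) / sqrt(expected)


# Hypothetical per-gene counts (illustration only).
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
total_variants = [10, 20, 30, 40]
common_functional = [2, 3, 9, 8]

# Regression-residual idea: a negative residual means fewer common functional
# variants than predicted from the total variation, that is, more intolerant.
residuals = ols_residuals(total_variants, common_functional)
most_intolerant = genes[residuals.index(min(residuals))]
print(most_intolerant, [round(r, 2) for r in residuals])

# Gene-length caveat: a short gene with a low expectation cannot reach a
# strong depletion signal even when no variants at all are observed.
z_short = depletion_z(observed=0, expected=2)
z_long = depletion_z(observed=10, expected=50)
print(round(z_short, 2), round(z_long, 2))
```

The last two lines mirror the point made in the text about gene length: with an expectation of only two variants, complete absence of variants still yields a weaker signal than partial depletion in a long gene, which is why short genes tend to produce false-negative rather than false-positive calls.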

Indeed, multiple systems have been developed to identify essential genes in prokaryotic or eukaryotic models, in simple or complex forms of life, and in in vivo or in vitro settings. The Database of Essential Genes (DEG)5 compiles in version 14.7 a total of 53,394 essential genes and 786 essential non-coding sequences among bacteria (18,285 essential elements), archaea (519 essential elements) and eukaryotes (34,590 essential elements).

Pioneering studies in yeast allowed the systematic deletion of all genes individually. Giaever et al.29 estimated that approximately 20% of the genes were required for viability. These encode core components of the cell, such as proteins involved in transcription and metabolism. Those essential components are largely shared with other eukaryotic forms of life, such as Caenorhabditis elegans. RNA interference (RNAi) screens in the worm indicate that >50% of worm essential genes are orthologous to yeast essential genes30. All the studies, whether in yeast, worms, mice or, more specifically, in humans, describe essential genes as located in critical places in the core developmental, metabolic and signalling pathways. Specifically, compared with other genes, essential genes are described as having more protein–protein interaction partners, being more centrally located in protein–protein networks, being broadly and strongly expressed and depleted in expression quantitative trait loci (eQTLs), having higher protein abundance, being more conserved between species, having significantly fewer paralogous genes in a genome, and being associated with developmental pathways and embryonic lethality23,24,31–35. In particular, the role of essential genes in protein–protein interaction networks36 has received considerable experimental support37. Proteins are classified as ‘indispensable’ based on their impact on the network upon removal of the specific protein. Recent analysis of the interaction network of 6,339 human proteins and 34,813 interactions identified 21% of the proteins in the network as indispensable33. The indispensable proteins often harbour disease-causing variants and are targets of drugs, which made the authors state that altering the network properties is critical for the transition between healthy and disease states33.

Expression quantitative trait loci
(eQTLs). Loci where variation is associated with differential expression of a gene.

Haploid
Of cells, containing a single set of chromosomes.

Ploidy
The number of sets of chromosomes in a cell.

Below, we emphasize the correspondence between in vitro screens in human cell lines, mouse models, and in vivo human genetic population data. The reader needs to consider the profound experimental differences of what is tested for, as well as the implications for what is deemed to be essential across these disparate approaches (TABLE 2).

Human cell lines. There have been several CRISPR–Cas9 studies38–40 that targeted every gene in human cell lines and scored for cell viability. Overall, these in vitro screens identified components of fundamental pathways that are expressed at high levels. They also identified enrichment for genes involved in RNA processing. Work by several groups, and encompassing many cell lines, showed a high degree of overlap in the pattern of enrichment but also revealed differences specific to each cell line that may reflect the developmental origin, oncogenic drivers, paralogous gene expression patterns and chromosomal structure of each cell line40. One important caveat of cell-based assays for gene essentiality is that they address the biology of tumour cell lines and generally test for complete knockouts. FIGURE 1 illustrates the correlation between essential genes identified in cell-based assays and those identified by in vivo genetic studies using the scores presented in the previous section. It is apparent from this analysis that assays on tumour cells largely agree on the sets of genes considered essential in vitro, in particular for a given laboratory. However, the cell-based results do not share the same core set of essential genes (FIG. 2) with those identified in vivo in human population studies.

The observed differences between in vitro screens and in vivo population studies may reflect different requirements for tumour cell versus organismal viability. They may also reflect ploidy differences because some of the investigated tumour cell lines are fully or partially haploid (KBM7), where a single CRISPR–Cas9 deletion results in a knockout, whereas other cell lines are diploid or near-diploid (Raji, Jiyoye, HCT116 and RPE1), hypotriploid (K562), or of greater ploidy complexity


(GBM514, HeLa and DLD1). A gene may also be needed in a cell line for the culture conditions tested, whereas an essential gene in vivo could be needed in any important tissue at any stage of development. Finally, the main signal from human population data derives from the identification of haploinsufficiency and, to a lesser extent, from the contribution of null states due to homozygous loss-of-function and biallelic inactivation. This is an important consideration when comparing in vivo results with those from near-haploid screens in KBM7, a cell line that is already, at baseline, in a state of hemizygosity and only scoring major effects when genes are fully knocked out.

[Figure 1: correlation matrix, colour scale from 1.0 to –1.0. In vivo data: Phi, pLI, s_het, Missense-Z, LoFtool, RVIS. Cell viability data: Hart DLD1, Hart GBM, Hart RPE1, Hart HeLa, Hart HCT116, Blomen KBM7, Wang Raji, Wang Jiyoye, Wang K562, Wang KBM7.]
Figure 1 | Rank correlation of essential gene sets across human population data and cell-based CRISPR–Cas9 screens. The essentiality metrics residual variation intolerance score (RVIS)19, LoFtool22, Missense-Z21, s_het (REF. 24), probability of loss-of-function intolerance (pLI)15 and probability of haploinsufficiency (Phi)23 are compared to cell viability data from studies of Wang86, Blomen38 and Hart39. EvoTol20 is not depicted due to the lack of availability of updated scores using the Exome Aggregation Consortium (ExAC) sample. There is a high degree of consistency internal to the in vitro screens, mostly within a given laboratory. Of note, the human population-genetics-based scores were calculated on the same data source of 60,706 samples from ExAC15. See Supplementary information S1 (box) for data sets and code needed to generate the figure.

Hemizygosity
The absence of one copy of a gene in diploid cells.

Knockout mice. The mouse is the most commonly used model organism in human disease research41. Roughly 30% of genes in the mouse genome may be necessary for survival to adulthood42. However, the usefulness of mouse models has been questioned for their recapitulation of human conditions43. It is important to emphasize what is scored in mice: the impact of biallelic inactivation on embryonic lethality. This implies that mice that were crossed in these experiments carried heterozygous loss-of-function genotypes and were viable and fertile, although they might have had a clinical or laboratory phenotype. However, heterozygous loss-of-function variants in essential genes may also be lethal in mice34,44,45. By contrast, human population analyses predominantly score on loss of one allele or on depletion of variants in a gene. This implies that human carriers may have severe phenotypes that lead to exclusion from adult cohorts such as ExAC.

Georgi et al.34 characterized the patterns of genetic variation of 2,472 human orthologues of known essential genes in the mouse. Consistent with the action of purifying selection, these genes are characterized by reduced levels of sequence variation, skewed towards more rare variants, and increased conservation across the primate and rodent lineages relative to the remainder of genes in the genome. A list of 593 recently reported essential genes in mice46 combined with null alleles from the Mouse Genome Informatics (MGI) database results in 3,326 essential genes that have orthologues in the human genome. Comparison of these genes to human genes ranked by their essentiality scores indicates an overlap of 932 essential genes (FIG. 2). This core set of 932 genes shows a modest enrichment for functions in transcription, regulation of gene expression and nucleotide metabolism, as well as for autosomal dominant disease genes.

There are also discordant gene sets between mouse and human data: 2,394 genes are lethal in knockout mice but are not scored as essential in humans (FIG. 2). The considerable size of the discordant sets underlies differences between synthetic experiments and data from outbred human populations. For this discordant set, a reason for the discrepancy is also the lack of power of the bioinformatics scores: short genes and recessively acting genes may go undetected. There are also 1,282 genes that score as essential in the human population. Of these, 495 have been evaluated in mice; some of these genes may also be considered essential in mice when accounting for loss of fitness rather than merely a binary viable versus nonviable classification of essentiality. Another 787 are scored as essential in vivo in humans but have not been tested or reported in knockout mice. Because of the production process for knockout mice, which until the advent of genome editing included the use of a male embryonic stem cell line, the vast majority of haploinsufficient essential genes would simply be missed (those that are haplolethal) and indistinguishable from a technical failure. These considerations notwithstanding, there is general agreement regarding the pathways that are enriched in or depleted of essential genes in humans and mice. This supports the notion that the in vivo conditions screened in the mouse model can reflect human consequences with respect to loss of function in those genes, despite the notable differences between null data in mice (dominant and recessive) and haploinsufficiency data in humans (dominant only).

The core set of essential genes in mice and humans. From the discussion above, it follows that there are important considerations when comparing essential gene sets defined by screens and metrics that test fundamentally different systems. The requirements of tumour cells of variable ploidy, the viability of knockout


Table 2 | Differences between studies of gene essentiality in the human population, human cells and mice

Source of data
Human adult population: Large-scale population sequencing projects (for example, ExAC)
Human cell lines: Cancer cell lines (KBM7, Raji, Jiyoye, HCT116, RPE1, K562, GBM514, HeLa and DLD1)
Mouse models: Knockout mice from the MGI database

Test
Human adult population: Rare heterozygous loss-of-function variants*
Human cell lines: CRISPR–Cas9-mediated knockout (haploid, diploid or complex ploidy cell lines)
Mouse models: Biallelic inactivation

Metric
Human adult population: Essentiality scores (RVIS, Missense Z-score, EvoTol, Phi, pLI, LoFtool and s_het)
Human cell lines: Cellular viability
Mouse models: Embryonic lethality

Functional and genetic model
Human adult population: Haploinsufficiency; dominant effects
Human cell lines: Dominant and recessive effects
Mouse models: Recessive effects

Additional considerations
Human adult population: Biallelic inactivation due to rare loss-of-function variants is rarely scored or observed. Common homozygous loss-of-function variants are unlikely to be associated with fitness consequences.
Human cell lines: The requirements for viability of the specific tumour cell depend on the culture conditions tested. An essential gene in vivo may be required in any important tissue at any stage of development.
Mouse models: The mice that were crossed in these experiments carried heterozygous loss-of-function genotypes but were viable and fertile.
*Can be extended to testing of depletion of non-synonymous variants. ExAC, Exome Aggregation Consortium; MGI, Mouse Genome Informatics; Phi, probability of
haploinsufficiency; pLI, probability of loss‑of‑function intolerance; RVIS, residual variation intolerance score.
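The gene-set counts quoted in the text for the mouse and human comparison can be cross-checked with simple arithmetic. The sketch below, in Python, reuses only numbers stated in this article (3,326; 932; 2,394; 1,282; 495; 787); the variable names are illustrative. The second part shows the same partitioning with set operations on made-up gene identifiers, which are hypothetical and stand in for real gene lists.

```python
# Counts quoted in the text (mouse essential genes with human orthologues).
mouse_essential_with_human_orthologue = 3326
shared_mouse_and_human = 932
lethal_in_mouse_not_essential_in_human = 2394

# The overlap and the mouse-only discordant set partition the mouse list.
assert (shared_mouse_and_human + lethal_in_mouse_not_essential_in_human
        == mouse_essential_with_human_orthologue)

# The 1,282 genes scoring as essential in the human population split into
# those evaluated in mice and those not tested or reported in knockout mice.
human_scored_set = 1282
evaluated_in_mice = 495
not_tested_in_mice = 787
assert evaluated_in_mice + not_tested_in_mice == human_scored_set

# The same bookkeeping with set operations on hypothetical gene identifiers.
mouse_essential = {"g1", "g2", "g3", "g4"}
human_essential = {"g3", "g4", "g5"}

core = mouse_essential & human_essential        # essential in both
mouse_only = mouse_essential - human_essential  # discordant, mouse side
human_only = human_essential - mouse_essential  # discordant, human side

print(sorted(core), sorted(mouse_only), sorted(human_only))
```

Keeping the overlap and the two discordant sets as explicit set differences makes it easy to verify that every gene in a source list is accounted for exactly once.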

mice bred from heterozygous animals, and the observation of heterozygous genotypes associated with haploinsufficiency in the human population test profoundly different biological conditions. There is, however, a core set of 188 genes that are called essential by all three orthogonal approaches (FIG. 2). These genes represent major cellular functions, prominently splicing, nuclear transport, nucleosome architecture, mitosis and the cell cycle (FIG. 3).

Essential genes of unknown function in humans. The design of the minimal genome of Mycoplasma mycoides (JCV‑syn1.0) led to the unexpected observation of 149 genes that could not be assigned to a specific biological function3. Of these, 55 have completely unknown functions, suggesting the presence of undiscovered functions that are essential for life. Dey et al.47 investigated the “dark matter of the human protein-coding genome”, which they estimated as being composed of more than 6,000 poorly studied genes. By profiling the entire human protein-coding genome across 177 eukaryotic species by using phylogenetics, they predicted the functions for hundreds of the poorly characterized genes.

As an approximation of the identification of essential genes of unknown function, we assessed how many of the most essential genes (in the top 5% of essentiality in either the human population genetic scores or the cell-line-based metrics) returned no articles in PubMed. This approach yielded a list of 19 expressed essential genes of uncharacterized function. These genes include PCNX1 (an evolutionarily conserved transmembrane protein similar to the pecanex protein in Drosophila melanogaster), TBC1D3C (which belongs to a cluster of related genes that may be involved in GTPase signalling and vesicle trafficking), POLR3H (an uncharacterized subunit of RNA polymerase III that is conserved across species), MRPL57 and MRPS14 (proteins that belong to undetermined ribosomal subunits that seem specific to animal mitoribosomes), DDX55 (a putative RNA helicase that may be involved in a range of nuclear processes including translational initiation, nuclear and mitochondrial splicing, and ribosome and spliceosome assembly), CBWD3 and CBWD5 (cobalamin synthase W-domain-containing proteins) and TRIM51 (two organisms have orthologues with the human gene). Other genes in the set are also broadly conserved across species: BRIX1, TANGO6, EIF5AL1, ZMAT2, ZSWIM8, DENND4B, TEDC1 (also known as C14orf80), ARMC7, CCDC84 and RTFDC1.

Two genes in the short list, ARMC7 and CCDC84, were identified as essential in cell lines and were subjected to dedicated functional analyses39. CCDC84 exhibited enriched nuclear staining and interacted with different sets of proteins that are predicted to participate in mRNA splicing, and it thus may be a component of the PRPF splicing complex. ARMC7 co‑immunoprecipitated with the poorly characterized protein RBM48. The investigators indicate that RBM48 contains an RNA-binding motif, and ARMC7 is an Armadillo-repeat-containing protein, which was interpreted as RBM48–ARMC7 being an essential protein complex with a role in RNA metabolism and/or transcription. Both genes were found to be amplified across several cancer tissues and cell lines39.

As this Analysis article went to press, Cassa et al. also described a promising set of essential genes that lack functional or disease associations24. These exercises for identifying essential uncharacterized genes are not exhaustive and may be confounded by the legacy of previous gene names or by paralogues. However, these analyses serve to highlight the opportunities to delve into the issue of essential genes of unknown function in the human genome48.

Pathogenic variants in essential genes
In 2012, MacArthur et al.10 applied stringent filters to 2,951 putative loss‑of‑function variants obtained from 185 human genomes to determine their true prevalence and properties. This study concluded that human genomes typically contain approximately 100 loss‑of‑function variants. Their prediction implies that up to 20 genes could be completely inactivated by homozygous loss of function in each individual. However, genes carrying common loss‑of‑function

6 | ADVANCE ONLINE PUBLICATION www.nature.com/nrg



[Figure 2 panels: two Venn diagrams of three gene sets. Set labels: Essential in human populations (whole organism might need functions not needed for simple cellular systems; enriched in dominant disease genes); Essential in knockout mice (enriched in recessive disease genes); Essential in human cell lines (cell lines are not representative of whole organisms; enriched in recessive disease genes). Left panel total: 5,706. Right panel, after subtracting genes for which no knockout mice have been produced, total: 3,866. Intersection of all three sets in both panels: 188.]

Figure 2 | Consistency of essentiality calls across human in vivo, mouse in vivo, and CRISPR–Cas9 cell line data sets.
Venn diagrams depicting the 3,326 mouse genes identified as essential by Dickinson et al.46 and the International Mouse Phenotyping Consortium (yellow);
the top 15% human essential genes as defined by RVIS, pLI, Phi, missense Z‑score,
LoFtool and shet from human population exome sequence data (red); and essential genes as identified by CRISPR–Cas9
screens on human cell lines (blue). The total number of genes marked as essential by any of the methods is 5,706
(left panel). Not all genes may have been scored in all three systems, in particular in the mouse models; hence, the right
panel only shows genes for which mouse knockouts have been generated. The nature of the 188 genes in the intersection
(that is, those scoring as essential in all three systems) is depicted in FIG. 3.
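The comparison summarized in Figure 2 can be illustrated with a minimal Python sketch. It assumes each system provides either a per-gene score (higher meaning more essential) or a set of gene names called essential. The helper names `top_fraction` and `venn_regions`, the gene names and the scores are hypothetical and only serve the illustration; they are not the data sets or code used for the figure.

```python
from typing import Dict, Set


def top_fraction(scores: Dict[str, float], fraction: float) -> Set[str]:
    """Genes in the top `fraction` of a score where higher means more essential."""
    n_keep = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=lambda gene: scores[gene], reverse=True)
    return set(ranked[:n_keep])


def venn_regions(human: Set[str], cells: Set[str], mouse: Set[str]) -> Dict[str, Set[str]]:
    """Split three gene sets into the seven disjoint regions of a Venn diagram."""
    return {
        "human_only": human - cells - mouse,
        "cells_only": cells - human - mouse,
        "mouse_only": mouse - human - cells,
        "human_and_cells": (human & cells) - mouse,
        "human_and_mouse": (human & mouse) - cells,
        "cells_and_mouse": (cells & mouse) - human,
        "all_three": human & cells & mouse,
    }


# Toy inputs (hypothetical gene names and scores, not real data).
human_scores = {"G1": 0.99, "G2": 0.95, "G3": 0.40, "G4": 0.10}
human_essential = top_fraction(human_scores, 0.5)
cell_essential = {"G1", "G3"}
mouse_essential = {"G1", "G2", "G5"}

# Left panel: every gene called essential by any of the three systems.
left_panel = venn_regions(human_essential, cell_essential, mouse_essential)

# Right panel: keep only genes for which a mouse knockout exists.
has_mouse_knockout = {"G1", "G2", "G4", "G5"}
right_panel = venn_regions(
    human_essential & has_mouse_knockout,
    cell_essential & has_mouse_knockout,
    mouse_essential & has_mouse_knockout,
)

# Because the regions are disjoint, their sizes add up to the panel total.
total_left = sum(len(genes) for genes in left_panel.values())
total_right = sum(len(genes) for genes in right_panel.values())
print(total_left, total_right, sorted(left_panel["all_three"]))
```

Restricting all three sets to genes with a mouse knockout before recomputing the regions mirrors the logic of the right panel: genes never tested in mice cannot be counted as discordant with the mouse data.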

variants were observed to be less central to key cellular pathways and to be strongly enriched for functional categories, such as olfactory reception, that are not critical for fitness and survival. At the other extreme of the spectrum, they observed heterozygous carriers of rare variants for numerous recessive disease genes. As described above, the current scale of genome and exome data, as well as the new metrics for essentiality, support a reassessment of the disease consequences of variants in essential genes.

The analysis by Cassa et al.24 of heterozygous protein-truncating variants in over 60,000 ExAC individuals used the shet score. The highest shet values predict phenotypic severity, age of onset and penetrance for Mendelian disease-associated genes. In addition, genes involved in neurological phenotypes, including autism, congenital heart disease and inherited cancer risk, seem to be under more intense selection, that is, more essential. Overall, quantitative estimates of essentiality appear particularly useful in Mendelian disease gene discovery efforts. There are emerging data suggesting that loss‑of‑function variants are more frequently observed in younger people, supporting the notion that they may limit healthspan49.

Among over 10,000 deeply sequenced genomes at Human Longevity Inc.14, we have observed 505,906 putative loss‑of‑function variants, representing an average of 40.7 per individual genome. In addition, there are, on average, 0.6 pathogenic or likely pathogenic ClinVar variants and 3.9 high-confidence disease-causing mutations in the Human Gene Mutation Database (HGMD) per individual. Essential genes are enriched for pathogenic variants that are annotated in human genetic variant databases (ClinVar and HGMD)50,51 (FIG. 4). The enrichment is magnified by the parallel depletion of rare missense variants in those genes (FIG. 4). The stable rates of synonymous variation across various levels of gene essentiality are consistent with the neutral model, where synonymous variants are randomly distributed across the spectrum of essentiality. It is also of note that when the same analysis is done with data from CRISPR–Cas9 screens, the depletion of rare variants occurs only at the highest scores of in vitro loss of viability. This result can be interpreted as an indication that only the most severe in vitro cellular phenotypes correlate with the in vivo human population data. This is consistent with the study design: of all the genes for which biallelic loss‑of‑function variants impair viability based on in vitro screens, the mildest effects are expected to be recessive (no noticeable effect when heterozygous), whereas the most severe could be expected to be haploinsufficient (organismal impairment when heterozygous, enabling their discovery by the human population sequencing approaches). Overall, there are 395 diseases that are associated with genes that are essential in the human population and 681 diseases associated with genes that are essential in mouse models or cell line CRISPR–Cas9 screens. The list of diseases associated with essential genes is included in Supplementary information S3 (table).

Compound heterozygosity
The state in which both alleles of a gene carry a (deleterious) variant, but those variants are different.

Consequences of biallelic inactivation
As discussed, most of the signal of essentiality in humans is derived from the observation of rare heterozygous loss‑of‑function variants. However, there are also common (high allele frequency) loss‑of‑function variants that effectively result in null individuals through homozygosity or compound heterozygosity. Loss of those genes is likely to have weak or neutral effects on fitness and health because they are observed in presumably healthy adults10,52–54. ExAC, which comprises data from


[Figure 3 network: nodes are core essential genes, colour coded by functional category. Legend categories: Cell cycle arrest and/or DNA repair; Microtubule and/or mitosis; Nuclear transport; Splicing; Transcription, nucleosome, epigenetics; Other. Gene labels shown: WDR1, TLN1, PDPK1, MAPKAP1, ATP5A1, SOS1, PPRC1, ATP5B, RICTOR, HSPA5, ZBTB17, BPTF, GABPA, MTOR, MYH9, NRF1, RPTOR, HSPD1, NCKAP1, VPS35, ACTG1, ACTR3, HMGCR, EIF3B, SP1, SMARCA5, NRBP1, INO80, BRD4, PTPN11, SMG1, RUVBL2, SAP130, TRRAP, MED12, UPF1, YY1, ACTL6A, DNMT1, UPF2, RAF1, CCT3, CHAF1A, MED1, CASC3, TRIM28, ACTB, SRF, PPP2CA, SHOC2, INTS2, SIN3A, SMARCA4, PPP1CB, SMARCB1, PPP2R1A, CHD4, INTS6, NPLOC4, PRMT5, GTF2A1, VCP, COPS5, COPS2, CUL1, GPS1, SFPQ, CNOT9, CUL3, PSMC3, POLR1B, KIF11, CDK1, WEE1, DDB1, DYNC1H1, PKM, EXOC8, ANAPC2, G6PD, PLK1, CKAP5, PRMT1, DCTN1, ESPL1, WAPL, POLR2A, PAFAH1B1, CTCF, DGCR8, SMC3, PDS5A, DROSHA, LIN9, CTR9, CDC73, NUP85, RAD21, SRSF3, TOP1, RTF1, MEN1, NUP155, HNRNPU, NIPBL, SSRP1, SETD1A, RANBP2, SRSF1, TPX2, HNRNPL, SF3B1, KPNB1, RAE1, SRSF7, SF3A1, XAB2, KMT2D, NUP98, SNRNP200, DHX9, ILF3, PRPF19, RBM39, EEF2, PRPF3, RAD51, GTF3C1, XRCC5, AP2M1, SYMPK, DICER1, UBR4, DDX5, SRRT, COPG1, RBBP6, REV3L, SF1, AGO2.]

Figure 3 | Core set of essential genes in mice and humans. A total of 188 genes are scored as essential in human population
studies, human cell-line CRISPR–Cas9 screens and knockout mice. These genes represent major cellular functions (inset).
Genes that are part of Gene Ontology gene sets that are significantly enriched in the core essential genes are colour coded
according to major cellular functions. The figure is built using STRING DB experimentally validated and database-derived
interactions. See Supplementary information S1 (box) for the data sets and code needed to generate the figure.

predominantly outbred populations, identified 1,775 genes with predicted biallelic loss of function in 60,706 individuals15. In the Icelandic population studies, 1,171 genes were predicted to be null among 104,220 individuals55. A study of 3,222 British adults of Pakistani heritage with high parental relatedness identified 1,111 homozygous genotypes with predicted loss of function in 781 genes54. Although the latter study focused on rare variants, it did not record adverse consequences of the loss of function, as assessed by rates of medical consultation and drug prescription54. The population bottleneck of the Finnish founder population facilitates the study of low-frequency loss‑of‑function variants as well as Mendelian diseases. A study of 3,000 Finns described a significant enrichment of low-frequency (0.5–5%) loss‑of‑function variants56. Simulation based on these data suggests that although deleterious variants are inevitably pushed to extremely low frequency after 1,000 or more generations, they can easily persist at frequencies between 0.1 and 1% up to 100 generations after a bottleneck56. The latest published study of 10,503 adult participants in the Pakistan Risk of Myocardial Infarction Study57 identified 49,138 rare loss‑of‑function variants estimated to knock out 1,317 genes, each in at


least one participant. The study explored in detail the health consequences of null phenotypes of PLA2G7, CYP2F1, TREH, A3GALT2, NRG4, SLC9A3R1, and APOC3 (REF. 57). These studies are converging on a catalogue of human loss‑of‑function variants. Projects such as The Broad Institute’s Human Knockout Project will leverage previously generated and new functional genomic data to validate a subset of these variants. It is expected that many instances of biallelic inactivation will require genotype-based recall of the individuals for deep phenotyping to gauge the functional consequences53.

We analysed the nature of genes that tolerate a putative null state due to homozygous loss‑of‑function variants in the adult human population. There are 3,296 genes in ExAC with at least one homozygous loss‑of‑function variant. The median allele frequency is greater than 1 per 1,000. As expected, this set of genes is enriched for drug metabolism genes and olfactory receptors and scored for low essentiality. These data support the concept of a ‘dispensable genome’ as profiled by the various publications summarized in the previous paragraph. However, there is some overlap with genes that are scored as essential in the human population (n = 117), in CRISPR–Cas9 cell lines (n = 159) and in knockout mice (n = 273). This includes eight genes (HSPD1, DNM2, CASC3, ABCB7, NUP85, LIN9, BPTF and UBR4) that are deemed to be essential by all three systems and models. However, there is evidence that the observed homozygous variants in these eight genes do not contribute to actual loss of protein function: variants in those genes are positioned in the last exon or in non-conserved exons or may represent sequencing errors.

It is thus important to exclude truncating variants of limited functional impact. The assessment of putative loss‑of‑function variants, whether rare or common, requires attention to the percentage of sequence affected, loss of functional domains, proportion of isoforms affected, principal isoform damage, and degradation of the transcript by nonsense-mediated mRNA decay (NMD)8,11,58,59. Specifically, there is depletion of common loss‑of‑function variants in principal isoforms and in regions more than 50 nucleotides upstream of the last exon–exon junction8. Other putative loss‑of‑function variants that may be excluded are those in non-canonical splice sites or exons flanked by non-canonical splice sites, splice-site mutations at small (<15 bp) introns, and those where the purported loss‑of‑function allele is observed across primates57. The program LOFTEE (Loss‑Of‑Function Transcript Effect Estimator), a plugin for the Ensembl Variant Effect Predictor (VEP), implements these practices. Finally, manual curation may be required to exclude variants with few or poorly supported sequencing reads, and for any given putative loss‑of‑function variant, experimental validation will be required to prove loss of function57. In summary, just as there is a case for an essential genome of around 3,000 genes, there is also substantial evidence to support a case for a 3,000‑gene or larger dispensable genome.

Nonsense-mediated mRNA decay
(NMD). A cellular pathway that serves to recognize and degrade mRNAs with translation termination codons that are positioned in abnormal contexts.

[Figure 4 plot: x axis, Essentiality (0.00 to 1.00, low to high); left y axis, Synonymous or missense variants per kb [10K individuals] (axis range 20 to 40); right y axis, Pathogenic variants per kb [ClinVar and HGMD] (axis range 2 to 6); series: Synonymous, Missense, Pathogenic; curves labelled In vivo and In vitro.]

Figure 4 | Pathogenic variants in essential genes. For the left axis, the plot uses variants from over 10,000 deep-sequenced genomes9. Depicted are the distributions of rare (allele frequency <0.001) synonymous (grey; N = 494,641) and missense (red; N = 887,534) variants throughout the essentiality spectrum. Red and grey solid lines are computed with essentiality scores using human population data; dotted lines are computed with the essentiality scores obtained with CRISPR–Cas9 screens in vitro. The right axis depicts the distribution of pathogenic variants (purple; N = 113,951) obtained from public databases (ClinVar and the Human Gene Mutation Database (HGMD)). It shows a significant enrichment in pathogenic variants in essential genes. The more essential a gene is, the less likely it is to tolerate missense variation but the more likely that this variation results in pathogenicity. By contrast, the distribution of rare synonymous variants (grey) across the essentiality spectrum is flat, which is consistent with a neutral model, that is, the absence of purifying selection.

Essentiality of the non-coding genome
There is increasing awareness that genetic variants, including single-nucleotide and structural and copy number variants, in the non-coding regions of the human genome can play an important role in human traits and diseases. In fact, most of the genome-wide association study (GWAS) variants map to non-coding regions (reviewed in REF. 60), and there are increasing numbers of reports of Mendelian disorders that map outside the protein-coding genome61–64. Understanding the relationship between essentiality, genetic constraint of regulatory elements, and pathogenicity would be important for progress in understanding the non-coding genome.

Can the idea that natural variation saturates the genome at non-essential regions extend to non-protein-coding and regulatory regions? There are two orthogonal features of coding regions that are not available in the non-coding genome. First, the non-coding genome lacks strong landmark features, which could serve as a unit similar to an exon. Second, certain mutational classes in the coding genome (for example, protein-truncating variants) are easier to score than those in the non-coding genome. However, recent advances in understanding conservation65 and CRISPR–Cas9 screening66 of the non-coding genome support extending the concept of essentiality beyond the human exome.
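One of the positional criteria for putative loss‑of‑function variants discussed in this article is whether a premature stop codon lies more than 50 nucleotides upstream of the last exon–exon junction. The following Python sketch illustrates that single criterion under simplifying assumptions: exon lengths are given in transcript order, the stop position is a 1-based coordinate on the spliced transcript, and the distance is measured from the first nucleotide of the stop codon. The function name and coordinate convention are hypothetical; this is not the LOFTEE implementation, which applies additional filters.

```python
from typing import List


def escapes_nmd_by_position(exon_lengths: List[int], stop_pos: int, window: int = 50) -> bool:
    """Return True if a premature stop codon is predicted to escape NMD by position.

    exon_lengths: exon lengths in 5'-to-3' transcript order (assumed input format).
    stop_pos: 1-based position of the stop codon's first nucleotide on the
        spliced transcript (assumed convention).
    window: distance threshold in nucleotides (50 in the rule described in the text).
    """
    if len(exon_lengths) < 2:
        # A single-exon transcript has no exon-exon junction.
        return True
    # Position of the last nucleotide before the final exon.
    last_junction = sum(exon_lengths[:-1])
    # Positive when the stop codon is upstream of the last junction;
    # zero or negative when it lies at the junction or in the last exon.
    distance_upstream = last_junction - stop_pos
    return distance_upstream <= window


# Hypothetical three-exon transcript: last junction after nucleotide 350.
exons = [200, 150, 300]
for position in (100, 299, 300, 400):
    label = "likely escapes NMD" if escapes_nmd_by_position(exons, position) else "NMD candidate"
    print(position, label)
```

Variants flagged as likely to escape NMD by position are the ones that, according to the text, deserve extra scrutiny before being counted as true loss of function.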


The functional characterization of the non-coding genome has been guided prominently by biochemical data as represented by the Encyclopedia of DNA Elements (ENCODE) project67 and by scores that serve to predict the pathogenicity of genetic variants. Prominent among these scores are CADD68, DeepSEA69 and Eigen70. CADD, the Combined Annotation-Dependent Depletion score, uses evolutionary information (human–chimpanzee fixed or nearly fixed recent evolutionary changes), annotation from Ensembl VEP, data from the ENCODE project and information from UCSC genome browser tracks (sequence conservation scores from GERP, phastCons and phyloP; functional genomics data such as DNase hypersensitivity and transcription factor binding; transcript information including distance to exon–intron boundaries or expression levels in commonly studied cell lines; and protein-level scores such as Grantham, SIFT and PolyPhen)68. DeepSEA69 was developed to directly learn a regulatory sequence code from large-scale chromatin-profiling data. DeepSEA enables the prediction of the chromatin effects of sequence alterations with single-nucleotide sensitivity to prioritize functional variants including eQTLs and disease-associated variants69. More recently, Eigen was developed as an unsupervised approach for integrating these different annotations into one measure of functional importance70. Eigen is dependent on annotation but independent of any labelled training data. There are additional computational methods to prioritize non-coding variants with functional effects (reviewed in REF. 71). Overall, these scores support variant prioritization for pathogenicity by using extensive evolutionary, biochemical and functional data.

More recently, the sequence context has been used to study polymorphisms at the genome level. Specifically, the heptanucleotide context explains >81% of variability in substitution probabilities and was used to derive substitution intolerance scores for genes and a new intolerance score for amino acids72. We have developed a similar approach to the nucleotide structure of the non-coding genome to define patterns of constraint and essentiality. Here, the heptameric structure of the genome defines a context-dependent tolerance score (CDTS)73 that identifies the constrained regions in the human non-coding genome. Different from the scores described above, CDTS is derived uniquely from human population data and does not use any a priori knowledge on classes of variation.

In parallel with human genomic studies, the non-coding genome is increasingly studied by using CRISPR–Cas9 screens. Fulco et al.74 assessed >1 Mb of sequence near two essential transcription factors, MYC and GATA1, and identified nine distal enhancers and repressors that control gene expression. Korkmaz et al.75 targeted Cas9 to transcription factor binding sites in 758 enhancer regions. Specifically, the study characterized the role of functional enhancer elements in mediating p53 and ERα gene regulation. In total, they identified six enhancer elements that potentially control cell proliferation. Sanjana et al.76 targeted >700 kb surrounding the genes NF1, NF2 and CUL3 to identify non-coding locations that modulate drug resistance in melanoma. Zhu et al.77 reported a CRISPR–Cas9 screen of 700 long non-coding RNAs (lncRNAs). They identified 51 lncRNAs that can positively or negatively regulate human cancer cell growth. At present, the non-coding genome CRISPR–Cas9 screens are regional and, based on their experimental readouts, may not be considered actual screens of essentiality. In the future, genome-wide screens will generate data on hallmarks of essentiality in the non-coding genome. Much as we have reviewed the essentiality of different regions of the protein-coding genome, it will be useful to compare data from CRISPR–Cas9 in vitro screens with human population data of essentiality in the non-protein-coding genome.

Conclusions
This article highlights approaches for the identification of genes that can be considered essential in humans, and emphasizes the importance of rare loss‑of‑function variants leading to haploinsufficiency. Evaluating gene essentiality by leveraging population genomics and empirical model systems goes beyond the study of gene function and redundancy. For clinical medicine, these essentiality metrics serve to support improved attribution of pathogenicity to variants. For drug development, the metrics provide an in vivo glimpse on the impact of natural, lifelong loss‑of‑function variants when present in a heterozygous or homozygous state78. One study54 indicated that drugs that target genes that tolerate rare homozygous loss‑of‑function have a greater registration approval rate compared to all other targets. This would suggest that it is easier or safer to target mutation-tolerant genes than essential, intolerant genes. For example, development of C‑C chemokine receptor type 5 (CCR5) antagonists, for the inhibition of human immunodeficiency virus (HIV) entry into cells, was supported by the lack of adverse consequences of a natural truncation (CCR5Δ32) variant that is observed in up to 3% of the human population79. A second scenario is exemplified by the identification of heterozygous truncating variants of the gene encoding cytotoxic T lymphocyte antigen 4 (CTLA4). Loss of this inhibitory receptor on immune cells leads to severe immune dysfunction through dysregulation of FOXP3+ regulatory T cells and hyperactivation of effector T cells80. CTLA4 is one of the current targets for immune-checkpoint-blocking antibodies in cancer. The observation of disease in long-term CTLA4 deficiency in natura describes the possible spectrum of drug toxicity that can be anticipated from CTLA4-blocking drugs80. A third scenario illustrates aspects of balancing the expectation from natural impacts of null variants, with the outcome of clinical trials that antagonize the target gene. A recent study of evolocumab, a monoclonal antibody that inhibits proprotein convertase subtilisin–kexin type 9 (PCSK9) and lowers low-density lipoprotein cholesterol levels, reported a 20% reduction in cardiovascular death over 48 weeks of treatment81. In comparison, lifelong carriage of PCSK9 loss‑of‑function alleles is associated with as much as 50% reduced risk of myocardial infarction.


We have also emphasized differences between definitions of essentiality of human genes in cell lines or knockout mice compared to evidence gathered from human population genetic studies. Whereas various sets of essential genes may inform biology and guide drug development efforts (in particular for targeting tumour cell viability), the reader should be aware of the considerable differences in what is scored as essential between in vitro and in vivo screens, in humans and in mice (TABLE 2). One approach to mitigating these fundamental differences of defining essentiality is to increase attention to the phenotypic traits of loss‑of‑function heterozygous mice and, more generally, to measure the biological consequences of haploinsufficiency. This may also be important for drug development, as haploinsufficiency may be equated to the half maximal inhibitory concentration (IC50), a measure of the effectiveness of a drug in early preclinical development.

Progressively, the metrics and scores used to define essential genes should serve to narrow down and identify the actual essential exons and codons in a gene82. Analyses may also include haplotype phasing, to support the identification of compound heterozygosity at essential genes. It is important to note that because deleterious alleles act synergistically, purifying selection limits linkage disequilibrium between rare loss‑of‑function variants in the genome83. In the future, it will be informative to extend the analysis of gene essentiality to cancer in vivo, as opposed to analyses in tumour cell lines. For this purpose, large databases of tumour somatic variation can be studied using the scores that have been developed for germline definition of essentiality. Finally, as the understanding of the hallmarks of essentiality extends to the non-coding genome, the field will be informed about the possible consequences of disruption of the essential regulatory machinery of genes and pathways. Collectively, human73 and interspecies conservation and constraint84, functional characterization67 and genetic evidence85 build a map of human genome essentiality.

Haplotype phasing
The assignment of an allele to one of the two copies of the chromosomes (maternal and paternal).

1. Maniloff, J. The minimal cell genome: “on being the right size”. Proc. Natl Acad. Sci. USA 93, 10004–10006 (1996).
2. Hutchison III, C. A. et al. Global transposon mutagenesis and a minimal Mycoplasma genome. Science 286, 2165–2169 (1999).
3. Hutchison III, C. A. et al. Design and synthesis of a minimal bacterial genome. Science 351, aad6253 (2016).
4. Liu, G. et al. Gene essentiality is a quantitative property linked to cellular evolvability. Cell 163, 1388–1399 (2015).
5. Luo, H., Lin, Y., Gao, F., Zhang, C. T. & Zhang, R. DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements. Nucleic Acids Res. 42, D574–D580 (2014).
6. Deutschbauer, A. M. et al. Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast. Genetics 169, 1915–1925 (2005).
7. Cirulli, E. T. et al. A whole-genome analysis of premature termination codons. Genomics 98, 337–342 (2011).
8. Rausell, A. et al. Analysis of stop-gain and frameshift variants in human innate immunity genes. PLoS Comput. Biol. 10, e1003757 (2014).
9. Rivas, M. A. et al. Human genomics. Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348, 666–669 (2015).
10. MacArthur, D. G. et al. A systematic survey of loss‑of‑function variants in human protein-coding genes. Science 335, 823–828 (2012).
11. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
12. Montgomery, S. B., Lappalainen, T., Gutierrez-Arcelus, M. & Dermitzakis, E. T. Rare and common regulatory variation in population-scale sequenced human genomes. PLoS Genet. 7, e1002144 (2011).
13. Huang, N., Lee, I., Marcotte, E. M. & Hurles, M. E. Characterising and predicting haploinsufficiency in the human genome. PLoS Genet. 6, e1001154 (2010).
14. Telenti, A. et al. Deep sequencing of 10,000 human genomes. Proc. Natl Acad. Sci. USA 113,
18. Hunt, R. C., Simhadri, V. L., Iandoli, M., Sauna, Z. E. & Kimchi-Sarfaty, C. Exposing synonymous mutations. Trends Genet. 30, 308–321 (2014).
19. Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
20. Rackham, O. J., Shihab, H. A., Johnson, M. R. & Petretto, E. EvoTol: a protein-sequence based evolutionary intolerance framework for disease-gene prioritization. Nucleic Acids Res. 43, e33 (2015).
21. Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
This is an influential paper describing context-dependent mutation rates across the genome. It forms the basis for several scores of essentiality.
22. Fadista, J., Oskolkov, N., Hansson, O. & Groop, L. LoFtool: a gene intolerance score based on loss‑of‑function variants in 60 706 individuals. Bioinformatics 33, 471–474 (2016).
23. Bartha, I. et al. The characteristics of heterozygous protein truncating variants in the human genome. PLoS Comput. Biol. 11, e1004647 (2015).
This study highlights rare heterozygous variants as an unexplored source of diversity of phenotypic traits and diseases. It describes the lack of compensation at expression level (haploinsufficiency).
24. Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).
This paper describes a large set of essential genes that are likely to have crucial functions but have not yet been characterized.
25. Dang, V. T., Kassahn, K. S., Marcos, A. E. & Ragan, M. A. Identification of human haploinsufficient genes and their genomic proximity to segmental duplications. Eur. J. Hum. Genet. 16, 1350–1357 (2008).
26. Khurana, E., Fu, Y., Chen, J. & Gerstein, M. Interpretation of genomic variants using a unified biological network approach. PLoS Comput. Biol. 9, e1002886 (2013).
32. Khuri, S. & Wuchty, S. Essentiality and centrality in protein interaction networks revisited. BMC Bioinformatics 16, 109 (2015).
33. Vinayagam, A. et al. Controllability analysis of the directed human protein interaction network identifies disease genes and drug targets. Proc. Natl Acad. Sci. USA 113, 4976–4981 (2016).
34. Georgi, B., Voight, B. F. & Bucan, M. From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLoS Genet. 9, e1003484 (2013).
35. Cannavo, E. et al. Genetic variants regulating expression levels and isoform diversity during embryogenesis. Nature 541, 402–406 (2017).
36. Jeong, H., Mason, S. P., Barabasi, A. L. & Oltvai, Z. N. Lethality and centrality in protein networks. Nature 411, 41–42 (2001).
37. Zhang, X., Acencio, M. L. & Lemke, N. Predicting essential genes and proteins based on machine learning and network topological features: a comprehensive review. Front. Physiol. 7, 75 (2016).
38. Blomen, V. A. et al. Gene essentiality and synthetic lethality in haploid human cells. Science 350, 1092–1096 (2015).
39. Hart, T. et al. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell 163, 1515–1526 (2015).
40. Wang, T., Wei, J. J., Sabatini, D. M. & Lander, E. S. Genetic screens in human cells using the CRISPR‑Cas9 system. Science 343, 80–84 (2014).
41. Rosenthal, N. & Brown, S. The mouse ascending: perspectives for human-disease models. Nat. Cell Biol. 9, 993–999 (2007).
42. Ayadi, A. et al. Mouse large-scale phenotyping initiatives: overview of the European Mouse Disease Clinic (EUMODIC) and of the Wellcome Trust Sanger Institute Mouse Genetics Project. Mamm. Genome 23, 600–610 (2012).
43. Justice, M. J. & Dhillon, P. Using the mouse to model human disease: increasing validity and reproducibility. Dis. Model. Mech. 9, 101–103 (2016).
44. Prado, A., Canal, I. & Ferrus, A. The haplolethal region at the 16F gene cluster of Drosophila melanogaster: structure and function. Genetics 151, 163–175 (1999).
11901–11906 (2016). 27. Steinberg, J., Honti, F., Meader, S. & Webber, C. 45. Howell, G. R., Munroe, R. J. & Schimenti, J. C.
15. Lek, M. et al. Analysis of protein-coding genetic variation Haploinsufficiency predictions without study bias. Transgenic rescue of the mouse t complex haplolethal
in 60,706 humans. Nature 536, 285–291 (2016). Nucleic Acids Res. 43, e101 (2015). locus Thl1. Mamm. Genome 16, 838–846 (2005).
This paper presents the identification by ExAC of 28. Shihab, H. A., Rogers, M. F., Campbell, C. & 46. Dickinson, M. E. et al. High-throughput discovery of
3,230 genes with near-complete depletion of Gaunt, T. R. HIPred: an integrative approach to novel developmental phenotypes. Nature 537,
predicted protein-truncating variants. This work predicting haploinsufficient genes. Bioinformatics 33, 508–514 (2016).
describes the widely used pLI score to identify 1751–1757 (2017). This is the largest study from the International
essential genes. 29. Giaever, G. & Nislow, C. The yeast deletion collection: Mouse Phenotyping Consortium. It identifies 410
16. Dewey, F. E. et al. Distribution and clinical impact of a decade of functional genomics. Genetics 197, lethal genes during the production of the first
functional variants in 50,726 whole-exome sequences 451–465 (2014). 1,751 mouse gene knockouts.
from the DiscovEHR study. Science 354, 30. Fraser, A. Essential Human Genes. Cell Syst. 1, 47. Dey, G., Jaimovich, A., Collins, S. R., Seki, A. &
aaf6814(2016). 381–382 (2015). Meyer, T. Systematic discovery of human gene function
17. Chamary, J. V., Parmley, J. L. & Hurst, L. D. Hearing 31. Dickerson, J. E., Zhu, A., Robertson, D. L. & and principles of modular organization through
silence: non-neutral evolution at synonymous sites in Hentges, K. E. Defining the role of essential genes in phylogenetic profiling. Cell Rep. http://dx.doi.
mammals. Nat. Rev. Genet. 7, 98–108 (2006). human disease. PLoS ONE 6, e27368 (2011). org/10.1016/j.celrep.2015.01.025 (2015).

© 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

48. Edwards, A. M. et al. Too many roads not taken. Nature 470, 163–165 (2011).
49. Ganna, A. et al. Quantifying the impact of rare and ultra-rare coding variation across the phenotypic spectrum. Preprint at http://biorxiv.org/content/early/2017/06/09/148247 (2017).
50. Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).
51. Stenson, P. D. et al. The Human Gene Mutation Database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum. Genet. 133, 1–9 (2014).
52. Gudbjartsson, D. F. et al. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 47, 435–444 (2015).
53. Narasimhan, V. M., Xue, Y. & Tyler-Smith, C. Human knockout carriers: dead, diseased, healthy, or improved? Trends Mol. Med. 22, 341–351 (2016).
54. Narasimhan, V. M. et al. Health and population effects of rare gene knockouts in adult humans with related parents. Science 352, 474–477 (2016).
55. Sulem, P. et al. Identification of a large set of rare complete human knockouts. Nat. Genet. 47, 448–452 (2015).
56. Lim, E. T. et al. Distribution and medical impact of loss‑of‑function variants in the Finnish founder population. PLoS Genet. 10, e1004494 (2014).
57. Saleheen, D. et al. Human knockouts and phenotypic analysis in a cohort with a high rate of consanguinity. Nature 544, 235–239 (2017).
This provides a roadmap for a ‘human knockout project’ to understand the phenotypic consequences of complete disruption of genes in humans.
58. Nagy, E. & Maquat, L. E. A rule for termination-codon position within intron-containing genes: when nonsense affects RNA abundance. Trends Biochem. Sci. 23, 198–199 (1998).
59. Lykke-Andersen, S. & Jensen, T. H. Nonsense-mediated mRNA decay: an intricate machinery that shapes transcriptomes. Nat. Rev. Mol. Cell Biol. 16, 665–677 (2015).
60. Zhang, F. & Lupski, J. R. Non-coding genetic variants in human disease. Hum. Mol. Genet. 24, R102–R110 (2015).
61. Esteller, M. Non-coding RNAs in human disease. Nat. Rev. Genet. 12, 861–874 (2011).
62. Makrythanasis, P. & Antonarakis, S. E. Pathogenic variants in non-protein-coding sequences. Clin. Genet. 84, 422–428 (2013).
63. Gordon, C. T. & Lyonnet, S. Enhancer mutations and phenotype modularity. Nat. Genet. 46, 3–4 (2014).
64. Smedley, D. et al. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in mendelian disease. Am. J. Hum. Genet. 99, 595–606 (2016).
65. Harmston, N., Baresic, A. & Lenhard, B. The mystery of extreme non-coding conservation. Phil. Trans. R. Soc. B 368, 20130021 (2013).
66. Wright, J. B. & Sanjana, N. E. CRISPR screens to discover functional noncoding elements. Trends Genet. 32, 526–529 (2016).
67. Consortium, E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
68. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
69. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
70. Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48, 214–220 (2016).
71. Khurana, E. et al. Role of non-coding sequence variants in cancer. Nat. Rev. Genet. 17, 93–108 (2016).
72. Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat. Genet. 48, 349–355 (2016).
73. di Iulio, J. et al. The human non-coding genome defined by genetic diversity. Nat. Genet. (in the press) (2017).
74. Fulco, C. P. et al. Systematic mapping of functional enhancer-promoter connections with CRISPR interference. Science 354, 769–773 (2016).
75. Korkmaz, G. et al. Functional genetic screens for enhancer elements in the human genome using CRISPR‑Cas9. Nat. Biotechnol. 34, 192–198 (2016).
76. Sanjana, N. E. et al. High-resolution interrogation of functional elements in the noncoding genome. Science 353, 1545–1549 (2016).
77. Zhu, S. et al. Genome-scale deletion screening of human long non-coding RNAs using a paired-guide RNA CRISPR‑Cas9 library. Nat. Biotechnol. 34, 1279–1286 (2016).
78. Kathiresan, S. Developing medicines that mimic the natural successes of the human genome: lessons from NPC1L1, HMGCR, PCSK9, APOC3, and CETP. J. Am. Coll. Cardiol. 65, 1562–1566 (2015).
79. Este, J. A. & Telenti, A. HIV entry inhibitors. Lancet 370, 81–88 (2007).
80. Kuehn, H. S. et al. Immune dysregulation in human subjects with heterozygous germline mutations in CTLA4. Science 345, 1623–1627 (2014).
This is a report of haploinsufficiency linked to a severe immune disease in several unrelated adults that escaped diagnosis for years. It serves as a model of the syndromes to come.
81. Sabatine, M. S. et al. Evolocumab and clinical outcomes in patients with cardiovascular disease. N. Engl. J. Med. 376, 1713–1722 (2017).
This is a clinical trial of a drug built on the knowledge of the cardiovascular phenotype of a human PCSK9 truncation.
82. Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at http://biorxiv.org/content/early/2017/06/12/148353 (2017).
83. Sohail, M. et al. Negative selection in humans and fruit flies involves synergistic epistasis. Science 356, 539–542 (2017).
84. Lindblad-Toh, K. et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011).
85. Kellis, M. et al. Defining functional DNA elements in the human genome. Proc. Natl Acad. Sci. USA 111, 6131–6138 (2014).
86. Wang, T. et al. Identification and characterization of essential genes in the human genome. Science 350, 1096–1101 (2015).

Acknowledgements
The authors thank Drs Ewen Kirkness and Michael Hicks for valuable comments. The authors are employees of Human Longevity, Inc.

Author contributions
All authors substantially contributed to discussion of content and to reviewing/editing the manuscript before submission. I.B., J.d.I. and A.T. researched data for the article and contributed to writing the manuscript.

Competing interests statement
The authors declare competing interests: see Web version for details.

Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

FURTHER INFORMATION
Cohort Study of Lausanne (CoLaus): http://www.colaus.ch/
Database of Essential Genes (DEG): http://www.essentialgene.org
dbSNP: https://www.ncbi.nlm.nih.gov/projects/SNP/
Exome Aggregation Consortium (ExAC): http://exac.broadinstitute.org
Exome Sequencing Project (ESP): http://evs.gs.washington.edu/EVS/
Fathmm variant effect predictor: http://fathmm.biocompute.org.uk/
Loss‑Of‑Function Transcript Effect Estimator (LOFTEE): https://github.com/konradjk/loftee
Mouse Genome Informatics (MGI): http://www.informatics.jax.org
SnpEff variant effect predictor: http://snpeff.sourceforge.net/
UK10K Project: https://www.uk10k.org/
VEP-LOFTEE variant effect predictor: http://uswest.ensembl.org/info/docs/tools/vep/index.html

SUPPLEMENTARY INFORMATION
See online article: S1 (box) | S2 (table) | S3 (table)

ALL LINKS ARE ACTIVE IN THE ONLINE PDF



