Professional Documents
Culture Documents
doi: 10.1093/bib/bbaa298
Problem Solving Protocol
Abstract
Motivation: Annotating genetic variants from summary statistics of genome-wide association studies (GWAS) is crucial for
predicting risk genes of various disorders. The multimarker analysis of genomic annotation (MAGMA) is one of the most
popular tools for this purpose, where MAGMA aggregates signals of single nucleotide polymorphisms (SNPs) to their nearby
genes. In biology, SNPs may also affect genes that are far away in the genome, thus missed by MAGMA. Although different
upgrades of MAGMA have been proposed to extend gene-wise variant annotations with more information (e.g. Hi-C or
eQTL), the regulatory relationships among genes and the tissue specificity of signals have not been taken into account.
Results: We propose a new approach, namely network-enhanced MAGMA (nMAGMA), for gene-wise annotation of variants
from GWAS summary statistics. Compared with MAGMA and H-MAGMA, nMAGMA significantly extends the lists of genes
that can be annotated to SNPs by integrating local signals, long-range regulation signals (i.e. interactions between distal
DNA elements), and tissue-specific gene networks. When applied to schizophrenia (SCZ), nMAGMA is able to detect more
risk genes (217% more than MAGMA and 57% more than H-MAGMA) that are involved in SCZ compared with MAGMA and
H-MAGMA, and more of nMAGMA results can be validated with known SCZ risk genes. Some disease-related functions (e.g.
the ATPase pathway in Cortex) are also uncovered in nMAGMA but not in MAGMA or H-MAGMA. Moreover, nMAGMA
provides tissue-specific risk signals, which are useful for understanding disorders with multitissue origins.
Anyi Yang is currently a PhD candidate at the Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, China. She is interested
in bioinformatics.
Jingqi Chen is currently a young Associate Professor at the Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, China.
She studies the noncoding genome in psychiatric disorders, by integrating multiomics data.
Xing-Ming Zhao is currently a Professor affiliated with ISTBI, KLCNBI and RIICS, Fudan University, China. His current research interests include data mining
and computational systems biology.
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
1
2 Yang et al.
Recently, some computational approaches have been pro- Materials and Methods
posed to aggregate genetic variants with weak effects to nearby
GWAS data
genes, among which the multimarker analysis of genomic anno-
tation (MAGMA) is most widely used (https://ctg.cncr.nl/softwa The two SCZ GWAS summary statistics datasets were down-
re/magma) [6]. Except for aggregating variants to genes, MAGMA loaded from Psychiatric Genomics Consortium (https://www.me
than ten and less than 1500 genes were considered. The details and the coefficient β reflects the association of the SNP with
of the gene-sets can be found in Supplementary Table S1. the phenotype and u0k yields the intercept. In a similar way, we
refined the model described in Eq. (1) as follows:
Network-enhanced MAGMA (nMAGMA)
Z = Sβ + ε (3)
assign SNPs to genes more thoroughly and accurately. Figure 1 the indirect correlation with an intermediate gene which might
provides a schematic view of the framework. In brief, several be noncoding, such as a lincRNA. Two genes tend to have a
types of supporting evidence for the connections between indi- high TOM value when they highly interconnect with each other
vidual SNPs and genes, including network-induced informa- directly and indirectly [16, 37]. The main difference between
tion, significant eQTL results and Hi-C data, were used in the nMAGMA and contemporary MAGMA-based strategies (i.e. H-
gene-wise variant annotation to generate tissue-specific anno- MAGMA and eMAGMA) is that it integrates network information
tation files for downstream gene and gene-set analysis, under which is tissue-specific, resulting in annotations that are bio-
the powerful framework of MAGMA. We refer to this enhanced logically and topologically more meaningful. Another advantage
framework as nMAGMA. After retrieving SNPs from the GWAS of nMAGMA is that it can be extended feasibly by adding new
summary statistics for interested traits, for fundamental anno- genomic signals and network information. The source code
tation, the SNPs located in the gene regions (the region between for nMAGMA can be freely downloaded at https://github.com/
the transcription start site and the ending site of that gene) sldrcyang/nMAGMA.
are assigned to their nearest genes based on their genomic
coordinates. For increasing the utilization of SNPs, especially
those in deep intergenic regions, additional SNP-gene pairs are Application of nMAGMA to SCZ GWAS data
generated based on chromatin interaction and significant eQTL
We applied nMAGMA to the summary statistics of two large
relationships (Materials and Methods section). We reason that
SCZ GWAS studies of European ancestry [2, 21]. Considering the
signals captured by Hi-C and eQTL may explain a large portion
availability of Hi-C data, we only focused on the four tissues
of heritability which cannot be explained by simple location-
previously found related to SCZ, i.e. Cortex, Hippocampus, Liver
based annotation. Importantly, we adopt WGCNA in order to
and Small Bowel [10, 18–20]. In total, 20 635 protein-coding genes
explicitly infer relationships between genes in the biological
and 17 224 gene-sets were considered for further analysis, and
networks. The TOM value (i.e. the corresponding value from the
the ‘SNP-wise = mean’ parameter from MAGMA was adopted
TOM) between two protein-coding genes is affected not only by
through all analysis.
the direct expressional correlation between them, but also by
nMAGMA: network-enhanced MAGMA 5
Compared with the results by the combinations of MAGMA chromatin interaction profiles from Hi-C can help detect SNPs
and Hi-C as well as eQTL (denoted as MAGMA+ eQTL, MAGMA+ from regions far away from the genes. Figure 2D shows the
Hi-C and MAGMA+ Hi-C+ eQTL), nMAGMA (equals to MAGMA+ Venn diagram of the risk genes inferred by nMAGMA across four
eQTL+ Hi-C+ gene network) can detect more risk genes across tissues, where we can see the tissue specificity of the risk genes
all four tissues, as shown in Figure 2A (the descriptive statistics detected in a certain tissue. Among the 683 genes that can be
of nMAGMA in each SNP-to-Gene annotation step were shown detected in all four tissues of nMAGMA, 591 of them (86.5%) can
in Supplementary Table S2). From the results, we can see that also be detected by MAGMA, suggesting that these genes may be
both Hi-C and eQTL can significantly extend the list of genes that indeed affected by SNPs inside it.
can be mapped to the coordinates of SNPs, while the network
information can further expand the number of significant genes
(by 11∼25%) compared with MAGMA+ Hi-C+ eQTL. As shown in
nMAGMA discovers more risk genes for SCZ
Figure 2B, we also noticed that the median number of SNPs that To evaluate the performance of nMAGMA, we compared it with
can be mapped to each gene was also significantly increased MAGMA and H-MAGMA, a variant of MAGMA that also utilizes
by MAGMA+ eQTL, MAGMA+ Hi-C and MAGMA+ Hi-C+ eQTL Hi-C information. Since the results by H-MAGMA are only avail-
compared with MAGMA, which confirms that some remote SNPs able in Cortex, we compared nMAGMA against H-MAGMA in
can be mapped to genes with the help of Hi-C and eQTL. When Cortex, while MAGMA has no tissue-specific results (same in all
looking at the overlap of SNPs that can be mapped to genes four tissues). Figure 3 shows the Venn diagrams of the results
by Hi-C or eQTL, we found that only a few SNPs (about 5%) by the three approaches across four tissues. We can see that
can be detected by both datasets, as shown in Figure 2C, which nMAGMA can detect more risk genes than both MAGMA and
implies that Hi-C and eQTL can complement with each other H-MAGMA, and has increased the identified risk genes by 57%
when annotating the SNP-gene pairs, where the cis-eQTLs can compared to H-MAGMA and by 67∼100% compared to MAGMA
annotate SNPs from the proximal regions of genes while the (217% for a union of four tissues) (Figure 3A–D). nMAGMA has
6 Yang et al.
more consistent results with MAGMA, replicating about 89∼93% from [21] and 108 from [2] with one overlap), which were then
of the results by MAGMA (99% for a union of four tissues results), used as input for iRIGS, resulting in 105 high-confidence risk
while only 60% of MAGMA-detected genes can be replicated genes identified by iRIGS. Among the 105 risk genes given by
by H-MAGMA. Fifty-eight percent of the genes detected by H- iRIGS, nMAGMA can successfully recover 49, 44, 41 and 45 genes
MAGMA can be found by nMAGMA. From the results above, we in Cortex, Hippocampus, Liver, and Small Bowel, respectively.
can see that most of the results from MAGMA and H-MAGMA can On the other hand, MAGMA can recover 41 iRIGS risk genes and
be replicated by nMAGMA, which implies that the results from H-MAGMA can also recover 41 iRIGS risk genes. The validation of
nMAGMA are confident and reliable. The detailed association of the results by nMAGMA with known or other predicted risk genes
significant genes detected by the three methods can be found imply that nMAGMA is reliable for gene-wise SNP annotations.
in a supplementary table available at https://github.com/sldrcya By looking over the risk genes that can be detected by all
ng/nMAGMA. three tools, we noticed that most of them have a function
To see whether nMAGMA can detect more confident risk known related to SCZ, e.g. the calcium channel and signaling
genes, we validated the results by nMAGMA, H-MAGMA and genes (CACNA1C, CACNB2), the glutamatergic neurotransmission
MAGMA with known SCZ risk genes collected from public genes (GRIN2A, GRM3) and the neurogenesis genes (SATB2, SOX2).
databases. For fair comparison, we compared the three By examining the unique risk genes inferred by nMAGMA, we
approaches to see how much of their top predictions can be noticed that many of them have been reported related to SCZ.
validated. Table 1 shows the results of the three tools with For instance, the gene H2BC9 (Cortex: P = 5 × 10−12 , Hippocampus:
respect to their top 100, 500, 1500 and 2500 predictions. We P = 7 × 10−10 ) was differentially expressed in lymphoblastoid cell
can see that more of the results by nMAGMA can be validated lines of SCZ cases versus controls [38]. H2BC9 is a member of
with known SCZ risk genes, whereas the most confident risk the H2B histone family and is located in the extended major
genes can be detected based on their physical locations to the histocompatibility complex (xMHC) region, which is found to be
significant SNPs detected in GWAS, likely due to that many the locus containing several most significant variants for SCZ
known SCZ genes are identified through previous GWAS studies. in a GWAS study [39]. In addition to the brain tissues, Liver
In addition, we validated our results with those by iRIGS, a useful and Small Bowel have also been found involved in SCZ from
approach to infer genes from risk loci of GWAS by integrating our results. nMAGMA predicted AKT1 (v-akt murine thymoma
multiomics data (e.g. Hi-C and de novo mutations) and gene viral oncogene homolog 1) from Liver (P = 1.12 × 10−6 ) and SYN2
networks derived from Gene Ontology [14]. A total of 116 risk (Synapsin II) from Small Bowel (P = 1.85 × 10−11 ) to be risk genes
loci were collected from the GWAS datasets that we used (9 of SCZ. AKT1 has been reported to be a SCZ risk gene related
nMAGMA: network-enhanced MAGMA 7
Table 1. Comparison of the results by MAGMA, H-MAGMA and nMAGMA, and the numbers of genes that can be validated with known SCZ risk
genes
Methods Top 100 Top 500 Top 1500 Top 2500 iRIGS genes
to neurocognition [40], and cognitive deficit is a typical symp- SCZ in Prefrontal Cortex [56]. Most of the gene-sets detected
tom in SCZ [41]. The gene SYN2 functions in synaptogenesis, in both Cortex and Hippocampus have been found related
and the dysfunction of synaptic transmission is known in the to SCZ, such as those enriched in biological processes on
fundamental pathology of SCZ [42, 43]. The risk genes inferred synapse, oligodendrocyte and fragile-X mental retardation
above may confirm the involvement of non-brain tissues in SCZ protein (FMRP). Oligodendrocytes belong to glial cells in the
[10, 20]. nervous system, and the dysfunction of oligodendrocytes has
been frequently found associated with SCZ and bipolar disorders
[57, 58]. FMRP affects the development and maturation of
nMAGMA identifies more meaningful gene-sets for SCZ synapses between neurons, and the loss of FMRP can inhibit
neuronal translation and synaptic function, causing psychiatric
With the genes identified above by the three approaches, we next symptoms including autistic and schizophrenic features [59, 60].
investigated the gene-sets enriched in those genes to see how Among the two non-brain tissues, we noticed that the immune
the genes are involved in SCZ. The genes inferred by nMAGMA functions (e.g. T cell, B cell and Interferon) were extensively
from four tissues were enriched in 32 (Cortex), 67 (Hippocam- affected in Liver, which involves multiple cells like lymphocytes
pus), 32 (Liver) and 38 (Small Bowel) gene-sets, respectively. that are related to the immunoreaction processes [61]. This
The genes inferred by MAGMA were enriched in 19 gene-sets suggests that the dysfunction of the immune processes in Liver
while only four gene-sets were enriched with genes inferred may be related to SCZ. What’s more, the metabolic processes
by H-MAGMA (Supplementary Table S3). The gene-sets detected (RNA, acid and amines metabolism) were found significant in
by MAGMA include six synaptic and neuronal gene-sets and Small Bowel, where the abnormal metabolic processes caused by
two RNA-binding gene-sets, while the gene-sets detected by H- unhealthy intestinal microbiota can influence brain chemistry
MAGMA have similar functions. It has been reported that the and contributes to psychiatric disorders [62, 63]. Our observation
dysfunction of synaptic and neuronal genes is involved in SCZ of SCZ-associated gene-sets in Liver and Small Bowel implies
and Bipolar disorders [44], and the copy number variation of RNA the association between non-brain tissues and psychiatric
binding proteins like RBFOX1 contributes to the pathology of SCZ disorders.
and autism [45, 46].
We are more interested in the gene-sets detected by
nMAGMA but not by MAGMA and H-MAGMA. Compared to The network modules uncover biological processes
the general synaptic functions enriched in the genes detected implicated in SCZ
by MAGMA and H-MAGMA, the results by nMAGMA imply
that both presynaptic and postsynaptic processes are involved We further looked at the gene networks constructed by WGCNA,
in SCZ, which is consistent with previous findings that where two genes were linked if their TOM is no less than 0.15.
postsynaptic glutamatergic process and molecules regulating We focused on the genes significant in nMAGMA. We are more
presynaptic transmitter release are both implicated in SCZ interested in the module structures consisting of these intensely
[47, 48]. Interestingly, nMAGMA identified many gene-sets connected genes, which were detected by Molecular Complex
related to CHD8 (chromodomain-helicase-DNA binding protein Detection (MCODE) [64] with default parameters, and only the
8). CHD8 is a chromatin modifier that encodes the ATP- modules with more than 10 genes were kept. At last, a total
dependent chromatin-remodeling factors and was found related of 16 modules were detected across the four tissues as shown
to neurodevelopmental disorders like SCZ and autism by in Table 3. The functional terms enriched in the 16 modules
bothering gene expression and regulation in the human brain were detected with the Database for Annotation, Visualization
[49–51]. It is noteworthy that the gene-sets associated with other and Integrated Discovery (version 6.8) [65], where only the most
non-psychiatric diseases (e.g. Breast cancer, Crohn’s disease enriched term was shown for each module in Table 3 for clarity.
and Liver cancer) were also found significant in nMAGMA, in Figure 4 shows the six modules detected in Cortex, which
accordance with previous findings that people with SCZ also was visualized with Cytoscape (Version 3.7.2) [66]. Module 1
suffer from these non-psychiatric disorders at the same time contained 181 genes and was enriched with SCZ risk genes
[52–55]. (including 49 risk genes). Module 1 was enriched in myelin
Table 2 shows the functional categories of the gene-sets sheath (P = 3.03 × 10−9 ), which is a membrane surrounding the
detected by nMAGMA across four tissues. We noticed that the axons of neurons and is the outgrowth of glial cells. Abnormality
ATPase-related gene-sets were detected in Cortex but not in of myelin has been reported to be causally involved in SCZ in
the other three tissues. ATPase like Na+/K+ ATPase-α1 is a the Prefrontal Cortex [67], where several myelin-related genes
potential modulator of glutamate ingestion in the brain and were also found to reduced expression in SCZ patients com-
has been reported to contribute to the pathophysiology of pared to controls [68]. What’s more, the gene RTN4 in Module
8 Yang et al.
Table 2. Biological processes underlying gene-sets detected by nMAGMA for SCZ across tissues
Table 3. Details of the 16 modules detected by nMAGMA across four tissues, and their most enriched functional terms
1 that encodes myelin-related proteins was detected differential first-choice for SCZ treatment) was found before [73]. Altogether,
expression in the Cortex of SCZ by inhibiting neurodevelopment our analysis validates that network metrics can help to identify
in the previous research [69]. The myelin-related function was new groups of interconnected genes with known roles in SCZ
also enriched in module 4 of Cortex and module 3 of Hippocam- and reveal hidden enrichments in both brain and non-brain
pus. tissues.
The functional term of the structural constituent of ribo-
some was found most enriched in Hippocampus (module 1) and
The cause of inflation of test statistics
Small Bowel (module 3), as well as an immune-related term
in the gene-set analysis
of defense response to virus in Liver. Visualized modules were
shown in Supplementary Figure S1. These agree with former We also constructed a model to distinguish the confounding
studies on functional enrichment of differentially expressed biases from polygenicity in gene-set analysis results (Materials
genes or long noncoding RNA in SCZ, where pathways associated and Methods section). Both of the GWAS summary datasets
with ribosome and immune response were also enriched [70, used here showed an inflation of gene-set associations in
71]. Another, the functional component of lipid droplet in Small MAGMA and nMAGMA (Cortex) (Figure 5). The low intercept
Bowel (P = 5.81 × 10−8 , module 1) is a main organelle that has (1.13 and 1.02 in MAGMA and nMAGMA, respectively) of dataset1
important roles in regulating lipid metabolism [72]. Slight hurt of (Supplementary Table S4) indicated that at most a small portion
lipid metabolism in SCZ cases under risperidone monotherapy (a of the inflation was caused by the true multigene pathogenicity
nMAGMA: network-enhanced MAGMA 9
pathogenesis in SCZ. Therefore, the downstream workflow origins of diseases. Strong evidence of the latent role of non-
used for gene-set analysis is not necessary [59], which is only brain tissues also presents in SCZ in our results, while non-brain
designed for confounding correction, namely large intercept tissues are less discussed in previous studies on psychiatric
in Eq. (6). However, our testing model is not strictly necessary diseases. After characterizing the topology and subnetwork
in gene-set analysis, as only dozens of risk gene-sets exist in structures of nMAGMA-detected genes, we show that network
gene-set analysis, in contrast to thousands of causal SNPs in information can be used to uncover hidden disease-related
GWAS. We have provided this model as an optional add-on enrichments.
for perspective users who expect slightly more rigorous and Though the results of nMAGMA are excellent in many
conservative results. aspects, there are still some potential directions for further
expansion of our work. First, we only focused on four relevant
tissues in this work. nMAGMA can be extended to other
tissues by using other functional genomic resources such as
Discussion BrainVar [79]. Second, the multidimensional data from matched
In this paper, we provide a method leveraging multidimensional samples and tissues will reduce data heterogeneity, which are
resources to extract comprehensive biological information from unfortunately rarely available. With more data available in the
GWAS. The challenge is to incorporate tissue-specific gene future, the performance of nMAGMA is expected to be further
networks into the annotation, which have been proven helpful improved. Third, more gene interaction resources can be utilized
for prioritizing candidate disease genes that may otherwise for the gene-wise annotation. Other types of networks, for
be ignored by conventional approaches. Instead of roughly example, the protein–protein interaction networks [74] and the
computing correlations among genes, nMAGMA specifies the disease gene networks [75], may be further incorporated into
gene regulatory relationships from TOM matrix, which is more nMAGMA.
explicit and biologically meaningful to promote inference Altogether, nMAGMA provides a strategy for using network
accuracy. To cover more genomic signals with high confidence, information to select potentially risk genes and identify the rel-
Hi-C and eQTL identified SNP-gene relationships were also evant biological function. We hope this method can be useful for
taken into account in nMAGMA. Unlike many other studies deciphering disease mechanisms and selecting therapeutic drug
that explore the expression profiles of selected significant candidates, and that it can provide a basis for more innovative
genes, nMAGMA directly incorporates the expression data into methods in this area.
the upstream analysis, making the downstream analysis more
reliable.
Key Points
Empowered by the integration of gene networks and
multiomics information, nMAGMA shows greater ability in • We proposed a novel network-enhanced method
detecting risk signatures for SCZ compared with MAGMA and called nMAGMA, leveraging multi-dimensionally
H-MAGMA. Featuring tissue-specific results, nMAGMA also con- tissue-specific resources.
tributes to elucidating the underpinnings of the tissue-specific
10 Yang et al.
Acknowledgments
• When applied to SCZ, the great improvement of
nMAGMA was shown for identification of risk We deeply appreciate the researchers comprise the Psy-
genes and gene-sets compared with MAGMA and chiatric Genomics Consortium (PGC), and the hundreds of
H-MAGMA. thousands of contributors who have provided the data.
• Results of nMAGMA demonstrated rich tissue speci-
ficity and revealed hidden signatures in both brain and
non-brain tissues. Funding
This work was supported by the National Natural
Science Foundation of China (NSFC) [61932008, 61772368];
Shanghai Municipal Science and Technology Major Project
[2018SHZDZX01]; Shanghai Science and Technology Innova-
tion Fund [19511101404]; and Zhangjiang Lab.
Supplementary Data
Supplementary data are available online at https://academic.
oup.com/bib. References
1. Pardinas AF, Holmans P, Pocklington AJ, et al. Common
schizophrenia alleles are enriched in mutation-intolerant
genes and in regions under strong background selection. Nat
Genet 2018;50:381–9.
Availability
2. Schizophrenia Working Group of the Psychiatric Genomics
Codes and data used in this work are freely available at C. Biological insights from 108 schizophrenia-associated
https://github.com/sldrcyang/nMAGMA. genetic loci. Nature 2014;511:421–7.
nMAGMA: network-enhanced MAGMA 11
3. Cooper GM, Goode DL, Ng SB, et al. Single-nucleotide evo- 24. Hunt SE, McLaren W, Gil L, et al. Ensembl variation resources.
lutionary constraint scores highlight disease-causing muta- Database (Oxford) 2018;2018. doi: 10.1093/database/bay119.
tions. Nat Methods 2010;7:250–1. 25. Yang D, Jang I, Choi J, et al. 3DIV: a 3D-genome interaction
4. Cooper GM, Shendure J. Needles in stacks of needles: finding viewer and database. Nucleic Acids Res 2018;46:D52–7.
disease-causal variants in a wealth of genomic data. Nat Rev 26. Consortium EP. An integrated encyclopedia of DNA ele-
network linked to brain development and autism. Cell Rep 60. Fatemi SH, Folsom TD. GABA receptor subunit distribution
2014;6:1139–52. and FMRP–mGluR5 signaling abnormalities in the cerebel-
46. Xu B, Roos JL, Levy S, et al. Strong association of de novo lum of subjects with schizophrenia, mood disorders, and
copy number mutations with sporadic schizophrenia. Nat autism. Schizophr Res 2015;167:42–56.
Gen 2008;40:880–5. 61. Nemeth E, Baird AW, O’Farrelly C. Microanatomy of the liver