
Journal of the American Medical Informatics Association, 25(11), 2018, 1452–1459

doi: 10.1093/jamia/ocy117
Advance Access Publication Date: 23 October 2018
Research and Applications

Downloaded from https://academic.oup.com/jamia/article/25/11/1452/5142853 by Technische Informationsbiliothek (TIB) user on 29 November 2023



Heterogeneous network embedding for identifying symptom candidate genes
Kuo Yang,1 Ning Wang,1 Guangming Liu,1 Ruyu Wang,1 Jian Yu,1 Runshun Zhang,2 Jianxin Chen,3 and Xuezhong Zhou1,4
1School of Computer and Information Technology and Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China
2Guanganmen Hospital, China Academy of Chinese Medical Sciences, Beijing, China
3Beijing University of Chinese Medicine, Beijing, China
4Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing, China

Corresponding Author: Xuezhong Zhou, PhD, Beijing Jiaotong University, No.3 Shangyuancun, Haidian District, Beijing,
100044 (xzzhou@bjtu.edu.cn)
Received 8 March 2018; Revised 24 July 2018; Editorial Decision 9 August 2018; Accepted 11 August 2018

ABSTRACT
Objective: Investigating the molecular mechanisms of symptoms is a vital task in precision medicine to refine disease taxonomy and improve the personalized management of chronic diseases. Although there are abundant experimental studies and computational efforts to obtain the candidate genes of diseases, the identification of symptom genes is rarely addressed. We curated a high-quality benchmark dataset of symptom-gene associations and proposed a heterogeneous network embedding method for identifying symptom genes.
Methods: We proposed a heterogeneous network embedding representation algorithm, which constructed a heterogeneous symptom-related network that integrated symptom-related associations and applied an embedding representation algorithm to obtain the low-dimensional vector representation of nodes. By measuring the relevance between symptoms and genes via calculating the similarities of their vectors, the candidate genes of given symptoms can be obtained.
Results: A benchmark dataset of 18 270 symptom-gene associations between 505 symptoms and 4549 genes was curated. We compared our method to baseline algorithms (FSGER and PRINCE). The experimental results indicated our algorithm achieved a significant improvement over the state-of-the-art method PRINCE, with precision and recall improved by 66.80% (0.844 vs 0.506) and 53.96% (0.311 vs 0.202), respectively, for TOP@3 and association precision improved by 37.71% (0.723 vs 0.525).
Conclusions: The experimental validation of the algorithms and the literature validation of typical symptoms indicated our method achieved excellent performance. Hence, we curated a prediction dataset of 17 479 symptom-candidate gene associations. The benchmark and prediction datasets have the potential to promote investigations of the molecular mechanisms of symptoms and provide candidate genes for validation in experimental settings.

Key words: heterogeneous network embedding, symptom gene identification, network medicine

© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com

INTRODUCTION
Symptoms and signs (called symptoms in brief) are the primary evidence for clinical diagnosis and disease classification.1 As a critical layer connecting exposomes and genomes in the knowledge network, symptoms play an important role in precision medicine to refine disease taxonomy.2 In recent years, increasingly more phenotype (disease and symptom) databases, such as Human Phenotype Ontology (HPO),3 Human Disease Ontology (DO),4 and Orphanet Rare Disease Ontology (Orphanet)5 have been constructed. Most biomedical researchers are mainly focused on analyzing and understanding the molecular mechanism of disease phenotypes.6–8 The investigation of the underlying molecular mechanisms of symptom phenotypes has rarely been addressed, except for disease conditions overlapping with symptom phenotypes, such as obesity9 and pain.10 In addition, to impel the study of genome and phenotypes, the U.S. National Human Genome Research Institute initiated 2 projects, eMERGE,11 which correlates whole genome scans with phenotype data extracted from the electronic medical record systems and PhenX12 which provides investigators with high-priority, well-established, low-burden standard measures to collect phenotypic and environmental data for large-scale genomic studies. Jyotishman et al13 adopted multiple standards and biomedical terminologies to promote cross-study pooling of data and complex genotype-phenotype associations detection.

Similar to the computational approaches for disease-gene prediction, symptom gene identification is also a key task for revealing the underlying molecular mechanisms of symptoms. Gene prediction of given diseases requires extensive experiments to test hundreds of candidate genes in a wet lab.14 In fact, experimental gene identification for symptoms and diseases is a difficult and time-consuming task.15 The success of network-based computational methods for identifying disease genes8,14,16 demonstrated that it is an effective method for disease gene prediction. There exists preliminary work1 that indicates it is feasible to use a network propagation approach to predict the candidate genes of symptoms and complicated factors involved in the influence of prediction performance.17 In addition, recent increasing curation of large-scale symptom-related association data, such as disease-gene associations (eg OMIM,18 DisGeNet19 and Malacards20) symptom-disease associations (Disease Ontology,4 HPO3 and Orphanet5) and protein-protein interactions (HPRD,21 BioGRID,22 and IntAct23) offer a rare opportunity for the development of computational approaches. However, to substantially promote these efforts, we still need to address 2 essential tasks: curation of a high-quality benchmark dataset and making full use of the heterogeneous symptom-related indirect association data, such as symptom-disease associations, disease-gene associations and protein-protein interactions to improve the symptom gene prediction performance.

Here, by integrating symptom-disease and disease-gene associations, we curated a benchmark dataset of symptom-gene associations. We proposed a deep embedding representation algorithm on a heterogeneous symptom-related network to identify symptom genes (Figure 1). First, we constructed a heterogeneous symptom-related network, which includes symptom-disease, disease-gene and protein-protein associations. Then, the network embedding representation algorithm was applied to construct low-dimensional vector representation (LVR) of nodes (symptoms and genes) in the network. By calculating the relevance between symptoms and genes that were measured by the similarities of their vectors, the candidate genes of symptoms can be obtained. We compared the prediction performance of our algorithm to the baseline algorithms (FSGER and PRINCE). The experimental results indicated our algorithm achieved a significant improvement over baseline algorithms. Finally, a high-quality prediction dataset of symptom-candidate gene associations was curated based on the results predicted by our method.

METHODS
Dataset
Disease-gene associations
Disease-gene associations were collected from the DisGeNet19 and Malacards20 databases (Figure 2). First, we extracted 130 820 curated disease-gene associations between 13 074 diseases with UMLS code (CUI) and 8947 genes from the DisGeNet database, which integrates disease-gene associations from UniProt,24 PsyGeNET,25 ClinVar,26 Orphanet,5 the GWAS Catalog,27 CTD28 and HPO3 databases. Second, we collected 73 064 disease-gene associations between 6118 diseases with CUIs and 8370 genes from the Malacards database. To unify and integrate the disease terms, we mapped the original disease identifiers of the 2 databases to Unified Medical Language System (UMLS) codes. Finally, the 2 data sources were integrated to obtain 196 397 disease-gene associations that include 16 594 unique diseases and 11 497 unique genes.

Protein-protein interactions
The protein-protein interactions (PPIs) were collected from Menche et al,29 and include 213 888 records with 15 964 unique proteins. These data are integrated PPI data derived from multiple data sources, such as HPRD,21 BioGRID,22 IntAct23 and PINA.30

Disease-symptom associations
Disease-symptom associations were collected from the DO,4 HPO3 and Orphanet5 databases (Figure 2). To unify the disease terms from the different datasets, we mapped the original disease codes to UMLS codes. We collected 1008 disease-symptom associations between 204 diseases and 417 symptoms from the DO database, 87 442 disease-symptom associations between 4366 diseases and 6176 symptoms from the HPO database, and 35 039 disease-symptom associations between 2391 diseases and 3721 symptoms from the Orphanet database. By integrating the 3 data sources, we finally obtained 100 305 distinct disease-symptom associations (DSA) between 5605 diseases and 6935 symptoms.

Benchmark dataset construction of symptom-gene associations
By integrating symptom-related and gene-related association data, we curated a benchmark dataset of symptom-gene associations (called BDSG) (Figure 2). In particular, to obtain the high quality symptom gene associations, we utilized the phenomenon of some “Dual Phenotypes” (DP), such as obesity, fever, back pain, and vertigo, which are not only regarded as diseases, but also as symptoms in medical fields. The associated genes of symptoms with DP characteristics can be directly derived from the disease-gene associations with high quality assurance. To identify these kinds of phenotype terms with DP characteristics, we utilized the hierarchical tree codes (eg C08: respiratory tract diseases and C08.618.248: cough) from MeSH31 terminology to relate the disease terms in our dataset. First, we collected 1051 symptom terms whose MeSH tree codes start with C23.888. Second, we extracted the disease term list and symptom term list from DSA, respectively, and identified the DP symptoms by intersecting the 2 lists. After obtaining the union set of the aforementioned 2 symptom lists, we curated 1278 symptoms with distinct UMLS CUIs. Then, by intersecting the CUIs from the diseases in the integrated disease-gene associations, we obtained 505 symptoms with the DP characteristics, from which we finally curated 18 270 high quality symptom-gene associations (Supplementary Material S1) between these 505 symptoms and 4549 genes. In addition, to curate a more comprehensive symptom-gene benchmark dataset, we further collected the symptom-gene associations derived from the SEMMED32 database, which offered semantic predictions from the titles and abstracts of PubMed33 literatures. We extracted the gene-related semantic predictions about symptom terminologies

Figure 1. An overview of LSGER method. First, by integrating disease-symptom, disease-gene, and protein-protein associations (a), a heterogeneous symptom-
related network (b) was constructed. Then, the network embedding algorithm was applied to obtain a low-dimensional vector representation of nodes (c). Finally,
the relevance between the symptom and gene nodes can be measured by the similarities of their vectors (d). By sorting predicted genes by relevance, the candi-
date genes of given symptoms can be identified.
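The final step of the overview, scoring gene nodes against a symptom node by vector similarity and sorting, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `cosine` and `rank_genes`, the gene labels, and the 3-dimensional vectors are all made up for the example, and real embeddings would come from a network embedding algorithm.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_genes(symptom_vec, gene_vecs, top_k=None):
    """Return (gene, score) pairs sorted by descending cosine similarity."""
    scored = [(gene, cosine(symptom_vec, vec)) for gene, vec in gene_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Hypothetical toy embeddings (not taken from the paper).
toy_symptom = [1.0, 0.0, 1.0]
toy_genes = {
    "GENE_A": [1.0, 0.0, 1.0],  # same direction as the symptom vector
    "GENE_B": [0.0, 1.0, 0.0],  # orthogonal to the symptom vector
    "GENE_C": [1.0, 1.0, 0.0],  # partially aligned
}
ranking = rank_genes(toy_symptom, toy_genes)
```

Restricting `gene_vecs` to the genes of diseases related to the symptom before calling `rank_genes` would correspond to a candidate pre-selection step, while passing all genes corresponds to no pre-selection.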

Figure 2. A flow chart of data collection and integration. First, 87 442 disease-symptom associations were collected by integrating disease-symptom associations
from the DO, HPO and Orphanet databases. We collected and integrated 196 397 disease-gene associations from the DisGeNet and Malacards databases. Then,
we selected a set of 1278 symptoms with DP characteristics from the MeSH database and the integrated associations. Finally, a benchmark dataset of 18 270
symptom-gene associations was curated.
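The integration flow can be sketched with plain set operations, assuming every term has already been mapped to a common identifier. The sketch below is illustrative only: `derive_benchmark` and the `X_...` identifiers are hypothetical, and the real pipeline works on UMLS CUIs and MeSH tree codes from the source databases.

```python
def derive_benchmark(mesh_symptoms, disease_symptom, disease_gene):
    """Derive symptom-gene pairs for terms that act as both disease and symptom.

    mesh_symptoms:   set of identifiers flagged as symptoms by their MeSH tree code
    disease_symptom: iterable of (disease_id, symptom_id) pairs
    disease_gene:    iterable of (disease_id, gene) pairs
    """
    dsa_diseases = {d for d, _ in disease_symptom}
    dsa_symptoms = {s for _, s in disease_symptom}
    # Terms that occur both as a disease and as a symptom in the associations.
    dual_in_dsa = dsa_diseases & dsa_symptoms
    # Union of the MeSH-derived and association-derived symptom lists.
    candidate_symptoms = set(mesh_symptoms) | dual_in_dsa
    # Keep candidates that also appear as diseases with known genes.
    benchmark = {(d, g) for d, g in disease_gene if d in candidate_symptoms}
    dp_symptoms = {d for d, _ in benchmark}
    return dp_symptoms, benchmark


# Hypothetical toy inputs (identifiers are placeholders, not real CUIs).
toy_mesh = {"X_PAIN", "X_COUGH"}
toy_dsa = [("X_DIABETES", "X_OBESITY"), ("X_OBESITY", "X_PAIN"), ("X_FLU", "X_COUGH")]
toy_dga = [("X_OBESITY", "G1"), ("X_PAIN", "G2"), ("X_DIABETES", "G3"), ("X_PAIN", "G4")]
dp_symptoms, benchmark = derive_benchmark(toy_mesh, toy_dsa, toy_dga)
```

In the toy data, `X_COUGH` is a symptom term but has no disease-gene record, so it contributes no benchmark pairs, while `X_DIABETES` has a gene but is never a symptom and is excluded.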

and finally obtained 50 907 symptom-gene associations (called SPSG) between 932 symptoms and 9382 genes.

Fisher-based statistics model for symptom gene prediction
Based on the Fisher exact test,34 we proposed a Fisher-based statistical model to predict symptom genes (FSGER) as a baseline method. Based on the symptom-disease and disease-gene associations, we considered the diseases as a bridge to connect symptoms and genes. In detail, for symptom s and gene g, we defined a, b, c and d to represent the number of diseases associated with s and g, associated with s but not g, associated with g but not s and associated with neither s nor g, respectively. The relevance $\mathrm{Rel}(s, g)$ between the symptom s and the gene g can be defined as follows:

$$\mathrm{Rel}(s, g) = 1 - \frac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{a!\,b!\,c!\,d!\,n!}$$

where n represents the number of all the related diseases. Then, by ranking the predicted genes by the relevance, the ranking gene lists of given symptoms can be obtained.

Heterogeneous symptom-related network embedding representation
Network embedding representation learning35 is an effective algorithm for learning the low-dimensional feature vectors of the nodes in a given network, and it can effectively preserve the local and global structure information of the network. Network embedding representation methods are applicable in many tasks, such as visualization, label classification and link prediction.35 In this study, we constructed a heterogeneous symptom-related network, and applied the network embedding algorithm node2vec35 to obtain the low-dimensional vector representation of the nodes in the network.

As a well-known algorithm for network embedding representation, the main idea of node2vec is to learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. In detail, for a given network $G = (V, E)$, the aim of node2vec is to learn the mapping function $f: V \to \mathbb{R}^d$ (parameter d is the number of feature dimensions) from nodes to feature representations. By applying the Skip-Gram architecture to the network,36,37 the objective function can be optimized by maximizing the log-probability of observing the network neighborhood $N_S(u)$ for node u conditioned on its feature representation as follows:

$$\max_f \sum_{u \in V} \log \Pr(N_S(u) \mid f(u))$$

For the node $u \in V$, its network neighborhoods $N_S(u)$ can be generated through a neighborhood sampling strategy S. The authors of node2vec proposed a biased random walk strategy, which can flexibly and efficiently explore the diverse neighborhoods of nodes. Given a source node u, the random walk of fixed length i can be simulated, and node $c_i$ (that is, the i-th node in the random walk, and $c_0 = u$) was generated by the distribution function:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z} & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases}$$

where $\pi_{vx}$ is the unnormalized transition probability between nodes v and x, and Z is the normalizing constant. By applying the 2 standard assumptions, conditional independence and symmetry in the feature space, the low-dimensional vector features of nodes can be measured using stochastic gradient ascent over the model.

We constructed 2 heterogeneous networks, SDGNet, which integrated symptom-disease and disease-gene associations and SDGPNet, which integrated symptom-disease, disease-gene, and protein-protein associations. Given a heterogeneous network $G = (V, E)$, V and E represented the nodes and edges of the network. Then, we applied the network embedding representation algorithm to learn the LVR of nodes. Finally, the node v can be mapped to a low-dimensional vector $N_v$.

LVR-based similarity prediction model to identify symptom genes
We can obtain the LVR of the nodes in the given network based on a heterogeneous network embedding representation algorithm. The low-dimensional vector features of nodes fused the local structure (neighbor of nodes) and global structure information of the network. Then, we proposed a LVR-based similarity model for symptom gene prediction (LSGER). The relevance between the symptom and gene nodes can be measured by the similarities of their low-dimensional vectors. Mathematically, given the symptom node $v_s$ and the gene node $v_g$, we can measure the relevance $\mathrm{Rel}(v_s, v_g)$ between them by calculating the LVR-based cosine similarity $\cos(N_{v_s}, N_{v_g})$ of their vectors $N_{v_s}$ and $N_{v_g}$ as follows:

$$\mathrm{Rel}(v_s, v_g) = \cos(N_{v_s}, N_{v_g}) = \frac{N_{v_s} \cdot N_{v_g}}{|N_{v_s}| \times |N_{v_g}|}$$

By calculating and sorting the correlations between query symptom and all candidate genes, we can obtain a ranking list of candidate genes for the query symptom. Otherwise, for the symptom $v_s$, we designed a pre-selection strategy of candidate genes: selecting the genes of diseases related to $v_s$ as candidate gene pool and compared to no-selection strategy: selecting all genes as a candidate gene pool. Based on the 2 strategies, the 2 variants LSGER-AG (all genes) and LSGER-DG (with filtered disease gene) of LSGER algorithm were proposed.

Experimental setting and evaluation
We constructed 2 benchmark datasets of symptom-gene associations (BDSG and SPSG), which can be used to evaluate the prediction performance of different algorithms. In the experiment, we removed all the known genes of the symptoms in the benchmark dataset and predicted the candidate genes of every test symptom, which indicated that there were not any priori symptom-gene associations for all the prediction algorithms. Our method was compared to the baseline algorithms FSGER and PRINCE.1 Foremost, the PRINCE was proposed by Vanunu et al38 to predict disease genes. Li et al1 extended the PRINCE and applied it to the task of symptom genes prediction. In their work, a network propagation method was used in the PPI network to obtain priority scores of candidate genes. The FSGER algorithm is a Fisher-based statistics model that connected disease-symptom and disease-gene associations for symptom genes prediction.

We adopted precision (PR), recall (RE), F1-score (F1),39 association precision (AP) and area under curve (AUC) as the evaluation metrics. Given a test symptom set S with m symptoms, for every test symptom $s \in S$, $T(s)$ represents the test gene set of symptom s. Given a ranking list of predicted genes, we selected the top i genes $R_i(s)$ of the ranking list ($i = 3, 10$) as candidate genes. The precision, recall and F1-score for TOP@i can be defined as follows:
$$\mathrm{Precision} = \frac{1}{m} \sum_{s \in S} \frac{|T(s) \cap R_i(s)|}{|R_i(s)|}$$

$$\mathrm{Recall} = \frac{1}{m} \sum_{s \in S} \frac{|T(s) \cap R_i(s)|}{|T(s)|}$$

$$\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

The recall was calculated in the top 3 or 10 candidate genes, which may lead to low recall values. Since we used the same mode of calculating the recall, it is fair for all the prediction algorithms. In addition, for every test symptom s, the top k genes $R_k(s)$ of ranking list were also selected (k equals the number of test genes of symptom s). The association precision can be defined as follows:

$$\mathrm{AP} = \frac{\sum_{s \in S} |T(s) \cap R_k(s)|}{\sum_{s \in S} |R_k(s)|}$$

In addition, we also used the AUC to evaluate the prediction performance. For every test symptom, we selected the top 100 predicted genes as candidate genes and obtained the predicted scores of symptom-candidate genes pairs. Then, we ranked all the symptom-candidate gene pairs by the scores and calculated the AUC values. Compared to the AUC calculation of homogeneous network in link prediction tasks, the AUC calculation in this study may lead to the inapposite AUC of prediction results. Hence, the AUC evaluation is only a supplement to the other metrics.

RESULTS
LVR-based similarity model to predict symptom genes
For LSGER, we compared it to the PRINCE and FSGER algorithms. We adopted precision, recall, F1-score for TOP@3 and TOP@10, association precision and AUC as evaluation metrics. For LSGER-AG and LSGER-DG algorithms, we used 2 heterogeneous networks, SDGNet and SDGPNet, as test networks.

First, the experimental results (Table 1) on the BDSG dataset show that, compared to the baseline algorithm PRINCE (AP = 0.525; PR = 0.506 and RE = 0.202 for TOP@3), the FSGER algorithm achieved slightly better performance: AP improved by 2.10%; PR and RE improved by 20.55% and 17.33%, respectively, for TOP@3. The LSGER-AG with SDGPNet yielded the best performance: compared to PRINCE, AP improved by 37.71%; AUC improved by 21.60%; PR and RE improved by 66.80% and 53.96%, respectively, for TOP@3. Second, the LSGER algorithm with SDGPNet obtained slightly higher performance than did the SDGNet (LSGER-AG: PR and RE improved by 1.69% and 3.67%, respectively, for TOP@3; LSGER-DG: PR and RE improved by 1.58% and 3.32%, respectively, for TOP@3), which indicated that the fusion of more gene-related information (PPI network) improved prediction performance of LSGER algorithm. Finally, in terms of precision and recall for TOP@3, both LSGER-AG and LSGER-DG had similar prediction performance. However, in terms of AP, the prediction performance of LSGER-DG was better than that of LSGER-AG (with SDGNet: AP improved by 6.31%; with SDGPNet: AP improved by 9.54%), which indicated the candidate gene pre-selection improved the prediction performance of the LSGER algorithm.

Furthermore, we have performed the comparative experiments with different similarity metrics in the supplementary materials (SM). We have selected 3 classical similarity metrics, cosine similarity (Sim_cos), Euclidean distance similarity (Sim_eu) and Pearson similarity (Sim_pea), to measure the vector similarities of symptom and gene nodes. The results predicted by LSGER-AG algorithm with the SDGNet and SDGPNet networks indicated that different similarity metrics had some degree of influence on the prediction performance of our algorithm. For example, in terms of precision (PR) and recall (RE) for TOP@3, the prediction algorithm with Sim_pea (PR = 0.852; RE = 0.314), Sim_eu (PR = 0.871; RE = 0.318) and Sim_cos (PR = 0.844; RE = 0.311) obtained similar performances on recall but different results on precision measure. In the SM section, we also compared the performance of symptom-gene prediction algorithms on the SPSG dataset. The prediction results indicated that the LSGER-DG with SDGPNet still obtained the best performance: compared to the PRINCE algorithm, the recall and F1-score improved by 35.32% and 64.24%, respectively. Compared to the BDSG dataset with highly credible symptom-gene associations, the prediction associations offered by the SEMMED had a low confidence. Therefore, the evaluation results on the BDSG dataset can be of greater value than those on the SPSG dataset. From the above, our method had a higher performance than other prediction algorithms.

Case study: candidate genes of some typical symptoms
To illustrate the performance of prediction algorithm, we showed the prediction performance using LSGER-AG with SDGPNet of several typical symptoms (Table 2), including constipation (CUI: C0009806), nausea (CUI: C0027497), pain (CUI: C0030193), Usher syndromes (C1568248), vision disorders (C0042790), and aphasia (C0003537), which are regarded as DP symptom terms. The top 10 candidate genes of these symptoms were also listed (Table 3), and the bold genes in the table are the known genes of these symptoms. For example, for constipation, the top 9 candidate genes are the known genes (PR = 0.9 for TOP@10). In addition, for the candidate genes (Table 3) of pain, we found 9 benchmark genes and the remaining gene ZNF470 (rank = 5) was related to amyotrophic lateral sclerosis (ALS).40 We searched HPO3 and found that pain is one of the typical symptoms of ALS. Therefore, ZNF470 might be a novel gene for pain.

We further evaluated the predicted genes of pain by additional validations from PPI interactions and genetic functional analysis. In particular, we extracted the interaction of the top 49 predicted genes of pain in the context of the whole PPI network and showed the interaction map of them (Figure 3a), which includes 36 benchmark genes and 13 novel candidate genes. There are dense interactions (95 interactions) between those benchmark genes and the novel candidate genes compared to the interactions with random controls (p-value = 6.82e-68), which indicated that the novel genes are located close to benchmark genes in the PPI network. Further enrichment analysis (Gene Ontology and Pathway) of the pain predicted genes obtained similar results (Figure 3b). For example, there are 9 candidate genes and 11 known genes on the neuroactive ligand-receptor interaction pathway (p-value = 9.90E-15). Therefore, additional analysis indicated that there exist heavy interactions among the candidate genes and known genes of pain, which partially validate the rationality of the prediction results.

To fully evaluate the candidate genes that were not recorded in the BDSG dataset, we manually searched the recently published biomedical papers to verify the novel candidate genes. For example, for the novel candidate genes of Usher syndromes (PR = 0.7 for TOP@10), we found that Jaworek et al41 confirmed the locus (chromosome 10p11.21-q21.1) of USH1K gene (rank = 3) associated with type 1 Usher syndrome. The candidate gene USH1H (rank = 4) is likely to associate with the Usher syndrome as well, which was investigated by Dad et al.42 In addition, for all 4 novel candidate genes in the top 10 gene list of vision disorders, we found positive validations
Table 1. The performance comparison of symptom gene prediction algorithms

                                       TOP@3                          TOP@10
Network   Algorithm   AP      AUC      Precision  Recall  F1-score    Precision  Recall  F1-score
–         PRINCE      0.525   0.736    0.506      0.202   0.211       0.420      0.371   0.296
–         FSGER       0.536   0.564    0.610      0.237   0.252       0.486      0.422   0.344
SDGNet    LSGER-AG    0.745   0.890    0.830      0.300   0.327       0.719      0.572   0.488
SDGNet    LSGER-DG    0.792   0.856    0.821      0.301   0.327       0.693      0.561   0.473
SDGPNet   LSGER-AG    0.723   0.895    0.844      0.311   0.338       0.719      0.576   0.489
SDGPNet   LSGER-DG    0.792   0.853    0.834      0.311   0.336       0.698      0.568   0.478

The bold values represent best performance for each metrics (e.g. AUC, precision and recall). AP represents association precision.
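The relative improvements quoted for these results follow from the Table 1 values as (new − baseline) / baseline. A short check, with the helper name `relative_improvement` and the dictionary layout chosen only for this illustration:

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values copied from Table 1 (TOP@3 precision and recall, AP, AUC).
prince = {"AP": 0.525, "AUC": 0.736, "PR@3": 0.506, "RE@3": 0.202}
fsger = {"AP": 0.536, "AUC": 0.564, "PR@3": 0.610, "RE@3": 0.237}
lsger_ag_sdgpnet = {"AP": 0.723, "AUC": 0.895, "PR@3": 0.844, "RE@3": 0.311}

lsger_vs_prince = {
    metric: round(relative_improvement(lsger_ag_sdgpnet[metric], prince[metric]), 2)
    for metric in prince
}
fsger_vs_prince = {
    metric: round(relative_improvement(fsger[metric], prince[metric]), 2)
    for metric in prince
}
```

Note that FSGER improves on PRINCE in AP, precision and recall but has a lower AUC (0.564 vs 0.736), so its AUC entry in `fsger_vs_prince` is negative.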

Table 2. The prediction performance of some specific symptoms

                                     Number of hit      TOP@3                          TOP@10
ID  Symptom (CUI)                    genes/test genes   Precision  Recall  F1-score    Precision  Recall  F1-score
1   Constipation (C0009806)          109/158            1.000      0.019   0.037       0.900      0.057   0.107
2   Nausea (C0027497)                11/17              1.000      0.176   0.300       0.800      0.471   0.593
3   Pain (C0030193)                  54/79              1.000      0.038   0.073       0.900      0.114   0.202
4   Usher syndromes (C1568248)       15/18              0.667      0.111   0.190       0.700      0.389   0.500
5   Vision disorders (C0042790)      6/6                1.000      0.500   0.667       0.600      1.000   0.750
6   Aphasia (C0003537)               5/9                0.333      0.111   0.167       0.600      0.667   0.632
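The per-symptom TOP@i values in Table 2 can be reproduced from a ranked gene list and a test gene set. The sketch below uses made-up gene labels arranged to mimic the nausea row (17 test genes, all of the top 3 and 8 of the top 10 being hits, which is what precisions of 1.000 and 0.800 imply); `top_k_metrics` is a hypothetical helper, not code from the paper.

```python
def top_k_metrics(ranked_genes, test_genes, k):
    """Precision, recall and F1-score of the top-k ranked genes for one symptom."""
    top = ranked_genes[:k]
    hits = len(set(top) & set(test_genes))
    precision = hits / len(top) if top else 0.0
    recall = hits / len(test_genes) if test_genes else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder labels: T* are test (known) genes, N* are non-test genes.
test_genes = [f"T{i}" for i in range(17)]
ranked = ["T0", "T1", "T2", "N0", "T3", "T4", "N1", "T5", "T6", "T7"]

p3, r3, f3 = top_k_metrics(ranked, test_genes, 3)
p10, r10, f10 = top_k_metrics(ranked, test_genes, 10)
```

Because recall divides by the full test set, a symptom with many known genes (for example 158 for constipation) has a low TOP@3 recall even when every top-ranked gene is a hit.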

Table 3. The top 10 candidate genes of some specific symptoms

Rank  Constipation  Nausea      Pain        Usher syndromes  Vision disorders  Aphasia
      (C0009806)    (C0027497)  (C0030193)  (C1568248)       (C0042790)        (C0003537)
1     SEMA3C        ETFDH       PROKR1      USH1G            TSEN54            LOC643387
2     NRTN          ETFB        PON3        PDZD7            TSEN2             ATP1A2
3     GPBAR1        ETFA        PNOC        USH1K            TSEN34            PSNP2
4     HMBS          LPL         DAO         USH1H            TTPA              GRN
5     MLNR          HMBS        ZNF470      USH1E            CLN6              GRIN2A
6     DUOX2         IFNA2       HTR3B       CIB2             ATXN7             ADA2
7     TRHR          COQ4        NTSR1       CDH23            CNGA3             REEP1
8     MLN           SLC7A7      TRPA1       MT-TS2           GRM6              MAPT
9     SCN11A        ACADM       UNC13A      USH1C            PRPH2             L1CAM
10    CELIAC8       TNF         BDKRB1      WHRN             NR2E3             NOTCH3

The genes with bold fonts represent the candidate genes that are known genes of the corresponding symptoms.

from recent independent publications. For example, Gootwine et al43 verified that the achromatopsia can be caused by the CNGA3 (rank = 7) mutations. Furthermore, the remaining 3 candidate genes GRM6 (rank = 8), PRPH2 (rank = 9) and NR2E3 (rank = 10) were likely to associate with the subtypes of vision disorders, such as night blindness,44 visual acuity,45 and enhanced S-cone syndrome.46

DISCUSSION
In real-world clinical settings, symptoms always play an essential role in both diagnosis and treatment of diseases. Symptoms are the most directly observable manifestations of a disease.47 Therefore, the investigation of the underlying molecular mechanism of symptoms has the potential to propel the refinement of disease taxonomy48 for precision medicine. In this study, we constructed a benchmark dataset of symptom-gene associations and proposed a heterogeneous symptom-related network embedding prediction algorithm for symptom gene prediction. The experimental results indicated our algorithm achieved a significant improvement over the state-of-the-art method. The heterogeneous symptom-related network embedding prediction algorithm that we proposed can make full use of multiple symptom-related information (eg symptom-disease, disease-gene and protein-protein associations).

In particular, we integrated the symptom-disease and disease-gene associations to curate a benchmark dataset of symptom-gene associations, which can be used to evaluate the performance of the proposed novel symptom gene prediction algorithms. By systematic checking of the symptom terms (more details in SM), we curated a high-quality prediction dataset that contains 17 479 symptom-candidate genes between 461 symptoms and 3620 genes (Supplementary Material S2). The benchmark and prediction datasets of symptom-gene associations can also be used to further investigate the symptom-related molecular mechanisms in experimental settings. However, due to the lasting period of curation efforts, the general “temporal” lag from state-of-the-art publications exists in most biomedical knowledge databases (eg UMLS and SEMMED). To address the limitation, we conducted the latest literature manual validation to evaluate reliability of the candidate genes.

Figure 3. PPI interaction and genetic functional analysis of the predicted genes of pain. We extracted the interaction of the top 49 predicted genes of pain in the
context of the whole protein-protein interaction (PPI) network and showed interaction matrix of them (a), which includes 36 known genes (ie benchmark genes)
and 13 novel candidate genes. There are dense interactions (95 interactions) between those benchmark genes and the novel candidate genes compared to those
with random controls (p-value = 6.82e-68), which indicated that the novel genes were located close to benchmark genes in the PPI network. Further pathway and
Gene Ontology (termed GO) enrichment analysis of the pain predicted genes obtained similar results. The bold and underlined genes are known and candidate
genes of pain, respectively.
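
The comparison with random controls described in the Figure 3 caption can be illustrated with a small permutation test. The caption does not state which statistical test produced the reported p-value, so the Python sketch below is only an assumption about one reasonable procedure: it counts the interactions between known and candidate genes and compares that count with the counts obtained for random gene sets of the same size. The function names, the toy gene identifiers, and the choice of background genes are hypothetical and are not taken from the study.

```python
import random


def count_cross_edges(edges, set_a, set_b):
    """Count edges with one endpoint in set_a and the other in set_b."""
    return sum(
        1
        for u, v in edges
        if (u in set_a and v in set_b) or (u in set_b and v in set_a)
    )


def permutation_p_value(edges, known, candidates, n_perm=1000, seed=0):
    """Empirical p-value for the known-candidate interaction count.

    Random controls are gene sets of the same size as `candidates`,
    drawn from all network genes that are not known genes (an assumption).
    """
    rng = random.Random(seed)
    known, candidates = set(known), set(candidates)
    background = sorted({g for edge in edges for g in edge} - known)
    observed = count_cross_edges(edges, known, candidates)
    hits = 0
    for _ in range(n_perm):
        sample = set(rng.sample(background, len(candidates)))
        if count_cross_edges(edges, known, sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Toy network (hypothetical identifiers): K* known, C* candidate, R* other genes.
edges = [
    ("K1", "C1"), ("K2", "C1"), ("K3", "C2"), ("K1", "C2"), ("K1", "K2"),
    ("R1", "R2"), ("R2", "R3"), ("R3", "R4"), ("R4", "R5"), ("R5", "K3"),
]
known = {"K1", "K2", "K3"}
candidates = {"C1", "C2"}

observed, p = permutation_p_value(edges, known, candidates)
print(f"observed interactions: {observed}, empirical p-value: {p:.4f}")
```

An empirical p-value from a permutation test cannot be smaller than 1/(n_perm + 1), so a value as small as the one reported in the caption would require a parametric test or a very large number of permutations.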

Furthermore, the experimental results indicated that fusing more information can improve prediction performance. Therefore, we will consider more heterogeneous data, such as gene ontology and expression data, in future work. The symptom terms that were extracted from the UMLS database have hierarchical structures. For example, as a high-level category, vision disorder is the hypernym of cataracts (CUI: C0086543), cortical blindness (CUI: C0155320), and night blindness (CUI: C0028077). We will extract and curate a symptom-gene benchmark with hierarchical structures, which can help us design a more reliable prediction algorithm. In addition, the symptom terms from the MeSH database are of high quality but limited in number. Therefore, we need to collect further symptom terms from the “Clinical Finding” category of SNOMED49 to expand our dataset. However, the curation of a high-quality symptom-gene benchmark dataset will always be a systematic task that needs to be performed continuously. The semantic prediction of SEMMED would be a high-quality resource to curate the benchmark dataset with wide symptom coverage.

CONCLUSION
Symptom-gene identification is a primary step towards understanding the molecular mechanism of symptoms and refining the disease taxonomy in precision medicine. In this study, we curated a benchmark dataset of 18 270 symptom-gene associations and proposed a heterogeneous symptom-related network embedding representation algorithm for symptom gene prediction. We compared our method with the baseline algorithms (FSGER and PRINCE), and the results indicated that our algorithm achieved a significant improvement. We also curated a high-quality prediction dataset of 17 479 symptom-candidate gene associations that covers 461 symptoms and 3620 genes. The analysis of the candidate genes of typical symptoms indicated that the prediction results can support investigation of the underlying molecular mechanisms of symptoms in experimental settings.

FUNDING
The work is partially supported by the National Key Research and Development Program (2017YFC1703506), the Fundamental Research Funds for the Central Universities (2017YJS057, 2017JBM020), the Special Programs of Traditional Chinese Medicine (201407001, JDZX2015170 and JDZX2015171), and the National Key Technology R&D Program (2013BAI02B01 and 2013BAI13B04).

COMPETING INTERESTS
None.

CONTRIBUTORS
X.Z. conceived and designed the research. K.Y. performed the experiments, analyzed the data, and drafted the manuscript; N.W., G.L. and R.W. were involved in the data curation and analysis; X.Z., J.C., J.Y. and R.Z. revised the manuscript. All authors read and approved the final manuscript.

SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.

REFERENCES
1. Li X, Zhou X, Peng Y, et al. Network based integrated analysis of phenotype-genotype data for prioritization of candidate symptom genes. Biomed Res Int 2014; 2014: 435853.
2. Hofmannapitius M, Alarcónriquelme ME, Chamberlain C, et al. Towards the taxonomy of human disease. Nature Reviews Drug Discovery 2015; 14 (2): 75–6.
3. Köhler S, Vasilevsky NA, Engelstad M, et al. The human phenotype ontology in 2017. Nucleic Acids Res 2017; 45 (D1): D865–76.
4. Kibbe WA, Arze C, Felix V, et al. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res 2015; 43 (D1): D1071.
5. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Human Mutation 2012; 33 (5): 803–8.
6. Lupski JR, Stankiewicz P. Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS Genet 2005; 1 (6): e49.
7. Zhou H, Skolnick J. A knowledge-based approach for predicting gene-disease associations. Bioinformatics 2016; 32 (18): 2831–8.
8. Zeng X, Liao Y, Liu Y, et al. Prediction and validation of disease genes using HeteSim Scores. IEEE/ACM Trans Comput Biol Bioinf 2017; 14 (3): 687.
9. Locke AE, Kahali B, Berndt SI, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 2015; 518 (7538): 197–206.
10. de Heer EW, Have MT, Hwj VM, et al. Pain as a risk factor for common mental disorders. Results from the Netherlands Mental Health Survey and Incidence Study-2: a longitudinal, population-based study. Pain 2018; 159: 712–8.
11. Mccarty CA, Chisholm RL, Chute CG, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics 2011; 4 (1): 1–11.
12. Stover PJ, Harlan WR, Hammond JA, et al. PhenX: a toolkit for interdisciplinary genetics research. Curr Opin Lipidol 2010; 21 (2): 136–40.
13. Jyotishman P, Pan H, Wang J, et al. Evaluating phenotypic data elements for genetics and epidemiological research: experiences from the eMERGE and PhenX Network Projects. AMIA Jt Summits Transl Sci Proc 2011; 2011: 41–5.
14. Le DH, Dang VT. Ontology-based disease similarity network for disease gene prediction. Vietnam J Comp Sci 2016; 3 (3): 1–9.
15. Calvo B, López-Bigas N, Furney SJ, et al. A partially supervised classification approach to dominant and recessive human disease gene prediction. Comp Methods Progr Biomed 2007; 85 (3): 229–37.
16. Jiang R. Walking on multiple disease-gene networks to prioritize candidate genes. J Mol Cell Biol 2015; 7 (3): 214.
17. Gonzalezperez S, Pazos F, Chagoyen M. Factors affecting interactome-based prediction of human genes associated with clinical signs. BMC Bioinformatics 2017; 18 (1): 340.
18. Ada Hamosh AFS, Amberger JS, Bocchini CA, Victor A. McKusick Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005; 33 (1): 514–7.
19. Pinero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015; 2015 (0): bav028.
20. Rappaport N, Twik M, Plaschkes I, et al. MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res 2017; 45 (D1): D877–87.
21. Keshava Prasad TS, Goel R, Kandasamy K, et al. Human Protein Reference Database–2009 update. Nucleic Acids Res 2009; 37 (Database): D767.
22. Chatraryamontri A, Breitkreutz BJ, Oughtred R, et al. The BioGRID interaction database: 2015 update. Nucleic Acids Res 2015; 43 (Database issue): D470.
23. Orchard S, Ammari M, Aranda B, et al. The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2014; 42: 358–63.
24. Apweiler R. Activities at the universal protein resource (UniProt). Nucleic Acids Res 2014; 42 (11): 7486.
25. Gutierrez-Sacristan A, Grosdidier S, Valverde O, et al. PsyGeNET: a knowledge platform on psychiatric disorders and their genes. Bioinformatics 2015; 31 (18): 3075–3077.
26. Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014; 42 (Database issue): 980–5.
27. Welter D, Macarthur J, Morales J, et al. The NHGRI GWAS catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014; 42 (Database issue): 1001–6.
28. Peter DA, Grondin MC, Robin J, et al. The Comparative Toxicogenomics Database: update 2013. Nucleic Acids Res 2011; 39 (Database issue): 1067–72.
29. Menche J, Sharma A, Kitsak M, et al.; Disease networks. Uncovering disease-disease relationships through the incomplete interactome. Science 2015; 347 (6224): 1257601.
30. Cowley MJ, Pinese M, Kassahn KS, et al. PINA v2.0: mining interactome modules. Nucleic Acids Res 2012; 40 (Database issue): 862–5.
31. Lipscomb CE. Medical Subject Headings (MeSH). Bull Med Libr Assoc 2000; 88 (3): 265.
32. Kilicoglu H, Fiszman M, Rodriguez A, et al. Semantic MEDLINE: a web application for managing the results of PubMed searches. Proc Smbm. 2008: 69–76.
33. Wheeler DL, Church DM, Lash AE, et al. Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res 2002; 30 (1): 13–16.
34. Fisher RA. On the interpretation of χ2 from contingency tables, and the calculation of P. J R Stat Soc 1922; 85 (1): 87–94.
35. Grover A, Leskovec J. Node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA, USA. 2016: 855–864.
36. Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. arXiv 2013. (https://arxiv.org/abs/1301.3781v3)
37. Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online learning of social representations. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA. 2014: 701–710.
38. Vanunu O, Magger O, Ruppin E, et al. Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol 2010; 6 (1): e1000641.
39. Billsus D, Pazzani MJ. Learning collaborative information filters. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA, USA. 1998: 46–54.
40. Bauer J, Wendland J. Candida albicans Sfl1 suppresses flocculation and filamentation. Eukaryotic Cell 2007; 6 (10): 1736–1744.
41. Jaworek TJ, Bhatti R, Latief N, et al. USH1K, a novel locus for type I Usher syndrome, maps to chromosome 10p11.21-q21.1. J Hum Genet 2012; 57 (10): 633–637.
42. Dad S, Østergaard E, Thykjaer T, et al. Identification of a novel locus for a USH3 like syndrome combined with congenital cataract. Clin Genet 2010; 78 (4): 388–397.
43. Gootwine E, Ofri R, Banin E, et al. Safety and efficacy evaluation of rAAV2tYF-PR1.7-hCNGA3 vector delivered by subretinal injection in CNGA3 mutant achromatopsia sheep. Hum Gene Ther Clin Dev 2017; 28: 96–107.
44. Ma NG, Ad UI, et al. Mutations in GRM6 identified in consanguineous Pakistani families with congenital stationary night blindness. Mol Vis 2015; 21: 1261–1271.
45. Chowers I, Tiosano L, Audo I, et al. Adult-onset foveomacular vitelliform dystrophy: a fresh perspective. Prog Retinal Eye Res 2015; 47: 64–85.
46. Kuniyoshi K, Hayashi T, Sakuramoto H, et al. New truncation mutation of the NR2E3 gene in a Japanese patient with enhanced S-cone syndrome. Jpn J Ophthalmol 2016; 60 (6): 476–485.
47. Zhou XZ, Menche J, Barabási A, et al. Human symptoms–disease network. Nat Commun 2014; 5: 4212.
48. Zhou X, Lei L, Liu J, et al. A systems approach to refine disease taxonomy by integrating phenotypic and molecular networks. EBioMedicine 2018; 31: 79–91.
49. Donnelly K. SNOMED-CT: the advanced terminology and coding system for eHealth. Stud Health Technol Inform 2006; 121 (121): 279.
