doi: 10.1093/jamia/ocy117
Advance Access Publication Date: 23 October 2018
Research and Applications
Corresponding Author: Xuezhong Zhou, PhD, Beijing Jiaotong University, No.3 Shangyuancun, Haidian District, Beijing,
100044 (xzzhou@bjtu.edu.cn)
Received 8 March 2018; Revised 24 July 2018; Editorial Decision 9 August 2018; Accepted 11 August 2018
ABSTRACT
Objective: Investigating the molecular mechanisms of symptoms is a vital task in precision medicine to refine disease taxonomy and improve the personalized management of chronic diseases. Although there are abundant experimental studies and computational efforts to obtain the candidate genes of diseases, the identification of symptom genes is rarely addressed. We curated a high-quality benchmark dataset of symptom-gene associations and proposed a heterogeneous network embedding method for identifying symptom genes.
Methods: We proposed a heterogeneous network embedding representation algorithm, which constructed a heterogeneous symptom-related network that integrated symptom-related associations and applied an embedding representation algorithm to obtain the low-dimensional vector representation of nodes. By measuring the relevance between symptoms and genes via the similarities of their vectors, the candidate genes of given symptoms can be obtained.
Results: A benchmark dataset of 18 270 symptom-gene associations between 505 symptoms and 4549 genes was curated. We compared our method to baseline algorithms (FSGER and PRINCE). The experimental results indicated our algorithm achieved a significant improvement over the state-of-the-art method, with precision and recall improved by 66.80% (0.844 vs 0.506) and 53.96% (0.311 vs 0.202), respectively, for TOP@3, and association precision improved by 37.71% (0.723 vs 0.525) over PRINCE.
Conclusions: The experimental validation of the algorithms and the literature validation of typical symptoms indicated our method achieved excellent performance. Hence, we curated a prediction dataset of 17 479 symptom-candidate genes. The benchmark and prediction datasets have the potential to promote investigations of the molecular mechanisms of symptoms and provide candidate genes for validation in experimental settings.
Key words: heterogeneous network embedding, symptom gene identification, network medicine
© The Author(s) 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com
Journal of the American Medical Informatics Association, 2018, Vol. 25, No. 11 1453
understanding the molecular mechanism of disease phenotypes.6–8 The investigation of the underlying molecular mechanisms of symptom phenotypes has rarely been addressed, except for disease conditions overlapping with symptom phenotypes, such as obesity9 and pain.10 In addition, to impel the study of genome and phenotypes, the U.S. National Human Genome Research Institute initiated 2 projects, eMERGE,11 which correlates whole genome scans with phenotype data
curated disease-gene associations between 13 074 diseases with UMLS code (CUI) and 8947 genes from the DisGeNet database, which integrates disease-gene associations from UniProt,24 PsyGeNET,25 ClinVar,26 Orphanet,5 the GWAS Catalog,27 CTD28 and HPO3 databases. Second, we collected 73 064 disease-gene associations between 6118 diseases with CUIs and 8370 genes from the Malacards database. To unify and integrate the disease terms, we
Figure 2. A flow chart of data collection and integration. First, 87 442 disease-symptom associations were collected by integrating disease-symptom associations
from the DO, HPO and Orphanet databases. We collected and integrated 196 397 disease-gene associations from the DisGeNet and Malacards databases. Then,
we selected a set of 1278 symptoms with DP characteristics from the MeSH database and the integrated associations. Finally, a benchmark dataset of 18 270
symptom-gene associations was curated.
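The integration step summarized in the caption can be illustrated with a minimal sketch, assuming that a symptom is linked to a gene whenever both are associated with the same disease identifier. The function name `derive_symptom_gene_pairs` and the toy identifiers are hypothetical, and the filtering of symptom terms described in the caption is not reproduced here.

```python
from collections import defaultdict

def derive_symptom_gene_pairs(disease_symptom, disease_gene):
    """Link each symptom to the genes of the diseases it is associated with.

    Both inputs are iterables of (disease_id, other_id) pairs. The shared
    disease identifier (for example a UMLS CUI) acts as the join key.
    This is an illustrative assumption, not the authors' exact pipeline.
    """
    genes_by_disease = defaultdict(set)
    for disease, gene in disease_gene:
        genes_by_disease[disease].add(gene)

    pairs = set()
    for disease, symptom in disease_symptom:
        # Diseases without any known gene contribute no pairs.
        for gene in genes_by_disease.get(disease, ()):
            pairs.add((symptom, gene))
    return pairs

# Toy, made-up identifiers used purely for illustration.
disease_symptom = [("D1", "S1"), ("D2", "S1"), ("D2", "S2")]
disease_gene = [("D1", "G1"), ("D2", "G2"), ("D3", "G3")]

pairs = derive_symptom_gene_pairs(disease_symptom, disease_gene)
print(sorted(pairs))  # [('S1', 'G1'), ('S1', 'G2'), ('S2', 'G2')]
```

Storing the result as a set removes duplicate pairs that arise when a symptom and a gene share more than one disease.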
and finally obtained 50 907 symptom-gene associations (called SPSG) between 932 symptoms and 9382 genes.
We constructed 2 heterogeneous networks: SDGNet, which integrated symptom-disease and disease-gene associations, and SDGPNet, which integrated symptom-disease, disease-gene, and protein-protein associations. Given a heterogeneous network G = (V, E), V and E represented the nodes and edges of the net-
feature space, the low-dimensional vector features of nodes can be measured using stochastic gradient ascent over the model.
Fisher-based statistics model for symptom gene prediction
Based on the Fisher exact test,34 we proposed a Fisher-based statisti-
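Because the description of the Fisher-based model is cut off here, the following is only a hedged sketch of how a one-sided Fisher exact test could score a single symptom-gene pair. The 2x2 table layout (diseases with or without the symptom, crossed with diseases with or without the gene) and the function name `fisher_exact_greater` are assumptions for illustration, not the authors' formulation.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration only):
      a = diseases with both the symptom and the gene
      b = diseases with the symptom but not the gene
      c = diseases with the gene but not the symptom
      d = diseases with neither
    Returns the probability of seeing at least `a` co-occurrences under the
    hypergeometric null distribution with all margins fixed.
    """
    row1 = a + b           # diseases with the symptom
    col1 = a + c           # diseases with the gene
    n = a + b + c + d      # all diseases
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)

p = fisher_exact_greater(3, 1, 1, 3)
print(round(p, 4))  # 0.2429
```

A smaller p-value indicates that the symptom and the gene co-occur across diseases more often than expected by chance, so candidate genes could be ranked by increasing p-value.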
[Table body not recovered; column groups: TOP@3, TOP@10.]
The bold values represent the best performance for each metric (e.g. AUC, precision and recall). AP represents association precision.
[Table body not recovered; columns: ID, Symptom (CUI), Number of hit genes/test genes, then Precision, Recall and F1-score under each of TOP@3 and TOP@10.]
The genes in bold font represent the candidate genes that are known genes of the corresponding symptoms.
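The TOP@3 and TOP@10 metrics reported in the tables can be illustrated with a short sketch that ranks genes by vector similarity to a symptom and then scores the top k candidates against a set of known genes. Cosine similarity, the conventional definitions of precision, recall and F1-score at k, the function names, and the toy vectors are all assumptions made for illustration; the evaluation protocol used in the study may differ.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_genes(symptom_vec, gene_vecs):
    """Return gene ids sorted by decreasing similarity to the symptom vector."""
    return sorted(gene_vecs,
                  key=lambda g: cosine(symptom_vec, gene_vecs[g]),
                  reverse=True)

def top_k_metrics(ranked, known, k):
    """Precision, recall and F1-score of the first k ranked genes."""
    hits = sum(1 for g in ranked[:k] if g in known)
    precision = hits / k
    recall = hits / len(known) if known else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy 2-dimensional embeddings with made-up identifiers.
symptom_vec = [1.0, 0.0]
gene_vecs = {"G1": [0.9, 0.1], "G2": [0.0, 1.0],
             "G3": [0.7, 0.7], "G4": [-1.0, 0.0]}

ranked = rank_genes(symptom_vec, gene_vecs)
print(ranked)                                  # ['G1', 'G3', 'G2', 'G4']
print(top_k_metrics(ranked, {"G1", "G2"}, 2))  # (0.5, 0.5, 0.5)
```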
from recent independent publications. For example, Gootwine et al43 verified that achromatopsia can be caused by CNGA3 (rank = 7) mutations. Furthermore, the remaining 3 candidate genes, GRM6 (rank = 8), PRPH2 (rank = 9) and NR2E3 (rank = 10), were likely to be associated with subtypes of vision disorders, such as night blindness,44 visual acuity,45 and enhanced S-cone syndrome.46
DISCUSSION
In real-world clinical settings, symptoms always play an essential role in both diagnosis and treatment of diseases. Symptoms are the most directly observable manifestations of a disease.47 Therefore, the investigation of the underlying molecular mechanism of symptoms has the potential to propel the refinement of disease taxonomy48 for precision medicine. In this study, we constructed a benchmark dataset of symptom-gene associations and proposed a heterogeneous symptom-related network embedding prediction algorithm for symptom gene prediction. The experimental results indicated our algorithm achieved a significant improvement over the state-of-the-art method. The heterogeneous symptom-related network embedding prediction algorithm that we proposed can make full use of multiple types of symptom-related information (eg symptom-disease, disease-gene and protein-protein associations).
In particular, we integrated the symptom-disease and disease-gene associations to curate a benchmark dataset of symptom-gene associations, which can be used to evaluate the performance of newly proposed symptom gene prediction algorithms. By systematic checking of the symptom terms (more details in SM), we curated a high-quality prediction dataset that contains 17 479 symptom-candidate genes between 461 symptoms and 3620 genes (Supplementary Material S2). The benchmark and prediction datasets of symptom-gene associations can also be used to further investigate the symptom-related molecular mechanisms in experimental settings. However, because curation takes a long time, a general “temporal” lag behind state-of-the-art publications exists in most biomedical knowledge databases (eg UMLS and SEMMED). To address this limitation, we manually validated the candidate genes against the latest literature to evaluate their reliability.
Furthermore, the experimental results indicated that fusing more information can improve prediction performance. Therefore, we will consider more heterogeneous data, such as gene ontology and expression data, in future work. The symptom terms that were extracted from the UMLS database have hierarchical structures. For example, as a high-level category, vision disorder is the hypernym of cataracts (CUI: C0086543), cortical blindness (CUI: C0155320), and night blindness (CUI: C0028077). We will extract and curate a symptom-gene benchmark with hierarchical structures, which can help us design a more reliable prediction algorithm. In addition, the symptom terms from the MeSH database are high-quality but limited in number. Therefore, we need to further collect the various symptom terms contained in the “Clinical Finding” category of SNOMED49 to expand our dataset. However, the curation of a high-quality symptom-gene benchmark dataset will always be a systematic task that needs to be performed continuously. The semantic prediction of SEMMED would be a high-quality resource to curate the benchmark dataset with wide symptom coverage.
CONCLUSION
Symptom-gene identification is a primary step towards understanding the molecular mechanism of symptoms and refining the disease taxonomy in precision medicine. In this study, we curated a benchmark dataset of 18 270 symptom-gene associations and proposed a heterogeneous symptom-related network embedding representation algorithm for symptom gene prediction. We compared our method to the baseline algorithms (FSGER and PRINCE), and the results indicated our algorithm achieved a significant improvement. We also curated a high-quality prediction dataset of 17 479 symptom-candidate genes that contain 461 symptoms and 3620 genes. The analysis results of the candidate genes of typical symptoms indicated that the prediction results have the potential to support investigation of the underlying molecular mechanisms of symptoms in experimental settings.
FUNDING
The work is partially supported by the National Key Research and Development Program (2017YFC1703506), the Fundamental Research Funds for the Central Universities (2017YJS057, 2017JBM020), the Special Programs of Traditional Chinese Medicine (201407001, JDZX2015170 and JDZX2015171), and the National Key Technology R&D Program (2013BAI02B01 and 2013BAI13B04).
COMPETING INTERESTS
None.
CONTRIBUTORS
X. Z conceived and designed the research. K. Y performed the experiments, analyzed the data, and drafted the manuscript; N. W, G. L and R. W were involved in the data curation and analysis; X. Z, J. C, J. Y and R. Z revised the manuscript. All authors read and approved the final manuscript.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
REFERENCES
1. Li X, Zhou X, Peng Y, et al. Network based integrated analysis of phenotype-genotype data for prioritization of candidate symptom genes. Biomed Res Int 2014; 2014: 435853.
2. Hofmann-Apitius M, Alarcón-Riquelme ME, Chamberlain C, et al. Towards the taxonomy of human disease. Nature Reviews Drug Discovery 2015; 14(2): 75–6.
26. Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014; 42 (Database issue): 980–5.
27. Welter D, Macarthur J, Morales J, et al. The NHGRI GWAS catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014; 42 (Database issue): 1001–6.
28. Peter DA, Grondin MC, Robin J, et al. The Comparative Toxicogenomics