Abstract: Studying the similarity of diseases can help us to explore the pathological characteristics of complex diseases, and help provide reliable reference information for inferring the relationship between new diseases and known diseases, so as to develop effective treatment plans. To obtain the similarity of diseases, most previous methods either use a single similarity metric, such as a semantic score or a functional score from a single data source, or utilize weighting coefficients to simply combine multiple metrics with different dimensions. In this paper, we propose a method to predict the similarity of diseases by node representation learning. We first integrate the semantic score and topological score between diseases by combining multiple data sources. Then, for each disease, its integrated scores with all other diseases are utilized to map it into a vector of the same spatial dimension, and the vectors are used to measure and comprehensively analyze the similarity between diseases. Lastly, we conduct comparative experiments based on the benchmark set and on other disease nodes outside the benchmark set. Evaluating multiple methods with statistics such as the average, variance, and coefficient of variation on the benchmark set demonstrates the effectiveness of our approach in the prediction of similar diseases.

Index Terms: Disease similarity, disease prediction, representation learning, graph.

I. INTRODUCTION
A. Disease Similarity
STUDYING the similarities between diseases can help us predict disease genes and infer disease-associated ncRNAs [1], [2]. It has been playing an important role in understanding the pathogenesis of complex diseases, in the early prevention and diagnosis of major diseases [3]–[5], and in providing reliable reference information for the development and safety assessment of new drugs [6]–[8]. In addition, since similar diseases might have similar drug targets, searching for similar diseases could help drug repositioning [9], which is instrumental in reducing the cost of drug development and clinical trials.

Similar diseases are mainly manifested in the following aspects: clinically, they have significantly similar phenotypes such as symptoms and signs; genetically, the same gene causes different mutations; and at the molecular level, they have the same molecular pathway [10]. Therefore, the existence of similar pathogenic mechanisms, similar treatment schemes and the same specific targeted signaling pathways can be used as a theoretical basis to determine the similarities in diseases [11].

In addition to the disease characteristics, the calculation of disease similarity also relies heavily on disease-related biomedical data, which has greatly promoted the development of new bioinformatics technologies. Most previous research chose rich biological ontology data as the source of disease terminology for the calculation of disease similarity, such as the Disease Ontology (DO) [12], the Human Phenotype Ontology (HPO), and the Gene Ontology (GO) [11], [11]. DO provides biomedical information related to human diseases, including the concepts of rational descriptions, phenotypic characteristics, and related medical vocabularies. HPO provides phenotypes and standardized vocabularies that can also reflect human diseases, and its similarity features can help define criteria for the classification of similar diseases. For example, some semantic-based methods [13]–[15] utilize the structure and the semantic grammar of disease terms (such as DO and MeSH [16]) to calculate the similarity between diseases; many function-based methods [17]–[19] take advantage of gene functional associations [20], [21] to enhance the measurement of disease similarity. For instance, BOG [22], proposed by Mathur and Dinakarpandian, is based on the similarity of overlapping gene sets between DO diseases. However, these methods rely too much on ontologies when calculating disease similarity. Although bio-ontology terms can provide accurate descriptions of disease concepts and their semantic relationships, they are not applicable to all situations, as not all diseases have characteristics of the ontological structure. For diseases without an ontology, the previously mentioned methods cannot be applied. Using different data sources unrelated to the ontology to calculate disease similarity can solve this problem.

However, using different data sources for the task is difficult and uncommon. Many challenges exist, including the lack of comprehensive consideration and evaluation of the similarity measurement. For instance, most methods for the

Manuscript received April 27, 2020; accepted May 5, 2020. Date of publication May 25, 2020; date of current version July 1, 2020. This work was supported in part by NSFC under Grant 61873288, in part by the National Science Foundation (NSF) under Grant 1815256 and Grant 1744661, and in part by the Hunan Key Laboratory for Internet of Things in Electricity under Grant 2019TP1016. This article was presented in part at the 2019 IEEE International Conference on Bioinformatics and Biomedicine. (Corresponding author: Yibo Chen.)
Jianliang Gao, Ling Tian, and Jianxin Wang are with the School of Computer Science and Engineering, Central South University, Changsha 410083, China.
Yibo Chen is with the Information and Communication Branch, State Grid Hunan Electric Power Company Ltd., Changsha 410000, China (e-mail: chenyibo8224@gmail.com).
Bo Song and Xiaohua Hu are with the College of Computing & Informatics, Drexel University, Philadelphia, PA 19104 USA.
Digital Object Identifier 10.1109/TNB.2020.2994983
1536-1241 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Vasavi College of Engineering. Downloaded on September 21,2023 at 13:18:29 UTC from IEEE Xplore. Restrictions apply.
572 IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 19, NO. 3, JULY 2020
GAO et al.: SIMILAR DISEASE PREDICTION WITH HETEROGENEOUS DISEASE INFORMATION NETWORKS 573
similarity: simRel and funSim. simRel is used to compare the biological processes of the evolution of different substances, and funSim is used to measure the functionally related gene products in the relevant genome. Although bio-ontology terms can provide accurate descriptions of disease concepts and their semantic relationships for the calculation of disease similarity, it should be considered that not all diseases have an ontology. For diseases that do not have an ontology, these methods cannot predict similar diseases.

Methods based on function similarity [20] compare the disease-related gene sets. Zhang [21] described the construction of the expanded human disease network (eHDN) by combining the available disease gene information in GAD with the protein-protein interaction network. The process-similarity based (PSB) method [17] involves the associations based on GO terms for calculating disease similarity. Mathur proposed a method named BOG [22], which computes disease similarity using MetaMap to map the disease annotation of Swissprot protein entries to DO terms, and estimates the similarity between diseases using co-annotation and the DO semantic hierarchy. However, existing relationships between diseases and genes are rare, and the formation of many diseases is not even related to genes. Moreover, the above-mentioned methods use only a single similarity metric to calculate the similarity between diseases, whereas combining various data sources and multiple metrics could obtain more comprehensive and meaningful results.

For the semantic- and function-based methods, Cheng [18] proposed the SemFunSim method, which combines the semantic similarity and the functional similarity. The SemSim (Semantic Similarity) is obtained by utilizing semantic association from DO, while the FunSim (Functional Similarity) is calculated from a weighted network of the human gene function association. The SemFunSim is obtained in the end by combining both SemSim and FunSim. However, the values of different dimensions are simply weighted by coefficients for their combination as the similarity of the diseases, which could result in a lack of objectivity in the prediction of similar diseases.

For the methods based on topology, InfoFlowSim [19] and NetSim [24] take into account the protein-protein interaction network (PPIN) and the disease-gene network to discover disease-disease associations. They apply information flow from disease to PPIN and random walk with restart (RWR), respectively. However, since PPIN and disease-related genes are not fully available, there is still room for improvement in topology-based methods. RADAR [10] proposed a framework to learn representations of diseases that capture both their semantic and structural identities for the calculation of disease similarity. Although the method combines semantic and topological information, it cannot calculate disease similarity across multiple disease networks with real-time updates when a new data source is added.

In order to solve the above-mentioned problems, we propose a new approach for measuring disease similarity that combines semantic similarity and topological similarity, and constructs a disease similarity network by mapping multiple metrics of all diseases to the same spatial dimension.

III. METHOD AND ALGORITHM
A. System Overview
In order to achieve the optimal prediction of similar diseases with enhanced support from multiple data sources, we establish a similarity scoring function that reflects comprehensive information from both the semantic and structural aspects of various disease and biomedical entity interaction networks. Semantic characteristics and topological features are quantified and integrated in our overall scoring function to guide the calculation process and facilitate the final predictions.

The overall framework of our approach is shown in Fig. 1. The first step is to construct multiple Heterogeneous Disease Information Networks (HeDINs) from raw data sources, such as the relationship between disease and chemical or pathway. The second step is to obtain, from the constructed HeDINs, the semantic score according to the meta paths between diseases as well as the topological characteristics of each disease, and to combine both to jointly build a Homogeneous Disease Similarity Network (HoDSN) composed of only disease nodes. The third step is to convert the disease nodes in the HoDSN into dense vectors with the same spatial dimension by representation learning [28]. Finally, in the vector space of the same dimension, the vector representation of each node is used to calculate and predict the similarities among diseases.

B. Transforming HeDINs Into HoDSN
A heterogeneous information network contains multiple types of objects and links, where the links represent different types of connections and relationships between objects [29]. From various disease-related data sources that contain multiple types of objects (e.g. disease, pathway, and chemicals) and their relationships, we are able to construct the HeDINs accordingly. The HeDINs can be regarded as undirected graphs, in which diseases and biomedical entities (chemical, pathway, etc.) are represented by different types of nodes, and the relationships between diseases and different biomedical entities represent different semantic paths. For example, the disease-chemical network $G_B = \{V_B^{d}, V_B^{c}, E_B\}$ includes a node set of diseases $V_B^{d}$ and a node set of chemicals $V_B^{c}$, and $E_B$ represents a set of relationships between the disease nodes and the chemical nodes. A more detailed definition of heterogeneous networks and their construction can be found in PathSim [30], which we use for reference as we construct the HeDINs from disease-chemical and disease-pathway data sources.

For multiple HeDINs, we obtain the set of diseases that appear in all heterogeneous networks by applying the intersection operation on all disease node sets to filter and extract the common diseases that help construct a HoDSN.

In order to transform multiple HeDINs to a HoDSN, we first define two kinds of scores: the semantic score M(x, y) and the topology score T(x, y), where x and y represent any two different disease nodes throughout this paper. For the semantic score M(x, y), we calculate it through meta paths, which have the following definition:
Fig. 1. The framework of predicting similar diseases. Firstly, heterogeneous disease information networks (HeDINs), with separate node symbols for diseases, pathways, and chemicals, are constructed from raw data resources. In this example, there are two HeDINs, GA and GB. The corresponding subgraphs of these HeDINs are obtained by filtering vertices. Then topological scores and semantic scores are calculated in these heterogeneous subgraphs using the Dynamic Time Warping (DTW) algorithm and the meta path method, respectively. In this way, we transform multiple HeDINs to a Homogeneous Disease Similarity Network (HoDSN) with different weight values on the edges. Finally, the disease nodes can be embedded according to the weight values, and the similarity between diseases can then be calculated using these n-dimensional vectors.
Definition 1 (Meta Path): A meta path p is a path defined by the path length and the types of nodes and edges on the graph.

For example, (“disease” → “chemical” → “disease”) is a meta path with length of three. Intuitively, the semantics underneath different paths imply different similarities. The semantic score M(x, y) is defined as:

$$M(x, y) = \frac{2 \ast |\{p_{x \to y} \mid p_{x \to y} \in P\}|}{|\{p_{x \to x} \mid p_{x \to x} \in P\}| + |\{p_{y \to y} \mid p_{y \to y} \in P\}|} \tag{1}$$

where $p_{x \to y}$ is a meta path instance between disease x and disease y, $p_{x \to x}$ is a meta path instance between disease x and disease x, and P represents the set of pre-defined meta paths between disease nodes. M(x, y) refers to a similarity score between diseases based on semantics.

Secondly, topological features in the graph are usually measured by the degree sequence of the nodes. Traditional approaches to sequence similarity calculation usually apply measures such as the Euclidean distance, but these cannot handle the case of sequences with different lengths. Dynamic Time Warping [31] is a very effective algorithm that can measure the similarity between two sequences with different lengths.

We define the topology score T(x, y) as follows:

$$T(x, y) = e^{-\left(\sum_{i=0}^{\tau} \beta^{i} \ast DTW(\Gamma_i(x), \Gamma_i(y))\right)} \tag{2}$$

where $\Gamma_i(\cdot)$ refers to the degree sequence of the i-th hop neighbor nodes, and i = 0 represents the node itself. β is a parameter that indicates the weight of the neighbor nodes of different hops. $DTW(\Gamma_i(x), \Gamma_i(y))$ represents the distance between the degree sequences of the disease nodes, computed with the Dynamic Time Warping algorithm.

Then, we integrate both the semantic and topology scores into the integrated score, which is defined as follows:

$$S(x, y) = \alpha \ast M(x, y) + (1 - \alpha) \ast T(x, y) \tag{3}$$

where α ∈ [0, 1] is a parameter to adjust the contribution of each similarity score, M and T, towards the integrated similarity score S.

Finally, we transform multiple incomplete, non-weighted HeDINs into a HoDSN with different weights on the edges. The disease similarity network is an information network with only disease type nodes and can be thought of as a complete undirected graph, where the weight of the edge between each disease node is generated by the integrated score we calculated
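As a minimal sketch of how Eqs. (1)-(3) could be evaluated on one HeDIN, the following Python code makes several assumptions that are not stated in the text: only the single meta path disease → entity → disease is used (so path instances reduce to shared neighbors), DTW is the plain quadratic-time dynamic program with an absolute-difference cost rather than the linear-time variant of [31], an empty hop ring is treated as the sequence [0], and the values of τ, β and α are illustrative. All function names and the toy network are hypothetical.

```python
import math
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from (disease, entity) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def semantic_score(adj, x, y):
    """Eq. (1) for the single meta path disease -> entity -> disease."""
    paths_xy = len(adj[x] & adj[y])   # instances x -> e -> y
    paths_xx = len(adj[x])            # instances x -> e -> x
    paths_yy = len(adj[y])            # instances y -> e -> y
    total = paths_xx + paths_yy
    return 2.0 * paths_xy / total if total else 0.0


def hop_degree_sequences(adj, start, tau):
    """Sorted degrees of the nodes exactly i hops from start, for i = 0..tau."""
    dist = {start: 0}
    queue = deque([start])
    rings = [[] for _ in range(tau + 1)]
    while queue:
        node = queue.popleft()
        d = dist[node]
        rings[d].append(len(adj[node]))
        if d == tau:
            continue  # do not expand beyond the last hop
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = d + 1
                queue.append(nb)
    return [sorted(ring) for ring in rings]


def dtw(a, b):
    """Plain O(len(a) * len(b)) DTW distance with |a_i - b_j| as local cost."""
    a = a or [0]  # assumption: an empty ring behaves like a single zero
    b = b or [0]
    cost = [[math.inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[-1][-1]


def topology_score(adj, x, y, tau=2, beta=0.5):
    """Eq. (2): exp of the negative, hop-weighted sum of DTW distances."""
    seq_x = hop_degree_sequences(adj, x, tau)
    seq_y = hop_degree_sequences(adj, y, tau)
    total = sum(beta ** i * dtw(seq_x[i], seq_y[i]) for i in range(tau + 1))
    return math.exp(-total)


def integrated_score(adj, x, y, alpha=0.5, tau=2, beta=0.5):
    """Eq. (3): convex combination of the semantic and topology scores."""
    return alpha * semantic_score(adj, x, y) + (1 - alpha) * topology_score(adj, x, y, tau, beta)


# Toy disease-chemical network (hypothetical data).
adj = build_adjacency([("d1", "c1"), ("d1", "c2"), ("d2", "c1"), ("d2", "c2"), ("d3", "c3")])
print(integrated_score(adj, "d1", "d2"), integrated_score(adj, "d1", "d3"))
```

In this toy network d1 and d2 share both chemicals and have identical neighborhoods, so M, T and S all equal 1, whereas d1 and d3 share no chemical, so M is 0 and S reduces to (1 − α) ∗ T.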
TABLE I
DATASETS
from 0.1 to 0.9. To discuss the effectiveness of the proposed method, we utilize the disease pairs in the benchmark to evaluate our method. We set the range of prediction to 1%, 5%, 10% and 20%, and then compute the correct rate of disease prediction in the different ranges.

C. Evaluation Results
1) Parameter w for Multiple Disease Information Networks: We define the parameters w and 1 − w to indicate the weights of the chemical and pathway networks on diseases, respectively. We conduct the experiments based on the 65 disease pairs in the benchmark. Receiver operating characteristic (ROC) curves are then drawn with the benchmark set against 50 random sets. Each random set contains 650 randomly selected pairs. The experimental results are shown in Fig. 2.

In Fig. 2(a), each color represents a parameter value of w that varies between [0.1, 0.9]. Each column value in Fig. 2(b) is obtained from the area under the ROC curve (AUC). From Fig. 2, we can see that, with the parameter value changing from 0.1 to 0.9, the AUC value tends to decrease gradually, but the overall AUC value can reach 72% or more. When the parameter $w_1$ equals 0.1 and $w_2$ equals 0.9, the correct rate on the benchmark calculated by our proposed method of disease similarity can reach 85%. In our experiment comparing each single data source, the difference between the similarity results obtained from the chemical-disease and pathway-disease data sources alone is relatively large. Therefore, we can conclude from Fig. 2 that, by combining two or more data sources, the proposed method tends to have more stable results.

2) Parameter α for Integrated Similarity Score: In this research, the contributions of the semantic and topological scores to the integrated similarity score are defined by the parameters α and 1 − α. We set the parameter α to change from 0.1 to 0.9, and the experimental results are shown in Fig. 3.

In Fig. 3, each color represents a parameter value that varies between [0.1, 0.9]. Each column value in Fig. 3(b) is obtained from the area under the ROC curve (AUC). From Fig. 3, we can see that in the results of combining two
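The ROC-based evaluation described in subsection C.1 can be illustrated with a short Python sketch. It assumes that benchmark pairs act as positives and the pairs of each random set as negatives, that the AUC is computed as the probability that a positive pair outscores a negative pair (ties counted as one half), and that the per-set AUC values are averaged; how the random sets are drawn and aggregated is not specified in the text, so `sample_random_sets`, `mean_auc` and the toy scores are hypothetical.

```python
import random


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sample_random_sets(candidate_pairs, n_sets=50, set_size=650, seed=0):
    """Draw n_sets random sets of disease pairs, each without replacement."""
    rng = random.Random(seed)
    return [rng.sample(candidate_pairs, set_size) for _ in range(n_sets)]


def mean_auc(score, benchmark_pairs, random_sets):
    """Average AUC of the benchmark set against each random set."""
    pos = [score(pair) for pair in benchmark_pairs]
    aucs = [auc(pos, [score(pair) for pair in rs]) for rs in random_sets]
    return sum(aucs) / len(aucs)


# Toy example: two benchmark pairs against one random set of two pairs.
toy_scores = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.1, ("c", "d"): 0.85}
toy_auc = mean_auc(toy_scores.get, [("a", "b"), ("a", "c")], [[("b", "c"), ("c", "d")]])
print(toy_auc)
```

Repeating `mean_auc` for each value of w (or α) on a grid from 0.1 to 0.9 would give one AUC value per parameter setting, which is the kind of quantity plotted as columns in Fig. 2(b) and Fig. 3(b).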
[8] L. Cheng et al., “DisSim: An online system for exploring significant similar diseases and exhibiting potential therapeutic drugs,” Sci. Rep., vol. 6, no. 1, pp. 30024–30030, Jul. 2016.
[9] P. Ni, J. Wang, P. Zhong, Y. Li, F. Wu, and Y. Pan, “Constructing disease similarity networks based on disease module theory,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 17, no. 3, pp. 906–915, May/Jun. 2020.
[10] R. Qin, L. Duan, H. Zheng, J. Li-Ling, K. Song, and X. Lan, “RADAR: Representation learning across disease information networks for similar disease detection,” in Proc. IEEE Int. Conf. Bioinf. Biomed. (BIBM), Dec. 2018, pp. 482–487.
[11] M. Ashburner et al., “Gene ontology: Tool for the unification of biology,” Nature Genet., vol. 25, no. 1, pp. 25–29, 2000.
[12] L. M. Schriml et al., “Disease ontology: A backbone for disease semantic integration,” Nucleic Acids Res., vol. 40, no. D1, pp. D940–D946, Jan. 2012.
[13] P. Resnik, “Using information content to evaluate semantic similarity in a taxonomy,” in Proc. 14th Int. Joint Conf. Artif. Intell., 1995, pp. 448–453.
[14] D. Lin et al., “An information-theoretic definition of similarity,” in Proc. ICML. Los Alamitos, CA, USA: Citeseer, 1998, pp. 296–304.
[15] S. Bandyopadhyay and K. Mallick, “A new path based hybrid measure for gene ontology similarity,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 1, pp. 116–127, Jan. 2014.
[16] C. E. Lipscomb, “Medical subject headings (MeSH),” Bull. Med. Library Assoc., vol. 88, no. 3, pp. 265–266, Jul. 2000.
[17] S. Mathur and D. Dinakarpandian, “Finding disease similarity based on implicit semantic similarity,” J. Biomed. Informat., vol. 45, no. 2, pp. 363–371, Apr. 2012.
[18] L. Cheng, J. Li, P. Ju, J. Peng, and Y. Wang, “SemFunSim: A new method for measuring disease similarity by integrating semantic and gene functional association,” PLoS ONE, vol. 9, no. 6, pp. 1–11, 2014.
[19] M. B. Hamaneh and Y.-K. Yu, “Relating diseases by integrating gene associations and information flow through protein interaction network,” PLoS ONE, vol. 9, no. 10, pp. 1–14, 2014.
[20] K. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A. L. Barabási, “The human disease network,” Proc. Nat. Acad. Sci. USA, vol. 104, no. 21, pp. 8685–8690, 2007.
[21] X. Zhang et al., “The expanded human disease network combining protein–protein interaction information,” Eur. J. Hum. Genet., vol. 19, no. 7, pp. 783–788, Jul. 2011.
[22] S. Mathur and D. Dinakarpandian, “Automated ontological gene annotation for computing disease similarity,” Summit Transl. Bioinf., vol. 2010, pp. 12–16, Aug. 2010.
[23] J. Gao, L. Tian, T. Lv, J. Wang, B. Song, and X. Hu, “Protein2Vec: Aligning multiple PPI networks with representation learning,” IEEE/ACM Trans. Comput. Biol. Bioinf., early access, Aug. 27, 2019, doi: 10.1109/TCBB.2019.2937771.
[24] P. Li, Y. Nie, and J. Yu, “Fusing literature and full network data improves disease similarity computation,” BMC Bioinf., vol. 17, no. 1, pp. 326–339, Dec. 2016.
[25] G. Yu, L.-G. Wang, G.-R. Yan, and Q.-Y. He, “DOSE: An R/Bioconductor package for disease ontology semantic and enrichment analysis,” Bioinformatics, vol. 31, no. 4, pp. 608–609, Feb. 2015.
[26] D. Wang, J. Wang, M. Lu, F. Song, and Q. Cui, “Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases,” Bioinformatics, vol. 26, no. 13, pp. 1644–1650, Jul. 2010.
[27] A. Schlicker, F. S. Domingues, J. Rahnenführer, and T. Lengauer, “A new measure for functional similarity of gene products based on gene ontology,” BMC Bioinf., vol. 7, no. 1, pp. 302–318, Dec. 2006.
[28] Y. Wang, Y. Yao, H. Tong, F. Xu, and J. Lu, “A brief review of network embedding,” Big Data Mining Analytics, vol. 2, no. 1, pp. 35–47, Mar. 2019.
[29] C. Sun, Q. Li, L. Cui, H. Li, and Y. Shi, “Heterogeneous network-based chronic disease progression mining,” Big Data Mining Analytics, vol. 2, no. 1, pp. 25–34, Mar. 2019.
[30] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta path-based top-K similarity search in heterogeneous information networks,” Proc. VLDB Endowment, vol. 4, no. 11, pp. 992–1003, Aug. 2011.
[31] S. Salvador and P. Chan, “Toward accurate dynamic time warping in linear time and space,” Intell. Data Anal., vol. 11, no. 5, pp. 561–580, Oct. 2007.
[32] S. Suthram, J. T. Dudley, A. P. Chiang, R. Chen, T. J. Hastie, and A. J. Butte, “Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets,” PLOS Comput. Biol., vol. 6, no. 2, pp. 1–10, 2010.
[33] S. Pakhomov, B. McInnes, T. Adam, Y. Liu, T. Pedersen, and G. B. Melton, “Semantic similarity and relatedness between clinical terms: An experimental study,” in Proc. AMIA Annu. Symp., 2010, pp. 572–576.
[34] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C.-F. Chen, “A new method to measure the semantic similarity of GO terms,” Bioinformatics, vol. 23, no. 10, pp. 1274–1281, May 2007.