
IEEE TRANSACTIONS ON NANOBIOSCIENCE, VOL. 19, NO. 3, JULY 2020

Similar Disease Prediction With Heterogeneous Disease Information Networks
Jianliang Gao, Ling Tian, Jianxin Wang, Yibo Chen, Bo Song, and Xiaohua Hu

Manuscript received April 27, 2020; accepted May 5, 2020. Date of publication May 25, 2020; date of current version July 1, 2020. This work was supported in part by NSFC under Grant 61873288, in part by the National Science Foundation (NSF) under Grant 1815256 and Grant 1744661, and in part by the Hunan Key Laboratory for Internet of Things in Electricity under Grant 2019TP1016. This article was presented in part at the 2019 IEEE International Conference on Bioinformatics and Biomedicine. (Corresponding author: Yibo Chen.)
Jianliang Gao, Ling Tian, and Jianxin Wang are with the School of Computer Science and Engineering, Central South University, Changsha 410083, China.
Yibo Chen is with the Information and Communication Branch, State Grid Hunan Electric Power Company Ltd., Changsha 410000, China (e-mail: chenyibo8224@gmail.com).
Bo Song and Xiaohua Hu are with the College of Computing & Informatics, Drexel University, Philadelphia, PA 19104 USA.
Digital Object Identifier 10.1109/TNB.2020.2994983
1536-1241 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Abstract: Studying the similarity of diseases can help us to explore the pathological characteristics of complex diseases, and help provide reliable reference information for inferring the relationship between new diseases and known diseases, so as to develop effective treatment plans. To obtain the similarity of diseases, most previous methods either use a single similarity metric, such as a semantic score or a functional score from a single data source, or utilize weighting coefficients to simply combine multiple metrics with different dimensions. In this paper, we propose a method to predict the similarity of diseases by node representation learning. We first integrate the semantic score and topological score between diseases by combining multiple data sources. Then, for each disease, its integrated scores with all other diseases are utilized to map it into a vector of the same spatial dimension, and the vectors are used to measure and comprehensively analyze the similarity between diseases. Lastly, we conduct comparative experiments based on a benchmark set and other disease nodes outside the benchmark set. Using statistics such as the average, variance, and coefficient of variation in the benchmark set to evaluate multiple methods demonstrates the effectiveness of our approach in the prediction of similar diseases.

Index Terms: Disease similarity, disease prediction, representation learning, graph.

I. INTRODUCTION
A. Disease Similarity

STUDYING the similarities between diseases can help us predict disease genes and infer disease-associated ncRNAs [1], [2]. It has been playing an important role in understanding the pathogenesis of complex diseases, in the early prevention and diagnosis of major diseases [3]–[5], and in providing reliable reference information in the development and safety assessment of new drugs [6]–[8]. In addition, since similar diseases might have similar drug targets, searching for similar diseases could help drug repositioning [9], which is instrumental in reducing the cost of drug development and clinical trials.

Similar diseases are mainly manifested in the following aspects: clinically, they have significantly similar phenotypes such as symptoms and signs; genetically, the same gene causes different mutations; and at the molecular level, they have the same molecular pathway [10]. Therefore, the existence of similar pathogenic mechanisms, similar treatment schemes and the same specific targeted signaling pathways can be used as a theoretical basis to determine the similarities in diseases [11].

In addition to the disease characteristics, the calculation of disease similarity also relies heavily on disease-related biomedical data, which has greatly promoted the development of new bioinformatics technologies. Most of the previous research mainly chose rich biological ontology data as the source of disease terminology for the calculation of disease similarity, such as the Disease Ontology (DO) [12], the Human Phenotype Ontology (HPO), and the Gene Ontology (GO) [11]. DO provides biomedical information related to human diseases, including the concepts of rational descriptions, phenotypic characteristics, and related medical vocabularies. HPO provides phenotypes and standardized vocabularies that can also reflect human diseases, and its similarity features can help define criteria for the classification of similar diseases. For example, some semantic-based methods [13]–[15] usually utilize the structure and the semantic grammar of disease terms (such as DO and MeSH [16]) to calculate the similarity between diseases; many function-based methods [17]–[19] take advantage of gene functional associations [20], [21] to enhance the measurement of disease similarity. For instance, the BOG method [22] proposed by Mathur and Dinakarpandian is based on the similarity of overlapping gene sets between DO diseases. However, these methods rely too much on ontologies when calculating disease similarity.

Although bio-ontology terms can provide accurate descriptions of disease concepts and their semantic relationships, they are not applicable to all situations, as not all diseases have the characteristics of an ontological structure. For diseases without an ontology, the methods mentioned above cannot be applied. Using different data sources unrelated to the ontology to calculate disease similarity can solve this problem.

However, using different data sources for the task is difficult and uncommon. There exist quite many challenges, including the lack of comprehensive consideration and evaluation of the similarity measurement. For instance, most methods for the


measurement of disease similarity leverage either associations of ontological disease concepts, interconnections of genes related to each disease, or protein-protein interaction networks [23], [24]. The measurements between a disease and all other diseases must be done in the same spatial dimension or the same metric to ensure the consistency and accuracy of the disease prediction. In order to solve the above problems and achieve the goal of making the results of disease prediction more accurate, we propose in this paper a method to calculate disease similarity by considering multiple data sources related to diseases at the same time. It maps all diseases into the same spatial dimension to construct a disease similarity network and predict similar diseases.

B. Motivation

How to predict similar diseases to a great extent is of great significance to the discovery of the interrelationship between diseases and the further study of diseases with similar symptoms in the field of biomedicine. Although many studies address disease similarity, two important issues still need to be addressed:

1) How to Utilize More Available Data Sources to Increase the Robustness of Disease Similarity Calculation: It is considered that, in addition to the symptoms, pathology, and characteristics of a disease, biomedical data related to the disease can also provide a reference for the measurement of similarity between diseases. Some methods mainly use the disease ontology to calculate disease semantic similarity, or calculate the functional similarity of diseases based on the interaction relationship between diseases and disease-related genes or the gene ontology. However, not all biological entities have ontologies, and the formation of many diseases is not related to genes, resulting in a lack of interaction information between diseases and genes. Other research works, while using ontology-independent biomedical data, only use a single data source to measure disease similarity. Unlike most previous methods, we integrate semantic similarity and network-based structural features by utilizing multiple biomedical data sources that are independent of the ontology to predict similar diseases and conduct a comprehensive evaluation.

2) How to Measure Similarity of Diseases More Objectively: The calculation of the similarity between each disease and all other diseases must be done in the same spatial dimension to ensure the accuracy of the disease prediction. Most previous methods either use a single similarity metric to measure the similarity between diseases, such as a semantic similarity score or a functional similarity score, or utilize weighting coefficients to simply combine multiple metrics with different nature or dimensions to obtain the similarity of the disease. This leads to a lack of objectivity in the prediction of similar diseases.

In order to solve this problem, our proposed method maps multiple metrics of all diseases to the same spatial dimension at the same time, using a multi-dimensional vector to represent each disease node. The value of each dimension in the vector has practical significance derived from the correlation between diseases, and the similarity between the vectors of multi-dimensional real values represents the similarity between diseases.

C. Contribution

To solve the above-mentioned problems, we propose a novel method that has the following main contributions:

• We take into account multiple ontology-independent biomedical data sources and combine the relationships between diseases and different biomedical entities for the calculation of disease similarity.
• Our method applies different similarity metrics and includes two calculation strategies. One is based on semantic similarity calculation in heterogeneous networks, where different meta paths can be expressed as different semantic relationships between diseases. The other is to exploit the topological characteristics in the networks of disease and biomedical entities while considering the multi-hop neighbor nodes.
• We transform disease nodes into multi-dimensional real-valued vectors, which allows more effective computation of similarities between different diseases in the same vector space. The value of each dimension is of practical significance and is derived from the correlation between the diseases.
• We design various sets of experiments based on a benchmark set and other disease nodes outside the benchmark set, by adjusting parameters between different data sources and similarity scores obtained from different calculation metrics. Our extensive experiments and evaluations demonstrate the effectiveness of the method we proposed in predicting similar diseases.

II. RELATED WORK

Prediction of similar diseases has received much attention in medical communities. The goal of predicting similar diseases is to obtain the diseases with the best similarity score from networks of multiple diseases and biomedical entities. In order to calculate the best similarity score between diseases, many previous methods proposed various algorithms. In this section, we review the related research on calculating the similarity of diseases. Existing methods can be classified into four categories.

Methods based on semantic similarity measure the relevance of the ontology terms related to the disease. DOSE [25] is an R package providing semantic similarity computations (doSim) among DO terms and genes, which allows biologists to explore the similarities of diseases and of gene functions from a disease perspective. Wang [26] presented a method, MISIM, for measuring miRNA functional similarity based on disease semantic similarity and construction of miRNA functional networks, and the semantic values of diseases were calculated based on the Directed Acyclic Graph (DAG) of corresponding diseases. Resnik [13] proposed a method based on classification of disease concept assessment. The method also utilizes probabilistic knowledge to construct a measure of disease semantic similarity calculations, which combines disease semantics with empirical probabilities to achieve higher accuracy. Schlicker proposed a method for comparing sets of GO terms and for assessing the functional similarity of gene products [27]. The method mainly relied on two measure metrics of disease


similarity: simRel and funSim. simRel is used to compare the biological processes of the evolution of different substances, and funSim is used to measure the functionally related gene products in the relevant genome. Although bio-ontology terms can provide accurate descriptions of disease concepts and their semantic relationships for the calculation of disease similarity, it should be considered that not all diseases have an ontology. For diseases that do not have an ontology, these methods cannot predict similar diseases.

Methods based on function similarity [20] compare the disease-related gene sets. Zhang [21] described the construction of the expanded human disease network (eHDN) by combining the available disease gene information in GAD with the protein-protein interaction network. The process-similarity based (PSB) method [17] involves the associations based on GO terms for calculating disease similarity. Mathur proposed a method named BOG [22], which computes disease similarity using MetaMap to map the disease annotation of Swissprot protein entries to DO terms, and estimates the similarity between diseases using co-annotation and the DO semantic hierarchy. However, existing relationships between diseases and genes are rare, and the formation of many diseases is not even related to genes. Moreover, the above-mentioned methods only use a single similarity metric to calculate similarity between diseases, whereas combining various data sources and multiple metrics could obtain more comprehensive and meaningful results.

For the semantic- and function-based methods, Cheng [18] proposed the SemFunSim method, which combines the semantic similarity and the functional similarity. SemSim (Semantic Similarity) is obtained by utilizing semantic association from DO, while FunSim (Functional Similarity) is calculated from a weighted network of the human gene function association. SemFunSim is obtained in the end by combining both SemSim and FunSim. However, the values of different dimensions are simply weighted by coefficients for their combination as the similarity of the diseases, which could result in a lack of objectivity in the prediction of similar diseases.

For the methods based on topology, InfoFlowSim [19] and NetSim [24] take into account the protein-protein interaction network (PPIN) and the disease-gene network to discover disease-disease associations. They apply information flow from disease to PPIN and random walk with restart (RWR), respectively. However, since PPIN and disease-related genes are not fully available, there is still room for improvement in topology-based methods. RADAR [10] proposed a framework to learn representations of diseases that captures both their semantics and structural identities for the calculation of disease similarity. Although the method combines semantic and topological information, it cannot calculate disease similarity in multiple disease networks with real-time updates when a new data source is added.

In order to solve the above-mentioned problems, we propose a new approach for measuring disease similarity that combines semantic similarity and topological similarity, and constructs a disease similarity network by mapping multiple metrics of all diseases to the same spatial dimension.

III. METHOD AND ALGORITHM
A. System Overview

In order to achieve the optimal prediction of similar diseases with enhanced support from multiple data sources, we establish a similarity scoring function that could reflect comprehensive information from both semantic and structural aspects of various disease and biomedical entity interaction networks. Semantic characteristics and topological features are quantified and integrated in our overall scoring function to guide the calculation process and facilitate the final predictions.

The overall framework of our approach is shown in Fig. 1. The first step is to construct multiple Heterogeneous Disease Information Networks (HeDINs) from raw data sources, such as the relationship between disease and chemical or pathway. The second step is to obtain, from the constructed HeDINs, the semantic score according to the meta path between diseases as well as the topological characteristics of each disease, and to combine both to jointly build a Homogeneous Disease Similarity Network (HoDSN) composed of only disease nodes. The third step is to convert the disease nodes in the HoDSN into dense vectors with the same spatial dimension by representation learning [28]. Finally, in the vector space of the same dimension, the vector representation of each node is used to calculate and predict the similarities among diseases.

B. Transforming HeDINs Into HoDSN

A heterogeneous information network contains multiple types of objects and links, where the links represent different types of connections and relationships between objects [29]. From various disease-related data sources that contain multiple types of objects (e.g., diseases, pathways, and chemicals) and their relationships, we are able to construct the HeDINs accordingly. The HeDINs can be regarded as undirected graphs, in which diseases and biomedical entities (chemical, pathway, etc.) are represented by different types of nodes, and the relationships between diseases and different biomedical entities represent different semantic paths. For example, the disease-chemical network G_B consists of a node set of diseases, a node set of chemicals, and an edge set E_B that represents the relationships between disease nodes and chemical nodes. A more detailed definition of heterogeneous networks and their construction can be found in PathSim [30], which we use for reference as we construct the HeDINs from the disease-chemical and disease-pathway data sources.

For multiple HeDINs, we obtain the set of diseases that appear in all heterogeneous networks by applying the intersect operation on all disease node sets to filter and extract the common diseases that help construct a HoDSN.

In order to transform multiple HeDINs into a HoDSN, we first define two kinds of scores: the semantic score M(x, y) and the topology score T(x, y), where x and y represent any two different disease nodes throughout this paper. For the semantic score M(x, y), we calculate it through the meta path, which has the following definition:


Fig. 1. The framework of predicting similar diseases. Firstly, heterogeneous disease information networks (HeDINs), in which distinct node symbols denote diseases, pathways, and chemicals, are constructed from raw data resources. In this example, there are two HeDINs, G_A and G_B. The corresponding subgraphs of these HeDINs are obtained by filtering vertices. Then topological scores and semantic scores are calculated in these heterogeneous subgraphs using the Dynamic Time Warping (DTW) algorithm and the meta path method, respectively. In this way, we transform multiple HeDINs into a Homogeneous Disease Similarity Network (HoDSN) with different weight values on the edges. Finally, the disease nodes can be embedded according to the weight values, and the similarity between diseases can then be calculated using these n-dimensional vectors.
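
The vertex-filtering step mentioned in the caption can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: each HeDIN is represented as a list of (disease, entity) association pairs, and the function names `common_diseases` and `filter_network` as well as the toy identifiers are hypothetical, not taken from the paper. The sketch intersects the disease node sets of all networks and keeps only the associations of the shared diseases.

```python
def common_diseases(networks):
    """Intersect the disease node sets of all HeDINs (assumed edge-list form)."""
    disease_sets = [{disease for disease, _ in edges} for edges in networks.values()]
    return set.intersection(*disease_sets)


def filter_network(edges, keep):
    """Keep only the associations whose disease belongs to the common set."""
    return [(disease, entity) for disease, entity in edges if disease in keep]


# Hypothetical toy association lists: (disease, entity) pairs.
networks = {
    "chemical": [("d1", "c1"), ("d2", "c1"), ("d3", "c2")],
    "pathway": [("d1", "p1"), ("d2", "p2"), ("d4", "p2")],
}

shared = common_diseases(networks)
subgraphs = {name: filter_network(edges, shared) for name, edges in networks.items()}
```

In this toy input, only the diseases present in both association lists survive, so every remaining disease can later be scored in every network.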

Definition 1 (Meta Path): A meta path p is a path defined by the path length and the types of nodes and edges on the graph.

For example, ("disease" → "chemical" → "disease") is a meta path with length of three. Intuitively, the semantics underneath different paths imply different similarities. The semantic score M(x, y) is defined as:

$$M(x, y) = \frac{2 \cdot |\{p_{x \to y} \mid p_{x \to y} \in P\}|}{|\{p_{x \to x} \mid p_{x \to x} \in P\}| + |\{p_{y \to y} \mid p_{y \to y} \in P\}|} \tag{1}$$

where $p_{x \to y}$ is a meta path instance between disease x and disease y, $p_{x \to x}$ is a meta path instance between disease x and itself, and P represents the set of pre-defined meta paths between disease nodes. M(x, y) refers to a semantics-based similarity score between diseases.

Secondly, topological features in the graph are usually measured by the degree sequence of the nodes. Traditional approaches to sequence similarity calculation usually apply measures such as the Euclidean distance, but these cannot handle sequences with different lengths. Dynamic Time Warping [31] is a very effective algorithm that can measure similarity between two sequences with different lengths. We define the topology score T(x, y) as follows:

$$T(x, y) = e^{-\left(\sum_{i=0}^{\tau} \beta^{i} \cdot DTW(\Delta_i(x), \Delta_i(y))\right)} \tag{2}$$

where $\Delta_i(\cdot)$ refers to the degree sequence of the i-th hop neighbor nodes, and i = 0 represents the node itself. β is a parameter that indicates the weight of the neighbor nodes of different hops. $DTW(\Delta_i(x), \Delta_i(y))$ represents the distance between the degree sequences of the two disease nodes, computed using the Dynamic Time Warping algorithm.

Then, we integrate both semantic and topology scores into an integrated score, which is defined as follows:

$$S(x, y) = \alpha \cdot M(x, y) + (1 - \alpha) \cdot T(x, y) \tag{3}$$

where α ∈ [0, 1] is a parameter to adjust the contribution of each similarity score M and T towards the integrated similarity score S.

Finally, we transform multiple incomplete, non-weighted HeDINs into a HoDSN with different weights on the edges. The disease similarity network is an information network with only disease type nodes and can be thought of as a complete undirected graph, where the weight of the edge between each pair of disease nodes is generated by the integrated score we calculated in the previous step.

$$W(x, y) = \sum_{k=1}^{|\mathcal{G}|} w_k \cdot S_{G_k}(x, y) \tag{4}$$

where $S_{G_k}(x, y)$ denotes a weight on the edge of two disease nodes in the k-th network $G_k \in \mathcal{G}$, and $w_k$ represents the proportion of the k-th network contributing to the weight of the link.

C. Node Embedding for Measuring Disease Similarity

The network representation learning algorithm can learn and generate a vector representation for each node by analyzing the connections between nodes in the complex information network, while effectively integrating the network structure and the external information of the nodes. Upon our constructed networks, we further obtain the vector representation of each node, which can be used for various network application tasks such as node similarity measurement.

In the process of vectorization of each node, our method orderly encodes disease nodes in the network. The purpose is to get the vector representation of each node according to the same permutation criterion, so that the value of each dimension in the multi-dimensional vector generated by each node can be compared under the same metric, giving a better similarity between any disease nodes in the same vector space. Thus, we embed each disease node as an n-dimensional vector, where n is the number of nodes in the homogeneous disease network.

Algorithm 1 Disease Similarity
Input:
  The number of HeDINs, N;
  The set of HeDINs, N_G = {G_k = (V_k, V_k, E) | k = 1, ..., N};
Output:
  Similarity matrix of HoDSN, Sim;
1: Initialize the set of disease nodes Node_dise = V_1;
2: for each G_k ∈ N_G do
3:   Node_dise = Node_dise ∩ V_k;
4: end for
5: V_k = Node_dise of each HeDIN;
6: for each G_k ∈ N_G do
7:   Calculate semantic score M and topological score T for each pair of diseases in Node_dise;
8:   Combine two kinds of score, integrated score S_{G_k} = α ∗ M + (1 − α) ∗ T;
9: end for
10: W = Σ_{k=1}^{N} w_k ∗ S_{G_k};
11: Connect any two disease nodes in Node_dise, the weight of edge is W_{xy};
12: Transform multiple HeDINs into HoDSN;
13: Learning node embeddings;
14: Sim ← similarity between all disease nodes;
15: return Sim;
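
The scoring and embedding steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: each HeDIN is an undirected adjacency dictionary; the only meta path is disease → entity → disease; the hop-i degree sequence is taken to be the sorted degrees of the nodes at exactly i hops; hop i is weighted by `beta ** i` (one reading of Eq. (2)); the diagonal W(i, i) is filled with S(x, x); and the final step uses the row-vector embedding and the 1 / (1 + Euclidean distance) similarity that the paper gives as Eqs. (5) and (6). All function names and the toy data are hypothetical.

```python
import math
from collections import deque


def dtw(a, b):
    """Dynamic Time Warping distance with |u - v| as the local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + step
    return cost[len(a)][len(b)]


def hop_degrees(graph, start, max_hop):
    """Sorted degrees of the nodes at exactly i hops from start, i = 0..max_hop."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hop:
            continue
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return [sorted(len(graph[n]) for n, d in dist.items() if d == i)
            for i in range(max_hop + 1)]


def semantic_score(graph, x, y):
    """Eq. (1) for the single meta path disease -> entity -> disease."""
    shared = len(graph[x] & graph[y])  # path instances from x to y
    return 2 * shared / (len(graph[x]) + len(graph[y]))


def topology_score(graph, x, y, tau=1, beta=0.5):
    """Eq. (2), assuming hop i is weighted by beta ** i."""
    dx, dy = hop_degrees(graph, x, tau), hop_degrees(graph, y, tau)
    return math.exp(-sum(beta ** i * dtw(dx[i], dy[i]) for i in range(tau + 1)))


def integrated_score(graph, x, y, alpha=0.8):
    """Eq. (3): alpha * M + (1 - alpha) * T."""
    return alpha * semantic_score(graph, x, y) + (1 - alpha) * topology_score(graph, x, y)


def similarity_matrix(graphs, weights, diseases, alpha=0.8):
    """Eq. (4) edge weights W, row embeddings, and Euclidean-based Sim."""
    W = [[sum(w * integrated_score(g, x, y, alpha) for g, w in zip(graphs, weights))
          for y in diseases] for x in diseases]
    n = len(diseases)
    sim = [[1 / (1 + math.dist(W[i], W[j])) for j in range(n)] for i in range(n)]
    return W, sim


def undirected(edges):
    """Build an adjacency dictionary of sets from (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


# Hypothetical toy HeDINs restricted to the common diseases d1, d2, d3.
chem = undirected([("d1", "c1"), ("d1", "c2"), ("d2", "c1"), ("d2", "c2"), ("d3", "c3")])
path = undirected([("d1", "p1"), ("d2", "p1"), ("d3", "p2")])
diseases = ["d1", "d2", "d3"]
W, sim = similarity_matrix([chem, path], [0.1, 0.9], diseases)
```

In the toy data, d1 and d2 share all of their chemicals and pathways, so their rows of W coincide and their similarity is maximal, while d3 shares nothing with either and scores lower. The quadratic-time DTW and the all-pairs loop are chosen for readability; a network with thousands of diseases would need a more efficient implementation.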
For any disease node $v_i$, its vector representation is:

$$v_i = (W(i, 1), W(i, 2), \ldots, W(i, n)) \tag{5}$$

where W(i, j) denotes the weight value from disease node i to j. The value of each dimension in the vector is generated by the weight of the edge in the HoDSN.

Since each node can be represented by a vector in the same dimensional space, the similarity between disease nodes can be calculated by a similarity measurement between vectors, such as the cosine distance, the Euclidean distance, and so on. The similarity score between disease nodes can hence be derived from the distance between two vectors in n-dimensional space. Our method adopts the Euclidean distance on vectors to calculate the similarity of diseases with the following equation:

$$Sim(x, y) = \frac{1}{1 + \sqrt{(v_x - v_y) \cdot (v_x - v_y)^T}} \tag{6}$$

where $v_x$ and $v_y$ are the vector representations of disease nodes x and y, respectively. The normalizing operation is to make the similarity scores of all diseases fall in the range [0, 1] for fair comparison. A larger Sim(x, y) of two diseases means a higher similarity between them. The whole framework is summarized in Algorithm 1.

IV. EXPERIMENT EVALUATION
A. Dataset Preparation

1) The Datasets: The datasets about associations of diseases and other disease-related entities are obtained from CTDbase (datasets are available at http://ctdbase.org/). In our experiments of disease similarity, the two datasets shown in Table I are the disease-chemical association and the disease-pathway association. The disease-chemical association includes 6,206 disease nodes, 4,180 chemical nodes, and 1,048,547 disease-chemical relationships; the disease-pathway association includes 4,997 disease nodes, 2,338 pathway nodes, and 569,716 disease-pathway relationships. However, after filtering and screening the two raw datasets, the dataset used in the experiments of our method includes 4,986 disease nodes, 4,159 chemical nodes, 2,336 pathway nodes, 1,042,765 disease-chemical relationships, and 569,642 disease-pathway relationships.

2) The Benchmark Set: The benchmark set is composed of 65 disease pairs. Two manually checked datasets of disease pairs with high similarity were integrated into the benchmark set, so these disease pairs have been proved as similar [32], [33].

B. Experiment Setup

The integrated score of each disease node pair is obtained by combining the semantic similarity score and the topological similarity score with a customizable coefficient α. We adjusted the parameter α from 0.1 to 0.9 to explore the influences of different contributions of the two kinds of scores on the experimental results.

In order to investigate the effects of different disease-related data sources on disease similarity results, we also perform experimental comparisons of multiple values of parameter w


TABLE I
DATASETS

Fig. 2. Parameter w from multiple data sources.

Fig. 3. Parameter α from multiple similarity scores.

from 0.1 to 0.9. To discuss the effectiveness of the proposed method, we utilize the disease pairs in the benchmark to evaluate our method, and we set the range of prediction as 1%, 5%, 10% and 20%, then count the correct rate of disease prediction in the different ranges.

C. Evaluation Results

1) Parameter w for Multiple Disease Information Networks: We define the parameters w and 1 − w to indicate the weights of the chemical and pathway data sources on diseases, respectively. We conduct the experiments based on the 65 disease pairs in the benchmark. Receiver operating characteristic (ROC) curves are then drawn with the benchmark set against 50 random sets. Each random set contains 650 randomly selected pairs. The experimental results are shown in Fig. 2.

In Fig. 2(a), each color represents a parameter value of w that varies between [0.1, 0.9]. Each column value in Fig. 2(b) is obtained from the area under the ROC curve (AUC). From Fig. 2, we can see that with the parameter value changing from 0.1 to 0.9, the AUC value tends to decrease gradually, but the overall AUC value can reach 72% or more. When the value of parameter w1 is equal to 0.1 and w2 is equal to 0.9, the correct rate on the benchmark calculated by our proposed method of disease similarity can reach 85%. In our experiment of comparing each single data source, the difference between the similarity results obtained from the chemical-disease and the pathway-disease data source alone is relatively large. Therefore, we can conclude from Fig. 2 that by combining two or more data sources, the proposed method tends to have more stable results.

2) Parameter α for Integrated Similarity Score: In the research, the contributions of the semantic and topological scores are defined by the parameters α and 1 − α for the integrated similarity score. We set the parameter α to change from 0.1 to 0.9, and the experimental results are shown in Fig. 3.

In Fig. 3, each color represents a parameter value that varies between [0.1, 0.9]. Each column value in Fig. 3(b) is obtained from the area under the ROC curve (AUC). From Fig. 3, we can see that in the results of combining two


similarity scores based on two data sources, our proposed method performs best when α = 0.8, where the accuracy reaches 86.8%. However, the accuracy obtained with the other parameter values differs only slightly, so this result shows that the method is, in general, not sensitive to the weight parameter between the two similarity score indicators.

TABLE II
COMPARISON RESULT OF DISEASE SIMILARITY FOR EACH METHOD

TABLE III
ACCURACY OF DISEASE PREDICTION BASED ON BENCHMARK

3) Comparison With Other Methods: To compare the effectiveness of different methods in measuring disease similarity, we evaluate the experimental results with three metrics: average, standard deviation, and coefficient of variation. The average measures the mean similarity score of the disease pairs in the benchmark set, and the standard deviation reflects the degree of dispersion of the similarity scores across disease pairs. The coefficient of variation is the ratio of the standard deviation to the average. Because it reflects the degree of dispersion per unit mean, it allows the relative dispersion of data sets with different mean values to be compared, which is more accurate than using the standard deviation alone. This indicator is used to explain the balance, rhythm, and stability of things in the process of development and change. The smaller the coefficient of variation, the higher the stability of the method.

Table II shows the evaluation results of the comparative experiments on six methods, where the bold numbers indicate the top three methods in each metric. We can clearly see that although the averages of the Resnik [13] and Wang [34] methods are high, their standard deviations are more than double those of the other methods; the PSB [17] and SemFunSim [18] methods are stable, but their average similarity scores are lower than that of our method. Overall, our method has the lowest coefficient of variation, which means that its degree of dispersion per unit mean is the lowest. In the calculation of similarity scores for all disease pairs, our method has better balance and stability than the other methods.

4) Disease Prediction: The similar disease pair set (A − B) in the benchmark is a set of one-to-one correspondences between nodes in the disease set A and nodes in the disease set B. In this section, our experiment takes disease set A as the set of nodes to be predicted and disease set B as the reference set. We use our proposed method to calculate the similarity score between any pair of diseases and, with reference to the corresponding disease nodes in the reference set B, count the accuracy of similar disease predictions in different ranges. Table III compares the results of similar disease predictions based on a chemical-disease data source, a pathway-disease data source, and a dataset that combines the two data sources.

From Table III, we can see that as the scope of prediction expands, the accuracy increases. Moreover, our proposed method, which combines the two data sources, performs better than a single data source: as the range of predictions expands, it is better than prediction based on the chemical-disease data source alone, and it is better than or equal to prediction based on the pathway-disease data source alone. In summary, predicting disease similarity by combining multiple data sources performs better than using a single data source.

V. CONCLUSION

To measure the similarity between diseases, we propose a new approach that utilizes multiple heterogeneous disease information networks. In each network, we integrate the semantic similarity and the topological similarity. Then, the vector representation learning method is applied to transform the disease nodes into multi-dimensional vectors in the same spatial dimension, so that the vectors of the nodes are able to represent and reason about the relationships between diseases. The similarity between vectors is used to express the similarity between diseases, which is more effective for finding similar diseases. In addition, our method applies the coefficient of variation for the first time in the evaluation of the experimental results, for a more balanced comparison of different methods on the benchmark set, which better explains the balance and stability of our method in the task of predicting similar diseases.

REFERENCES

[1] W. Lan, J. Wang, M. Li, J. Liu, F.-X. Wu, and Y. Pan, “Predicting MicroRNA-disease associations based on improved MicroRNA and disease similarities,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 15, no. 6, pp. 1774–1782, Dec. 2018.
[2] W. Lan et al., “LDAP: A Web server for lncRNA-disease association prediction,” Bioinformatics, vol. 33, no. 3, pp. 458–460, Feb. 2017.
[3] O. Vanunu, O. Magger, E. Ruppin, T. Shlomi, and R. Sharan, “Associating genes and protein complexes with disease via network propagation,” PLOS Comput. Biol., vol. 6, no. 1, pp. 1–9, 2010.
[4] M. Li, R. Zheng, Q. Li, J. Wang, F.-X. Wu, and Z. Zhang, “Prioritizing disease genes by using search engine algorithm,” Current Bioinf., vol. 11, no. 2, pp. 195–202, Apr. 2016.
[5] J. Ni, M. Koyuturk, H. Tong, J. Haines, R. Xu, and X. Zhang, “Disease gene prioritization by integrating tissue-specific molecular networks using a robust multi-network model,” BMC Bioinf., vol. 17, no. 1, pp. 453–466, Dec. 2016.
[6] L. Perlman, A. Gottlieb, N. Atias, E. Ruppin, and R. Sharan, “Combining drug and gene similarity measures for drug-target elucidation,” J. Comput. Biol., vol. 18, no. 2, pp. 133–145, Feb. 2011.
[7] A. Gottlieb, G. Y. Stein, E. Ruppin, and R. Sharan, “PREDICT: A method for inferring novel drug indications with application to personalized medicine,” Mol. Syst. Biol., vol. 7, no. 1, pp. 496–505, 2011.


[8] L. Cheng et al., “DisSim: An online system for exploring significant similar diseases and exhibiting potential therapeutic drugs,” Sci. Rep., vol. 6, no. 1, pp. 30024–30030, Jul. 2016.
[9] P. Ni, J. Wang, P. Zhong, Y. Li, F. Wu, and Y. Pan, “Constructing disease similarity networks based on disease module theory,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 17, no. 3, pp. 906–915, May/Jun. 2020.
[10] R. Qin, L. Duan, H. Zheng, J. Li-Ling, K. Song, and X. Lan, “RADAR: Representation learning across disease information networks for similar disease detection,” in Proc. IEEE Int. Conf. Bioinf. Biomed. (BIBM), Dec. 2018, pp. 482–487.
[11] M. Ashburner et al., “Gene ontology: Tool for the unification of biology,” Nature Genet., vol. 25, no. 1, pp. 25–29, 2000.
[12] L. M. Schriml et al., “Disease ontology: A backbone for disease semantic integration,” Nucleic Acids Res., vol. 40, no. D1, pp. D940–D946, Jan. 2012.
[13] P. Resnik, “Using information content to evaluate semantic similarity in a taxonomy,” in Proc. 14th Int. Joint Conf. Artif. Intell., 1995, pp. 448–453.
[14] D. Lin et al., “An information-theoretic definition of similarity,” in Proc. ICML. Los Alamitos, CA, USA: Citeseer, 1998, pp. 296–304.
[15] S. Bandyopadhyay and K. Mallick, “A new path based hybrid measure for gene ontology similarity,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 1, pp. 116–127, Jan. 2014.
[16] C. E. Lipscomb, “Medical subject headings (MeSH),” Bull. Med. Library Assoc., vol. 88, no. 3, pp. 265–266, Jul. 2000.
[17] S. Mathur and D. Dinakarpandian, “Finding disease similarity based on implicit semantic similarity,” J. Biomed. Informat., vol. 45, no. 2, pp. 363–371, Apr. 2012.
[18] L. Cheng, J. Li, P. Ju, J. Peng, and Y. Wang, “SemFunSim: A new method for measuring disease similarity by integrating semantic and gene functional association,” PLoS ONE, vol. 9, no. 6, pp. 1–11, 2014.
[19] M. B. Hamaneh and Y.-K. Yu, “Relating diseases by integrating gene associations and information flow through protein interaction network,” PLoS ONE, vol. 9, no. 10, pp. 1–14, 2014.
[20] K. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A. L. Barabási, “The human disease network,” Proc. Nat. Acad. Sci. USA, vol. 104, no. 21, pp. 8685–8690, 2007.
[21] X. Zhang et al., “The expanded human disease network combining protein–protein interaction information,” Eur. J. Hum. Genet., vol. 19, no. 7, pp. 783–788, Jul. 2011.
[22] S. Mathur and D. Dinakarpandian, “Automated ontological gene annotation for computing disease similarity,” Summit Transl. Bioinf., vol. 2010, pp. 12–16, Aug. 2010.
[23] J. Gao, L. Tian, T. Lv, J. Wang, B. Song, and X. Hu, “Protein2Vec: Aligning multiple PPI networks with representation learning,” IEEE/ACM Trans. Comput. Biol. Bioinf., early access, Aug. 27, 2019, doi: 10.1109/TCBB.2019.2937771.
[24] P. Li, Y. Nie, and J. Yu, “Fusing literature and full network data improves disease similarity computation,” BMC Bioinf., vol. 17, no. 1, pp. 326–339, Dec. 2016.
[25] G. Yu, L.-G. Wang, G.-R. Yan, and Q.-Y. He, “DOSE: An R/Bioconductor package for disease ontology semantic and enrichment analysis,” Bioinformatics, vol. 31, no. 4, pp. 608–609, Feb. 2015.
[26] D. Wang, J. Wang, M. Lu, F. Song, and Q. Cui, “Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases,” Bioinformatics, vol. 26, no. 13, pp. 1644–1650, Jul. 2010.
[27] A. Schlicker, F. S. Domingues, J. Rahnenführer, and T. Lengauer, “A new measure for functional similarity of gene products based on gene ontology,” BMC Bioinf., vol. 7, no. 1, pp. 302–318, Dec. 2006.
[28] Y. Wang, Y. Yao, H. Tong, F. Xu, and J. Lu, “A brief review of network embedding,” Big Data Mining Analytics, vol. 2, no. 1, pp. 35–47, Mar. 2019.
[29] C. Sun, Q. Li, L. Cui, H. Li, and Y. Shi, “Heterogeneous network-based chronic disease progression mining,” Big Data Mining Analytics, vol. 2, no. 1, pp. 25–34, Mar. 2019.
[30] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta path-based top-K similarity search in heterogeneous information networks,” Proc. VLDB Endowment, vol. 4, no. 11, pp. 992–1003, Aug. 2011.
[31] S. Salvador and P. Chan, “Toward accurate dynamic time warping in linear time and space,” Intell. Data Anal., vol. 11, no. 5, pp. 561–580, Oct. 2007.
[32] S. Suthram, J. T. Dudley, A. P. Chiang, R. Chen, T. J. Hastie, and A. J. Butte, “Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets,” PLOS Comput. Biol., vol. 6, no. 2, pp. 1–10, 2010.
[33] S. Pakhomov, B. McInnes, T. Adam, Y. Liu, T. Pedersen, and G. B. Melton, “Semantic similarity and relatedness between clinical terms: An experimental study,” in Proc. AMIA Annu. Symp., 2010, pp. 572–576.
[34] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C.-F. Chen, “A new method to measure the semantic similarity of GO terms,” Bioinformatics, vol. 23, no. 10, pp. 1274–1281, May 2007.
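
ILLUSTRATIVE NOTE

The three evaluation metrics used in the comparison with other methods (average, standard deviation, and coefficient of variation) can be illustrated with a minimal Python sketch. This is not the authors' code. The function name `evaluate_scores` and the two score lists are hypothetical, and the use of the population standard deviation is an assumption, since the paper does not state which variant is used.

```python
from statistics import mean, pstdev


def evaluate_scores(scores):
    """Return (average, standard deviation, coefficient of variation).

    The coefficient of variation is the standard deviation divided by
    the average, i.e. the dispersion per unit mean.
    """
    avg = mean(scores)
    if avg == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Assumption: population standard deviation; the paper does not specify.
    sd = pstdev(scores)
    return avg, sd, sd / avg


# Hypothetical similarity scores for benchmark disease pairs (not from the paper).
method_a = [0.9, 0.2, 0.8, 0.3]  # widely scattered scores
method_b = [0.6, 0.5, 0.7, 0.6]  # more tightly clustered scores

avg_a, sd_a, cv_a = evaluate_scores(method_a)
avg_b, sd_b, cv_b = evaluate_scores(method_b)

print(f"method A: average={avg_a:.3f}, std={sd_a:.3f}, cv={cv_a:.3f}")
print(f"method B: average={avg_b:.3f}, std={sd_b:.3f}, cv={cv_b:.3f}")
```

In this toy input, method B has the smaller coefficient of variation, so under the criterion described in the paper it would be considered the more stable of the two, even though the two methods have different averages.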
