
Journal of Bioinformatics and Computational Biology

Vol. 17, No. 4 (2019) 1950020 (18 pages)


© World Scientific Publishing Europe Ltd.
DOI: 10.1142/S0219720019500203

Inferring disease and pathway associations of long non-coding RNAs using heterogeneous information network model

P. V. Sunil Kumar* and G. Gopakumar†


Department of Computer Science and Engineering
National Institute of Technology Calicut
Kozhikkode, Kerala 673601, India
*sunilkumar_p130073cs@nitc.ac.in
†gopakumarg@nitc.ac.in

Received 27 December 2018


Accepted 29 April 2019
Published 29 August 2019

Recent findings from biological experiments demonstrate that long non-coding RNAs (lncRNAs) are actively involved in critical cellular processes and are associated with innumerable diseases. Computational prediction of lncRNA–disease associations currently draws tremendous research attention. This paper proposes a machine learning model that predicts lncRNA–disease associations using a Heterogeneous Information Network (HIN) of lncRNAs and diseases. A Support Vector Machine classifier is developed using the feature set extracted from a meta-path-based parameter, the Association Index, derived from the HIN. The performance of the model is validated using standard statistical metrics; it achieved an AUC value of 0.87, which is better than the existing methods in the literature. Results are further validated using the recent literature, and many of the predicted lncRNA–disease associations are confirmed to exist. This paper also proposes an HIN-based methodology to associate lncRNAs with pathways in which they may have biological influence. A case study on the pathway associations of four well-known lncRNAs (HOTAIR, TUG1, NEAT1, and MALAT1) has been conducted. It has been observed that the same lncRNA is often associated with more than one biologically related pathway. Further exploration is needed to substantiate whether such lncRNAs have any role in determining the pathway interplay. The script and sample data for the model construction are freely available at http://bdbl.nitc.ac.in/LncDisPath/index.html.

Keywords: LncRNA–disease association; lncRNA–pathway interaction; heterogeneous information network; meta-path; support vector machine.

1. Introduction
Accumulating evidence identifies long non-coding RNAs (lncRNAs) as active players in a variety of biological mechanisms and numerous diseases. An RNA molecule with very low protein-coding potential is known as a non-coding RNA (ncRNA), and ncRNAs longer than 200 nucleotides are lncRNAs.1 As per recent studies, lncRNAs are actively involved in processes such as chromatin modification, transcriptional, post-transcriptional and epigenetic regulation, imprinting, cell proliferation, survival, and many more.2–4 Moreover, a myriad of clinical validations linking lncRNAs with human diseases have been reported recently.1 They correlate mutations and dysregulations of lncRNAs with human diseases such as cardiovascular diseases, neurodegenerative disorders, and various cancer types.3,5 Since the complete annotation of lncRNAs is progressing slowly, the number of disease-associated lncRNAs is less than 1% of the total spectrum of identified lncRNAs.1
Identification of disease associations of lncRNAs has already drawn substantial research attention. Diagnosis and therapeutics have received fresh impetus ever since lncRNAs were proven to be potential disease biomarkers.5 Additionally, lncRNA-based strategies assist quick interpretation of the molecular mechanisms of diseases.3 Full functional annotation of lncRNAs is a prerequisite to establish their disease associations. The yawning gap between the discovery rate and the annotation rate of lncRNAs makes experimental determination of lncRNA–disease associations tedious and expensive. Therefore, only a small fraction of the lncRNA–disease associations have been reported to date,2,3 and the computational prediction of lncRNA–disease associations is emerging as a promising research direction.
Noteworthy attempts have already been made at computational prediction of lncRNA–disease associations. KATZLDA7 by Chen et al. and the effort by Liu et al.8 used biological information of lncRNAs to mine disease associations. LRLSLDA9 and LNCSIM10 applied machine learning techniques for the same purpose. Methods by Sun et al.11 and Chen et al.12 used random walk with restart to infer lncRNA–disease associations. Gu et al.5 proposed GrwLDA, which is a random walk-based approach. Yao et al.13 proposed a multilevel composite network-based approach, LncPriCNet. The method in Ref. 2 uses an lncRNA–disease–gene tripartite graph to predict lncRNA–disease associations. The scheme proposed by Zhang et al.3 integrates multiple heterogeneous networks of lncRNA, disease, and protein, and Lu et al.6 use the notion of inductive matrix completion to derive lncRNA–disease associations.
Most of the aforementioned methods are concerned only with lncRNA–disease associations. Other disease-related arenas of computational biology (such as gene/protein–disease interactions and disease–drug associations) have extended their research one level further. They visualize the cause, effect, etiology, and clinical manifestations of human diseases in terms of pathways. This is justifiable because disease-causing genes/proteins function as a system of mutual interactions called pathways.14 Moreover, diseases are interconnected through different pathways and do not stay isolated.15 At this point, lncRNA research suffers from a serious shortfall in that it has not yet extended its bounds to pathways. A notable previous work that connects lncRNAs to pathways is that of Han et al.,16 which connects pathways to a selected set of lncRNAs. This method is not generic, as it works only for a predefined input set of query lncRNAs.
Motivated by these gaps in the literature, we introduce a new methodology for the prediction of lncRNA–disease–pathway association, which effectively utilizes the Heterogeneous Information Network (HIN) principles. An lncRNA–disease HIN (LDHIN) is constructed by mapping the existing associations of lncRNAs with diseases. A new meta-path-based algorithm is proposed to derive a feature set from the LDHIN. This feature set is used to train a Support Vector Machine (SVM) classifier, which predicts novel lncRNA–disease associations. Afterwards, a study to establish the pathway associations of lncRNAs using disease associations and gene set enrichment is conducted. This work connects lncRNAs and pathways in two steps: lncRNA–disease and disease–pathway. Literature mining reveals that many of the lncRNA–pathway associations determined by our method are experimentally reported. Upon validation with statistical parameters, the results are found to be promising.
The rest of the paper is organized as follows: Sec. 2 includes the details of data preparation and the workflow of the proposed model. Section 3 presents the experimental results, the extracted lncRNA–disease associations and lncRNA–pathway associations. Additionally, it includes a case study of lncRNAs associated with three common diseases. A brief discussion on the various aspects and biological implications of the results is provided in Sec. 4. Section 5 outlines the concluding remarks and future directions.

2. Materials and Methods


This section details the preparation of data and the systematic procedures adopted for the experiment. Since the HIN acts as the tool for study, formal definitions of the pivotal concepts of HIN are presented first.

2.1. Preliminaries
Formal definitions associated with HIN that are used in the rest of the paper are given here.
Heterogeneous Information Network17,18:
An Information Network is defined as a directed graph $G = (V, E)$ with an object-type mapping function $\phi : V \to T$ and a link-type mapping function $\psi : E \to R$, where each object $v \in V$ belongs to one particular object type $\phi(v) \in T$, each link $e \in E$ belongs to a particular relation $\psi(e) \in R$, and if two links belong to the same relation type, the two links share the same starting object type as well as the ending object type.
The information network is called an HIN if the number of types of relations $|R| > 1$ or the number of types of objects $|T| > 1$.
Network Schema17,18:
The network schema, denoted as $S_G = (T, R)$, is a meta-template for an information network $G = (V, E)$ with the object-type mapping $\phi : V \to T$ and the link-type mapping $\psi : E \to R$, which is a directed graph defined over object types $T$, with edges as relations from $R$. The network schema of an HIN specifies type constraints on the sets of objects and relationships among the objects.
Meta-Path17,18:
A meta-path $P$ is a path defined on a schema $S_G = (T, R)$, and is denoted in the form $T_1 \xrightarrow{R_1} T_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} T_{l+1}$, which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ over the object types $T_1, T_2, \ldots, T_{l+1}$, where $\circ$ denotes the composition operator on relations.

2.2. Workflow of the proposed method

The workflow of the proposed lncRNA–pathway association prediction consists of five phases: (a) data preparation, (b) construction of LDHIN, (c) computation of the proposed meta-path-based parameter, Association Index, (d) identification of novel lncRNA–disease associations, and (e) identification of novel lncRNA–pathway associations. A schematic diagram of the workflow is shown in Fig. 1.

Fig. 1. The workflow of the proposed method consists of five phases. The novel lncRNA–disease associations and lncRNA–pathway associations are the end results.


2.2.1. Data preparation


Existing lncRNA–disease associations are the input to the proposed methodology. These are taken from lncRNADisease,19 which is an open and curated repository for lncRNA–disease association data from high-quality experiments. The 2017 release consists of a total of 2946 interactions of 888 lncRNAs and 328 diseases spread across three different organisms. In our experiment, the associations are limited to Homo sapiens, with 2907 experimentally supported interactions of 828 lncRNAs and 314 diseases.
In order to construct the HIN of lncRNAs and diseases, the ontology identifiers (DOID) of all the 314 diseases are taken from Disease Ontology,20 an open source ontology for the integration of biomedical data that are associated with human disease. The diseases for which the ontology data are not available are discarded. This resulted in the exclusion of 173 diseases. The final dataset has 1115 interactions of 598 lncRNAs and 155 diseases for which DOIDs are available.

2.2.2. Construction of LDHIN


The LDHIN is constructed by mapping lncRNAs and diseases to objects and their
mutual interactions to links of the HIN. Additional links such as lncRNA–lncRNA
and disease–disease are incorporated using similarity scores.
LncRNA–lncRNA similarity score11: To determine the functional similarity score between lncRNAs, the similarity of a disease with a disease group is a prerequisite. If $d$ is any disease and $D = \{d_1, d_2, \ldots, d_k\}$ is any disease group, then the similarity of the disease with the group is defined as
$$\mathrm{SIM}(d, D) = \max_{1 \le j \le k} \mathrm{Sim}(d, d_j),$$
where $\mathrm{Sim}(d, d_j)$ is the similarity score between the diseases $d$ and $d_j$.

Now, let an lncRNA $l_1$ be associated with a set $D_1 = \{d_{11}, d_{12}, \ldots, d_{1m}\}$ of $m$ diseases, and another lncRNA $l_2$ with a set $D_2 = \{d_{21}, d_{22}, \ldots, d_{2n}\}$ of $n$ diseases. The functional similarity score between $l_1$ and $l_2$ is estimated as
$$\mathrm{LncSim}(l_1, l_2) = \frac{\sum_{1 \le i \le m} \mathrm{SIM}(d_{1i}, D_2) + \sum_{1 \le j \le n} \mathrm{SIM}(d_{2j}, D_1)}{m + n},$$
where $d_{1i} \in D_1$ and $d_{2j} \in D_2$.
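The two definitions above can be traced on a small example. The following Python sketch is only an illustration under stated assumptions: the function names `sim_to_group` and `lnc_sim`, the disease labels, and the toy similarity values are hypothetical and are not taken from the paper's script. In the actual workflow the disease similarity is obtained from doSim, as described next.

```python
def sim_to_group(d, group, sim):
    """SIM(d, D): best similarity between disease d and any member of group D."""
    return max(sim(d, dj) for dj in group)


def lnc_sim(diseases_1, diseases_2, sim):
    """LncSim(l1, l2) for two lncRNAs given their associated disease sets."""
    total = sum(sim_to_group(d, diseases_2, sim) for d in diseases_1)
    total += sum(sim_to_group(d, diseases_1, sim) for d in diseases_2)
    return total / (len(diseases_1) + len(diseases_2))


# Hypothetical symmetric disease similarity scores (not real data).
_toy_scores = {
    frozenset(("a", "b")): 0.6,
    frozenset(("a", "c")): 0.2,
    frozenset(("b", "c")): 0.4,
}


def toy_sim(x, y):
    """A disease is fully similar to itself; other pairs use the toy table."""
    return 1.0 if x == y else _toy_scores[frozenset((x, y))]


# l1 is linked to diseases {a, b}; l2 is linked to diseases {b, c}.
score = lnc_sim(["a", "b"], ["b", "c"], toy_sim)
print(round(score, 4))  # (0.6 + 1.0 + 1.0 + 0.4) / 4
```

Each disease contributes its best match in the other set, so a shared disease contributes the maximum value of 1.0 from both sides.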
Disease–disease similarity score11: A pair of diseases is considered to be associated if they are semantically similar. The similarity score between a pair of diseases is computed from their ontology terms using the doSim function from the DOSE package in R.21
Fig. 2. Cumulative frequency distribution of links between lncRNAs and diseases: (a) for lncRNAs, (b) for diseases. For lncRNA links, the cumulative frequency stabilized at a similarity score of 0.5 and for disease links at 0.4.

Drawing links in LDHIN: The lncRNA–disease links are drawn using the dataset prepared. To draw lncRNA–lncRNA and disease–disease links, their respective similarity scores are used. The concept of cumulative frequency distribution is applied to regularize the number of links between lncRNAs and between diseases. The cumulative frequency distributions of the number of links for lncRNAs and diseases at various similarity cutoffs are depicted in Figs. 2(a) and 2(b), respectively. It can be observed that the number of links between lncRNAs and between diseases stabilizes at similarity scores of 0.5 and 0.4, respectively. Therefore, in the LDHIN, an edge is drawn between a pair of lncRNAs if the functional similarity score is 0.5 or above, and between a pair of diseases if the score is 0.4 or above.
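The cutoff rule can be sketched as follows. This is a minimal illustration, assuming the similarity scores are held in a dictionary keyed by object pairs; the helper names `links_at_cutoff` and `link_counts` and all scores are hypothetical.

```python
def links_at_cutoff(scores, cutoff):
    """Return the pairs whose similarity score is at or above the cutoff."""
    return sorted(pair for pair, score in scores.items() if score >= cutoff)


def link_counts(scores, cutoffs):
    """Number of links that would be drawn for each candidate cutoff."""
    return {c: len(links_at_cutoff(scores, c)) for c in cutoffs}


# Hypothetical lncRNA-lncRNA functional similarity scores.
toy_lnc_scores = {
    ("l1", "l2"): 0.82,
    ("l1", "l3"): 0.55,
    ("l2", "l3"): 0.31,
    ("l2", "l4"): 0.12,
}

# Links retained at the lncRNA cutoff of 0.5 used in the text.
lnc_links = links_at_cutoff(toy_lnc_scores, 0.5)

# Link counts over a sweep of cutoffs, the raw material for a plot like Fig. 2.
counts = link_counts(toy_lnc_scores, [0.1, 0.3, 0.5, 0.7])
print(lnc_links, counts)
```

The same helper applies to disease pairs with the cutoff set to 0.4.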
The details of objects and links in the LDHIN under study are presented in Table 1 and a portion of the LDHIN is shown in Supplementary Fig. 1.

Table 1. Details of objects and links in the LDHIN.

#Objects                      #Links
LncRNA (L)   Diseases (D)     LD      LL       DD
598          155              1115    32026    1777

2.2.3. Computation of Association Index (A)

The proposed Association Index (A) is used to construct the feature set for the SVM classifier that performs the prediction. Association Index (A) estimates the strength of association between an lncRNA and a disease in the LDHIN. It is derived from two proposed parameters, Path Count and Path Index, which are defined as follows:
Path Count (C): It is defined as the number of meta-paths of type $P_k$ between $\mathrm{lncRNA}_i$ and $\mathrm{disease}_j$. For the LDHIN, path count can be represented as a three-dimensional matrix:
$$C[i, j, k] = \text{number of } P_k \text{ meta-paths between } \mathrm{lncRNA}_i \text{ and } \mathrm{disease}_j \text{ in the LDHIN}.$$
Path Index (I): Path index measures the ratio of meta-paths of a certain type to all possible meta-paths in an lncRNA–disease pair. This parameter is defined to estimate the influence of a particular meta-path on a specific lncRNA–disease pair. Extending to the LDHIN, the three-dimensional path index matrix is obtained as
$$I[i, j, k] = \frac{C[i, j, k]}{\min(\deg(\mathrm{lncRNA}_i), \deg(\mathrm{disease}_j))},$$
where $\deg(x)$ represents the degree of an object $x$ in the LDHIN. Isolated objects ($\deg(x) = 0$) are excluded from the computation.
Generally in HIN analysis, the path length plays a vital role in defining the semantics of meta-paths. Semantics expressed by very long meta-paths are considered to be less significant in HIN-based problems.17 Therefore, the definition of Path Index is modified to Association Index in such a way that the effect of meta-path length is also taken into consideration.
Association Index (A): Association Index (A) modifies I for an lncRNA–disease pair depending on the length of the meta-path for which it is defined. While computing the strength of association, each meta-path must be assigned a weight that is inversely proportional to its length. To ensure this, while modifying I to A, an inhibiting factor $\beta$ is introduced. The inhibiting factor suppresses the strength of the association as the meta-path length increases.
The Association Index for an lncRNA–disease pair is defined as the inhibiting factor ($\beta$) times the Path Index. For the LDHIN, A is the three-dimensional matrix:
$$A[i, j, k] = \beta \, I[i, j, k],$$
where $\beta = (\text{length of meta-path } P_k)^{-1}$.
Algorithm 1 returns the Association Index matrix (A[ ]) for the LDHIN. A pictorial representation of the Association Index matrix for m lncRNAs, n diseases and k meta-paths is given in Supplementary Fig. 2.
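The three quantities can be traced on a toy network. The sketch below is an illustration under stated assumptions, not the authors' script: the graph is undirected and stored as an adjacency dictionary, a meta-path is written as a string of type letters (for example "LLD"), path instances are counted as type-constrained walks in which objects may repeat, and the degree counts links of all types. The names `count_paths` and `association_index` and the four-node network are hypothetical.

```python
def count_paths(adj, node_type, start, end, meta_path):
    """Path Count: walks from start to end whose object types spell meta_path."""
    if node_type[start] != meta_path[0]:
        return 0
    frontier = {start: 1}  # object -> number of partial walks ending there
    for wanted_type in meta_path[1:]:
        nxt = {}
        for node, n_walks in frontier.items():
            for neighbour in adj[node]:
                if node_type[neighbour] == wanted_type:
                    nxt[neighbour] = nxt.get(neighbour, 0) + n_walks
        frontier = nxt
    return frontier.get(end, 0)


def association_index(adj, node_type, lnc, dis, meta_path):
    """A = beta * C / min(deg(lnc), deg(dis)), with beta = 1 / path length."""
    denom = min(len(adj[lnc]), len(adj[dis]))
    if denom == 0:
        return 0.0  # isolated objects are left out of the computation
    path_index = count_paths(adj, node_type, lnc, dis, meta_path) / denom
    beta = 1.0 / (len(meta_path) - 1)  # length = number of links in the path
    return beta * path_index


# Toy LDHIN with links l1-l2, l2-d1, l1-d2 and d2-d1 (hypothetical data).
node_type = {"l1": "L", "l2": "L", "d1": "D", "d2": "D"}
adj = {
    "l1": {"l2", "d2"},
    "l2": {"l1", "d1"},
    "d1": {"l2", "d2"},
    "d2": {"l1", "d1"},
}

a_lld = association_index(adj, node_type, "l1", "d1", "LLD")    # via l2
a_ldd = association_index(adj, node_type, "l1", "d1", "LDD")    # via d2
a_llld = association_index(adj, node_type, "l1", "d1", "LLLD")  # no such walk
print(a_lld, a_ldd, a_llld)
```

For the pair (l1, d1) both length-two meta-paths have one instance, both objects have degree two, and β is 1/2, so each feature equals 0.5 × 0.5 = 0.25.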

2.2.4. Prediction of novel LncRNA–disease associations


The novel lncRNA–disease associations are predicted by constructing an SVM classifier, whose feature set is constructed using the Association Index between all lncRNA–disease pairs for certain selected meta-paths.
It is computationally infeasible to consider every possible lncRNA–disease meta-path, as the number can grow indefinitely. Besides, the semantic relevance of meta-paths decreases when the length exceeds a certain threshold.17,22 In practice, HIN-based studies fix some optimum threshold for meta-path length for (a) guaranteeing computational feasibility and (b) discarding semantically irrelevant meta-paths. The value of this threshold is generally fixed randomly, empirically or based on some heuristics.


Algorithm 1. ComputeAssociationIndex
Input:
    G(V, E): the LDHIN
Output:
    Association Index matrix A[ ] for the LDHIN
Initialize:
    T = {lncRNA, disease}
    Define φ : V → T, the object-type mapping of G

for each meta-path Pk in G do
    l ← Length(Pk)
    β ← 1/l
    for each vi, vj ∈ V do
        if φ(vi) = Start(Pk) and φ(vj) = End(Pk) then
            C[i, j, k] ← number of Pk paths between vi and vj
            I[i, j, k] ← C[i, j, k] / min(deg(vi), deg(vj))
            A[i, j, k] ← β · I[i, j, k]
        end if
    end for
end for
return A[ ]

function Length(P)
    return the length of path P
end function
function Start(P)
    return the starting object type of path P
end function
function End(P)
    return the ending object type of path P
end function

An empirical strategy is adopted in the proposed method. We conducted experiments with meta-paths of different length thresholds, starting from two. The threshold is incremented in steps and the accuracy is estimated at each step. Results improved consistently until the threshold reached four. Afterwards, the improvement became too feeble to justify the computational overhead of processing longer meta-paths. Hence, in this work, the optimum meta-path length is set at four and a total of 14 meta-paths are taken. Details of the experiment are provided in Sec. 3.
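The count of 14 is consistent with a simple enumeration, assuming (this is not stated explicitly in the text) that the schema allows lncRNA–lncRNA, lncRNA–disease and disease–disease links traversed in either direction, and that every meta-path of interest starts at an lncRNA and ends at a disease. The sketch below uses the hypothetical helper `enumerate_meta_paths` and writes each meta-path as a string of type letters, L for lncRNA and D for disease.

```python
from itertools import product


def enumerate_meta_paths(min_len=2, max_len=4, start="L", end="D", types=("L", "D")):
    """List type sequences from start to end with min_len..max_len links."""
    paths = []
    for length in range(min_len, max_len + 1):
        # A path with `length` links has length - 1 free intermediate types.
        for middle in product(types, repeat=length - 1):
            paths.append(start + "".join(middle) + end)
    return paths


meta_paths = enumerate_meta_paths()
print(len(meta_paths), meta_paths[:2])
```

Under these assumptions there are 2 meta-paths of length two, 4 of length three and 8 of length four, which gives 2 + 4 + 8 = 14.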


Fig. 3. Details of novel lncRNA–disease associations. (a) Statistics of novel associations: the pie distribution of novel associations validated using the databases MNDR and lncRNADisease. The unconfirmed region represents the associations that could not be validated using literature. (b) Confirmation percentage in various rank groups: the percentage of confirmed novel associations in various rank groups.

Thus, the experiment consists of 1115 associations of 598 lncRNAs, 155 diseases (Table 1), and 14 meta-paths. First, the Association Index matrix is prepared from these data. When vectorized along the tube (meta-path) direction, the matrix yields the Association Index value for an lncRNA–disease pair corresponding to each meta-path. Each value acts as one feature attribute for that pair. Combining the association indexes for all meta-paths provides the complete feature vector with 14 attributes for that particular pair. The process is repeated for all possible pairs to get the complete feature set, which is shown diagrammatically in Supplementary Fig. 3.
Based on this feature set, an SVM classifier is modeled to perform the prediction of novel lncRNA–disease associations. The implementation is done with Caret23 in R with the "e1071" library to obtain the prediction probabilities of lncRNA–disease associations. The prediction probability is used as the predictor score for each association, and the associations are ranked based on this score.

2.2.5. LncRNA–pathway association


The lncRNA–disease associations along with the technique of gene set enrichment are applied to associate lncRNAs with pathways. Initially, the validated novel lncRNA–disease associations identified by the model and the already existing ones are combined into a single set. For each lncRNA, the diseases connected to it are identified. The genes associated with all such diseases are extracted from the curated gene–disease data repository DisGeNET24 and the lncRNA is connected to those genes. The associated genes of each lncRNA are placed in different sets to perform the enrichment process. Enrichment is conducted using the Enrichr tool25 and its associated R interface enrichR. Out of the 35 gene-set libraries included in Enrichr, KEGG 2016 (Kyoto Encyclopedia of Genes and Genomes)26 is selected to establish lncRNA–pathway associations because it includes a separate category for human diseases. Moreover, it includes the pathways of many of the diseases that are present in our study.
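The two-step linkage (lncRNA to diseases to genes, then genes to pathways) can be sketched as below. This is an illustration only: the enrichment step uses a plain hypergeometric over-representation test as a stand-in, and Enrichr's own scoring may differ. The helper names, identifiers, and the background size of 10 genes are hypothetical.

```python
from math import comb


def genes_for_lncrna(lnc, lnc_to_diseases, disease_to_genes):
    """Union of the genes of every disease linked to the given lncRNA."""
    genes = set()
    for disease in lnc_to_diseases.get(lnc, ()):
        genes |= disease_to_genes.get(disease, set())
    return genes


def enrichment_p_value(gene_set, pathway_genes, background_size):
    """P(overlap >= observed) when drawing len(gene_set) genes at random."""
    k = len(gene_set & pathway_genes)  # observed overlap
    n = len(gene_set)
    big_k = len(pathway_genes)
    tail = sum(
        comb(big_k, i) * comb(background_size - big_k, n - i)
        for i in range(k, min(n, big_k) + 1)
    )
    return tail / comb(background_size, n)


# Hypothetical associations (not taken from DisGeNET or KEGG).
lnc_to_diseases = {"lncA": ["dis1", "dis2"]}
disease_to_genes = {"dis1": {"g1", "g2"}, "dis2": {"g2", "g4"}}
toy_pathway = {"g1", "g2", "g3"}

gene_set = genes_for_lncrna("lncA", lnc_to_diseases, disease_to_genes)
p_value = enrichment_p_value(gene_set, toy_pathway, background_size=10)
print(sorted(gene_set), p_value)
```

With three genes drawn from a background of ten and two of them in a three-gene pathway, the tail probability is (21 + 1) / 120.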

3. Results
The model predicted 668 novel lncRNA–disease associations. In order to validate the prediction results, we used two well-known curated open repositories, MNDR (Mammal NcRNA Disease Repository)27 and Lnc2Cancer,28 which contain experimentally verified high-quality interaction data of lncRNAs and diseases. Using MNDR, 390 associations were validated, and 40 associations were confirmed with Lnc2Cancer. We could not validate 238 of the predicted associations due to the lack of evidence in the current literature. The statistics of the prediction are provided in Fig. 3(a).
The validation repository MNDR is an integration of ncRNA–disease associations from 10 resources including lncRNADisease. While constructing the model, we considered the associations from lncRNADisease alone. Hence, during the model construction and prediction process, the additional associations in MNDR are not seen by the model. Once the prediction process is over, the novel associations are filtered out and verified using MNDR. We termed the predictions "novel" only if they are not present in lncRNADisease. Thus, as far as the prediction model is concerned, they are new ones and MNDR is qualified to validate them, as it is a superset of lncRNADisease.
In order to assess the effectiveness and utility of the model, the prediction results are divided into four groups based on their ranks (associations in the top 25, top 50, top 100, and top 200 ranks) and the percentage of confirmed novel associations in each group is measured. It is observed that a high percentage of novel associations is validated in the top-ranked groups. Even for the group of associations in the top 200 ranks, the confirmation percentage is above 75. The results are shown in Fig. 3(b), which clearly demonstrates the usefulness of the proposed method in predicting lncRNA–disease associations from the LDHIN. The detailed list of predictions including the rank and reference (wherever available) of each association is provided in Supplementary Table 1.
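The rank-group measurement amounts to a top-k confirmation rate. A minimal sketch follows, assuming the predictions are already sorted by predictor score and the confirmed associations are available as a set; the function name `confirmation_by_rank_group` and the four toy predictions are hypothetical.

```python
def confirmation_by_rank_group(ranked, confirmed, groups=(25, 50, 100, 200)):
    """Percentage of confirmed predictions among the top-k ranks, for each k."""
    result = {}
    for k in groups:
        top = ranked[:k]
        if not top:
            continue  # nothing to measure for an empty group
        hits = sum(1 for pair in top if pair in confirmed)
        result[k] = 100.0 * hits / len(top)
    return result


# Hypothetical ranked predictions (best first) and their confirmation status.
ranked = [("l1", "d1"), ("l2", "d3"), ("l3", "d2"), ("l4", "d4")]
confirmed = {("l1", "d1"), ("l3", "d2"), ("l4", "d4")}

rates = confirmation_by_rank_group(ranked, confirmed, groups=(2, 4))
print(rates)
```

In this toy case one of the top two and three of the top four predictions are confirmed.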
In order to enhance the credibility of the unconfirmed cases, an investigation is conducted to check whether the lncRNAs are expressed in tissues related to the corresponding diseases. The NONCODE46 repository is used for this purpose. NONCODE contains expression values of lncRNA transcripts in various tissues, measured in FPKM (Fragments Per Kilobase of transcript per Million mapped reads). Out of the 238 unconfirmed lncRNA–disease associations in Supplementary Table 1, expression details of 164 lncRNAs are found in NONCODE. Among these 164 lncRNAs, 128 are found to be expressed in tissues related to the associated diseases.

Fig. 4. Variation of the number of meta-paths and the accuracy against meta-path length. The increment in accuracy justifies the number of paths to be processed up to a path length of 4. Afterwards, the increment in accuracy is almost linear or settles to a constant value, whereas that of the number of meta-paths is close to exponential.

3.1. Determining optimum meta-path length


In this section, the results of the procedure described in Sec. 2.2.4 to empirically determine the optimum value of the meta-path length threshold are presented.
Meta-paths of length one are direct edges in the LDHIN and are trivial. Hence, meta-paths of length two and above are considered. The variation of the number of meta-paths to be processed and of the accuracy against meta-path length is shown in Fig. 4. It is clear that from meta-path length two to three and from three to four, the improvement in accuracy is abrupt, and afterwards it is feeble. After four, the accuracy does not improve to a level that can be justified against the extra computational cost of processing the additional paths. From the properties of HIN, the same trend is expected after six and the trial is stopped at six (Fig. 4 shows the trend up to six), fixing the optimum meta-path length at four.

3.2. Performance evaluation


The probabilities assigned by the SVM classifier to each lncRNA–disease association are taken as the predictor score. All pairs are sorted based on the predictor score. For a specific threshold score, the number of known samples with a predictor score higher than the threshold are True Positives (TP) and the number of unknown samples with a predictor score lower than the threshold are True Negatives (TN). False Positives (FP) are the number of unknown associations with a predictor score above the threshold, whereas False Negatives (FN) are known associations with a score below the threshold.
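These four counts can be computed directly from the scored pairs. The sketch below is illustrative: scores equal to the threshold are treated as "above" (an assumption, since ties are not discussed in the text), and the function name `confusion_counts` and the toy scores are hypothetical.

```python
def confusion_counts(scores, known, threshold):
    """Return (TP, FP, TN, FN) for predictor scores at a given threshold."""
    tp = fp = tn = fn = 0
    for pair, score in scores.items():
        predicted_positive = score >= threshold  # ties count as positive
        if pair in known:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


# Hypothetical predictor scores; pairs a, c and e are known associations.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.3, "e": 0.2, "f": 0.1}
toy_known = {"a", "c", "e"}

tp, fp, tn, fn = confusion_counts(toy_scores, toy_known, threshold=0.5)
print(tp, fp, tn, fn)
```

Sweeping the threshold over the sorted scores and recomputing these counts yields the points of an ROC curve.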
Fig. 5. Performance evaluation of the prediction model. (a) ROC curve and AUC value. (b) Values obtained for different performance metrics at four different thresholds.

The prediction performance is validated with Leave One Out Cross-Validation (LOOCV). One known lncRNA–disease pair is left out as the test sample in rotation and the other experimentally verified associations are considered as training samples. Associations which are not experimentally supported constitute the test set. By varying the threshold value, the Receiver Operating Characteristic (ROC) curve is drawn as shown in Fig. 5(a). The AUC value obtained is 0.8708. The performance of the classifier is further evaluated using standard statistical parameters such as Sensitivity (SEN), Specificity (SPE), Accuracy (ACC), Precision (PRE), F1-score (F1), and Matthews Correlation Coefficient (MCC). The values obtained are plotted in Fig. 5(b).
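The listed metrics follow their standard textbook definitions, and the AUC can be read as the probability that a known association is scored above an unknown one. The sketch below is illustrative rather than the evaluation code used in the paper: the function names `metrics` and `roc_auc` and the toy scores are hypothetical, and degenerate cases with a zero denominator are not handled.

```python
import math


def metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion counts."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    pre = tp / (tp + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"SEN": sen, "SPE": spe, "ACC": acc, "PRE": pre, "F1": f1, "MCC": mcc}


def roc_auc(scores, known):
    """AUC as the fraction of (known, unknown) pairs ranked correctly."""
    pos = [s for pair, s in scores.items() if pair in known]
    neg = [s for pair, s in scores.items() if pair not in known]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictor scores; pairs a, c and e are known associations.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.3, "e": 0.2, "f": 0.1}
toy_known = {"a", "c", "e"}

m = metrics(tp=2, fp=1, tn=2, fn=1)  # counts at a threshold of 0.5
auc = roc_auc(toy_scores, toy_known)
print(m, auc)
```

The pairwise form of the AUC avoids integrating the ROC curve explicitly and gives the same value as the trapezoidal area when ties are counted as one half.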

3.3. Comparison with existing approaches


A performance comparison of our method with other state-of-the-art methods in the
literature is conducted based on LOOCV and AUC values. A gold standard dataset,
prepared from lncRNADisease database and used in many models for performance
evaluation, is downloaded.29 This dataset consists of 292 lncRNA–disease associa-
tions of 118 lncRNAs and 167 diseases. We compared our model with three other
models, FMLNCSIM, IRWRLDA, and the model by Ping et al. The AUC values
obtained are summarized in Table 2. Our approach showed better performance in
terms of AUC values than the other three.

Table 2. Comparison with existing methods.

Models             AUC       Reference
Proposed method    0.8857    -
Ping et al.        0.8535    29
FMLNCSIM           0.8266    30
IRWRLDA            0.7242    31


3.4. Case studies


To verify the effectiveness of our prediction model, we investigated the lncRNA associations of three common diseases: Alzheimer's disease (AD), breast cancer, and lung cancer. AD is the most common type of dementia and is currently incurable.32 Biological evidence suggests that there exists a strong association between altered lncRNA expression patterns and AD.33 Breast cancer is one of the most aggressive and frequent diseases in females, and its mechanism is largely unknown to date. Recent clinical experiments strongly suggest the involvement of lncRNAs and miRNAs in breast cancer.42 Lung cancer is another cancer type with a very high mortality rate due to late diagnosis.2,43 LncRNAs are found to be the most common biomarkers in lung cancer.43 The lncRNA association results of these three diseases obtained by our study are summarized in Table 3. It is clear that the top five lncRNA associations of these three diseases hold ranks below 30 in our experiments. All the associations in Table 3 are novel and were not part of the training data.

Table 3. LncRNA associations of AD, breast cancer, and lung cancer, ranked below 30 by the proposed model. All associations shown are experimentally proven.

LncRNA        Rank    Evidence (database)
AD
GAS5          8       MNDR
PCA3          21      MNDR
XIST          10      MNDR
CASC2         15      MNDR
BANCR         20      MNDR
Breast cancer
ASBEL         29      Lnc2Cancer
LUNAR1        25      Lnc2Cancer
FENDRR        24      MNDR, Lnc2Cancer
GDNF-AS1      13      MNDR
FEZF1-AS1     16      Lnc2Cancer
Lung cancer
SNHG12        23      MNDR, Lnc2Cancer
BCAR4         18      MNDR, Lnc2Cancer
MIAT          7       MNDR, Lnc2Cancer
NEAT1         27      MNDR, Lnc2Cancer
TUSC7         12      MNDR

3.5. LncRNA–pathway associations

As a computational initiative to associate lncRNAs with pathways, the disease associations of all lncRNAs under study are considered. As described in Sec. 2.2.5, the enrichment of gene sets associated with each lncRNA resulted in a number of pathways. Since lncRNA–pathway association research has not yet matured, we could validate only a few of them. Therefore, we confine our study to four well-known lncRNAs (HOTAIR, TUG1, NEAT1, MALAT1) that have the largest number of disease associations in the literature. The top ten pathway associations based on p-value, taken from Enrichr for HOTAIR, TUG1, NEAT1, and MALAT1, are provided in Supplementary Figs. 4–7, respectively.

Table 4. LncRNA–pathway associations identified by the model for four popular lncRNAs.

#     LncRNA    Pathway (KEGG 2016)    p-value        Reference
1     HOTAIR    hsa05223               7.611e-14      34
2     HOTAIR    hsa05219               7.032e-12      34
3     HOTAIR    hsa05212               8.343e-11      35
4     HOTAIR    hsa05214               0.000005894    not in literature
5     HOTAIR    hsa04010               0.000008241    not in literature
6     HOTAIR    hsa05218               1.214e-10      36
7     TUG1      hsa05211               0.000002843    37
8     TUG1      hsa05223               0.000007429    38
9     TUG1      hsa04151               0.000009658    not in literature
10    TUG1      hsa05223               0.000075421    not in literature
11    NEAT1     hsa05214               0.000003868    39
12    MALAT1    hsa05212               2.126e-13      40
13    MALAT1    hsa05218               3.349e-13      41
14    MALAT1    hsa05213               5.268e-12      not in literature
15    MALAT1    hsa05215               8.652e-10      not in literature
16    MALAT1    hsa04151               0.00000615     not in literature
The prominent associations of these four lncRNAs identified by the model are summed up in Table 4. The table shows the p-values of the enriched genes associated with the lncRNAs. The last column highlights the available literature confirmation.

4. Discussion
This section highlights the influence of the Association Index on the prediction results and the inferences from the identification of lncRNA–pathway associations.

4.1. Effectiveness of Association Index

The links in the LDHIN are mappings of real-world biological associations between lncRNAs and diseases. The meta-paths are connections of two or more links. Different meta-paths have different effects on the connecting objects depending upon their semantics. Therefore, the influence of meta-paths on the objects connected by them must be taken into consideration before constructing a meta-path-based HIN model.
The parameter Association Index was derived in such a way that it simultaneously reflects the total number and the influence of meta-paths in any lncRNA–disease pair. Association Index estimates the influence of meta-paths by measuring in how many ways an lncRNA–disease pair is actually connected out of the total possible ways of interconnection. The existence of a larger number of influential meta-paths in an lncRNA–disease pair signifies a stronger biological relationship in the pair.


4.2. LncRNA–pathway associations


Further analysis of the lncRNA–pathway associations in Table 4 that are not reported
in the literature revealed that biological evidence exists for mutual
associations between pathways connected through common lncRNAs. For instance,
our study associated the lncRNA HOTAIR with the pathways hsa05214-glioma
and hsa04010-MAPK signaling pathway (Rows 4 and 5 of Table 4). No direct
biological evidence exists for either of these associations. However, biological studies
suggest that dysregulations in the MAPK signaling pathway cause glioma.44 Similarly,
we identified that the lncRNAs TUG1 and MALAT1 are both associated with the pathway
hsa04151-PI3K-Akt signaling pathway (Rows 9 and 16). We also identified associations
of TUG1 with hsa05223-Lung cancer (Row 10) and of MALAT1 with hsa05215-prostate
cancer (Row 15). The PI3K-Akt signaling pathway has been shown to play key
roles in endometrial cancer and prostate cancer.45
Thus, we encountered many instances where the direct lncRNA–pathway associations
remain unconfirmed while the influence or interaction between the
pathways associated through common lncRNAs is confirmed. Further
biological exploration is required to verify whether these common lncRNAs play
some part in determining the corresponding pathway interrelationships.
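
The grouping of pathways through common lncRNAs discussed above can be sketched in a few lines of Python. The association list contains only the six lncRNA–pathway pairs quoted in this section; the function name pathway_pairs_by_lncrna is hypothetical, and the sketch is not the procedure used to build Table 4.

```python
from collections import defaultdict
from itertools import combinations

# The lncRNA-pathway associations quoted in this section (KEGG identifiers).
predicted = [
    ("HOTAIR", "hsa05214"),  # glioma
    ("HOTAIR", "hsa04010"),  # MAPK signaling pathway
    ("TUG1", "hsa04151"),    # PI3K-Akt signaling pathway
    ("TUG1", "hsa05223"),
    ("MALAT1", "hsa04151"),  # PI3K-Akt signaling pathway
    ("MALAT1", "hsa05215"),  # prostate cancer
]


def pathway_pairs_by_lncrna(associations):
    """Map each pathway pair to the lncRNAs associated with both pathways."""
    by_lncrna = defaultdict(set)
    for lncrna, pathway in associations:
        by_lncrna[lncrna].add(pathway)

    pairs = defaultdict(set)
    for lncrna, pathways in by_lncrna.items():
        # Sorting gives every pair a canonical (smaller, larger) order.
        for first, second in combinations(sorted(pathways), 2):
            pairs[(first, second)].add(lncrna)
    return dict(pairs)


pairs = pathway_pairs_by_lncrna(predicted)
for (first, second), lncrnas in sorted(pairs.items()):
    print(first, second, sorted(lncrnas))
```

Each resulting pathway pair is a candidate for the literature check described in this section, for example the pair hsa04010 and hsa05214 linked through HOTAIR.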

5. Conclusion
Growing evidence from recent experiments suggests that lncRNAs are key modulators
in a variety of biological and pathological processes, including metabolism,
gene regulation, and genomic imprinting. Clinical research identifies distinct types of
mutations, dysregulations, and aberrant expressions of lncRNAs as vital reasons
behind several human diseases.
In this work, we propose a new model using HIN principles to represent the
interactions of lncRNAs with diseases and pathways. The meta-path-based parameter
Association Index proved to be effective in determining lncRNA–disease
associations when provided as features for an SVM classifier. The results also
demonstrate the usefulness of the proposed model for representing biomolecular
associations and unearthing novel interactions.
Another contribution of the proposed work is the prediction of lncRNA–pathway
associations, a field that requires further research attention. We identified novel
lncRNA–pathway associations through disease interactions and observed that
pathways associated with common lncRNAs hold important biological
interrelationships. Further investigation is required to confirm whether such lncRNAs
have any influence in determining the interdependence among pathways.

References
1. Uszczynska-Ratajczak B, Lagarde J, Frankish A, Guigó R, Johnson R, Towards a
complete map of the human long non-coding RNA transcriptome, Nat Rev Genet
19(9):535–548, 2018.

2. Ding L, Wang M, Sun D, Li A, TPGLDA: Novel prediction of associations between
lncRNAs and diseases via lncRNA-disease-gene tripartite graph, Sci Rep 8(1):1065, 2018.
3. Zhang J, Zhang Z, Chen Z, Deng L, Integrating multiple heterogeneous networks for
novel lncRNA-disease association inference, IEEE/ACM Trans Comput Biol Bioinform
16(2):396–406, 2017.
4. Li S, Li B, Zheng Y, Li M, Shi L, Pu X, Exploring functions of long noncoding RNAs
across multiple cancers through co-expression network, Sci Rep 7(1):754, 2017.
5. Gu C, Liao B, Li X, Cai L, Li Z, Li K, Yang J, Global network random walk for predicting
potential human lncRNA-disease associations, Sci Rep 7(1):12442, 2017.
6. Lu C, Yang M, Luo F, Wu F-X, Li M, Pan Y, Li Y, Wang J, Prediction of lncRNA-disease
associations based on inductive matrix completion, Bioinformatics 1:8, 2018.
7. Chen X, KATZLDA: KATZ measure for the lncRNA-disease association prediction, Sci
Rep 5:16840, 2015.
8. Liu M-X, Chen X, Chen G, Cui Q-H, Yan G-Y, A computational framework to infer
human disease-associated long noncoding RNAs, PloS One 9(1):e84408, 2014.
9. Chen X, Yan G-Y, Novel human lncRNA–disease association inference based on lncRNA
expression profiles, Bioinformatics 29(20):2617–2624, 2013.
10. Chen X, Yan CC, Luo C, Ji W, Zhang Y, Dai Q, Constructing lncRNA functional
similarity network based on lncRNA-disease associations and disease semantic similarity,
Sci Rep 5:11338, 2015.
11. Sun J, Shi H, Wang Z, Zhang C, Liu L, Wang L, He W, Hao D, Liu S, Zhou M, Inferring
novel lncRNA–disease associations based on a random walk model of a lncRNA functional
similarity network, Mol BioSyst 10(8):2074–2081, 2014.
12. Chen X, You Z, Yan G, Gong D, IRWRLDA: Improved random walk with restart for
lncRNA-disease association prediction, Oncotarget 7:57919–57931, 2016.
13. Yao Q, Wu L, Li J, Guang Yang L, Sun Y, Li Z, He S, Feng F, Li H, Li Y, Global
prioritizing disease candidate lncRNAs via a multi-level composite network, Sci Rep
7:39516, 2017.
14. Liu Y, Chance MR, Pathway analyses and understanding disease associations, Curr
Genet Med Rep 1(4):230–238, 2013.
15. Li Y, Agarwal P, A pathway-based view of human diseases and disease relationships, PloS
One 4(2):e4346, 2009.
16. Han J et al., LncRNAs2Pathways: Identifying the pathways influenced by a set of
lncRNAs of interest based on a global network propagation method, Sci Rep 7:46566,
2017.
17. Sun Y, Han J, Mining heterogeneous information networks: Principles and
methodologies, Synth Lect Data Min Knowl Discov 3(2):1–159, 2012.
18. Shi C, Li Y, Zhang J, Sun Y, Yu PS, A survey of heterogeneous information network
analysis, IEEE Trans Knowl Data Eng 29(1):17–37, 2017.
19. Chen G, Wang Z, Wang D, Qiu C, Liu M, Chen X, Zhang Q, Yan G, Cui Q,
LncRNADisease: A database for long-non-coding RNA-associated diseases, Nucleic Acids
Res 41(D1):D983–D986, 2012.
20. Schriml LM, Arze C, Nadendla S, Chang Y-WW, Mazaitis M, Felix V, Feng G, Kibbe
WA, Disease Ontology: A backbone for disease semantic integration, Nucleic Acids Res
40(D1):D940–D946, 2011.
21. Yu G, Wang L-G, Yan G-R, He Q-Y, DOSE: An R/Bioconductor package for disease
ontology semantic and enrichment analysis, Bioinformatics 31(4):608–609, 2014.
22. Cao B, Kong X, Philip SY, Collective prediction of multiple types of links in
heterogeneous information networks, 2014 IEEE Int Conf Data Mining (ICDM), pp. 50–59, 2014.
23. Kuhn M et al., Caret package, J Stat Softw 28(5):1–26, 2008.

24. Piñero J, Bravo A, Queralt-Rosinach N, Gutiérrez-Sacristán A, Deu-Pons J, Centeno E,
García-García J, Sanz F, Furlong LI, DisGeNET: A comprehensive platform integrating
information on human disease-associated genes and variants, Nucleic Acids Res 45(D1):
D833–D839, 2016.
25. Kuleshov MV et al., Enrichr: A comprehensive gene set enrichment analysis web server
2016 update, Nucleic Acids Res 44(W1):W90–W97, 2016.
26. Kanehisa M, Goto S, KEGG: Kyoto encyclopedia of genes and genomes, Nucleic Acids
Res 28(1):27–30, 2000.
27. Cui T, Zhang L, Huang Y, Yi Y, Tan P, Zhao Y, Hu Y, Xu L, Li E, Wang D, MNDR
v2.0: An updated resource of ncRNA–disease associations in mammals, Nucleic Acids Res
46(D1):D371–D374, 2017.
28. Ning S et al., Lnc2Cancer: A manually curated database of experimentally supported
lncRNAs associated with various human cancers, Nucleic Acids Res 44(D1):D980–D985,
2015.
29. Ping P, Wang L, Kuang L, Ye S, Iqbal MFB, Pei T, A novel method for lncRNA-disease
association prediction based on an lncRNA-disease association network, IEEE/ACM Trans
Comput Biol Bioinform 16(2):688–693, 2018.
30. Chen X, Huang Y, Wang X, You Z, Chan K, FMLNCSIM: Fuzzy measure-based lncRNA
functional similarity calculation model, Oncotarget 7:45948–58, 2016.
31. Chen X, You Z-H, Yan G-Y, Gong D-W, IRWRLDA: Improved random walk with
restart for lncRNA-disease association prediction, Oncotarget 7(36):57919, 2016.
32. Wang J, Gu BJ, Masters CL, Wang Y-J, A systemic view of Alzheimer
disease: Insights from amyloid-β metabolism beyond the brain, Nat Rev Neurol
13(10):612, 2017.
33. Zhou M, Zhao H, Wang X, Sun J, Su J, Analysis of long noncoding RNAs highlights
region-specific altered expression patterns and diagnostic roles in Alzheimer's disease,
Brief Bioinform 20(2):598–608, 2018.
34. Hajjari M, Salavaty A, HOTAIR: An oncogenic long non-coding RNA in different
cancers, Cancer Biol Med 12(1):1, 2015.
35. Cai H et al., LncRNA HOTAIR acts as competing endogenous RNA to control the
expression of Notch3 via sponging miR-613 in pancreatic cancer, Oncotarget 8(20):32905,
2017.
36. Luan W, Li R, Liu L, Ni X, Shi Y, Xia Y, Wang J, Lu F, Xu B, Long non-coding RNA
HOTAIR acts as a competing endogenous RNA to promote malignant melanoma
progression by sponging miR-152-3p, Oncotarget 8(49):85401, 2017.
37. Li M et al., Long non-coding RNAs in renal cell carcinoma: A systematic review and
clinical implications, Oncotarget 8(29):48424, 2017.
38. He Q, Yang S, Gu X, Li M, Wang C, Wei F, Long noncoding RNA TUG1 facilitates
osteogenic differentiation of periodontal ligament stem cells via interacting with Lin28A,
Cell Death Dis 9(5):455, 2018.
39. Peng Z, Liu C, Wu M, New insights into long noncoding RNAs and their roles in glioma,
Mol Cancer 17(1):61, 2018.
40. Liu J-H, Chen G, Dang Y-W, Li C-J, Luo D-Z, Expression and prognostic significance of
lncRNA MALAT1 in pancreatic cancer tissues, Asian Pac J Cancer Prev
15(7):2971–2977, 2014.
41. Luan W, Li L, Shi Y, Bu X, Xia Y, Wang J, Djangmah HS, Liu X, You Y, Xu B,
Long non-coding RNA MALAT1 acts as a competing endogenous RNA to promote
malignant melanoma growth and metastasis by sponging miR-22, Oncotarget
7(39):63901, 2016.

42. Zhang G, Pian C, Chen Z, Zhang J, Xu M, Zhang L, Chen Y, Identification of
cancer-related miRNA-lncRNA biomarkers using a basic miRNA-lncRNA network, PloS One
13(5):e0196681, 2018.
43. Kunz M, Wolf B, Schulze H, Atlan D, Walles T, Walles H, Dandekar T, Non-coding
RNAs in lung cancer: Contribution of bioinformatics analysis to the development of non-
invasive diagnostic tools, Genes 8(1):8, 2016.
44. Nakada M, Kita D, Watanabe T, Hayashi Y, Teng L, Pyko IV, Hamada J-I, Aberrant
signaling pathways in glioma, Cancers 3(3):3242–3278, 2011.
45. Liu P, Cheng H, Roberts TM, Zhao JJ, Targeting the phosphoinositide 3-kinase pathway
in cancer, Nat Rev Drug Discov 8(8):627, 2009.
46. Zhao Y, Li H, Fang S, Kang Y, Wu W, Hao Y, Li Z, Bu D, Sun N, Zhang MQ, Chen R,
NONCODE 2016: An informative and valuable data source of long non-coding RNAs,
Nucleic Acids Res 44(D1):D203, 2016.

P. V. Sunil Kumar is currently pursuing a Ph.D. in Computer
Science and Engineering at the Department of Computer Science and
Engineering, National Institute of Technology Calicut, India. He
received his P.G. in Computer Science and Engineering from
College of Engineering Guindy, Anna University Chennai, India in
the year 2011. His areas of interest include Computational Biology,
Data Mining and Biological Networks.

G. Gopakumar received his Ph.D. in Bioinformatics from the
University of Kerala, India in 2013. He has been working as an
Assistant Professor in the Department of Computer Science and
Engineering, National Institute of Technology Calicut, India since
July 2010. His research interests include RNA Bioinformatics,
Biological Network Analysis and Data Mining. He is a member of
the IEEE as well as ACM.
