JBCB
Recent findings from biological experiments demonstrate that long non-coding RNAs
(lncRNAs) are actively involved in critical cellular processes and are associated with
numerous diseases. Computational prediction of lncRNA–disease associations therefore draws
considerable research attention. This paper proposes a machine learning model that predicts
lncRNA–disease associations using a Heterogeneous Information Network (HIN) of lncRNAs
and diseases. A Support Vector Machine classifier is developed using the feature set extracted
from a meta-path-based parameter, the Association Index, derived from the HIN. The performance
of the model is validated using standard statistical metrics; it achieves an AUC value of
0.87, which is better than the existing methods in the literature. Results are further validated
against the recent literature, and many of the predicted lncRNA–disease associations are
confirmed to exist. This paper also proposes an HIN-based methodology to associate
lncRNAs with pathways in which they may have biological influence. A case study on the
pathway associations of four well-known lncRNAs (HOTAIR, TUG1, NEAT1, and
MALAT1) has been conducted. It is observed that the same lncRNA is often
associated with more than one biologically related pathway. Further exploration is needed to
substantiate whether such lncRNAs have any role in determining the pathway interplay. The
script and sample data for the model construction are freely available at
http://bdbl.nitc.ac.in/LncDisPath/index.html.
1. Introduction
Accumulating evidence identifies long non-coding RNAs (lncRNAs) as active players in
a variety of biological mechanisms and numerous diseases. An RNA molecule with
very low protein-coding potential is known as a non-coding RNA (ncRNA), and ncRNAs longer
than 200 nucleotides are lncRNAs.1 According to recent studies, lncRNAs are actively involved in
processes such as chromatin modification, transcriptional, post-transcriptional and
P. V. Sunil Kumar & G. Gopakumar
Inferring disease and pathway associations of LncRNAs
2.1. Preliminaries
The formal definitions related to HINs that are used in the rest of the paper are
given here.
Heterogeneous Information Network17,18:
An Information Network is defined as a directed graph $G = (V, E)$ with an object-type
mapping function $\phi : V \to T$ and a link-type mapping function $\psi : E \to R$,
where each object $v \in V$ belongs to one particular object type $\phi(v) \in T$, each link
$e \in E$ belongs to a particular relation $\psi(e) \in R$, and if two links belong to the same
relation type, the two links share the same starting object type as well as the same ending
object type.
The information network is called an HIN if the number of types of relations $|R| > 1$
or the number of types of objects $|T| > 1$.
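The definition above can be illustrated with a minimal sketch. The object names (`lnc1`, `dis1`, ...) and relation labels (`LL`, `LD`) are hypothetical toy values, not data from the paper, and the two helper functions are assumed names for this illustration only. The first checks that every relation links a single pair of object types; the second tests the HIN condition on the number of object and relation types.

```python
# Toy typed graph; all names and edges here are hypothetical illustration data.
node_type = {"lnc1": "lncRNA", "lnc2": "lncRNA", "dis1": "disease"}
edges = [
    ("lnc1", "lnc2", "LL"),  # lncRNA-lncRNA link
    ("lnc1", "dis1", "LD"),  # lncRNA-disease link
    ("lnc2", "dis1", "LD"),
]


def relation_signatures(node_type, edges):
    """Map each relation to its (start type, end type) pair.

    Raises ValueError if two links of the same relation connect
    different object types, which the definition forbids.
    """
    signatures = {}
    for source, target, relation in edges:
        pair = (node_type[source], node_type[target])
        if signatures.setdefault(relation, pair) != pair:
            raise ValueError(f"relation {relation!r} links inconsistent object types")
    return signatures


def is_heterogeneous(node_type, edges):
    """An information network is an HIN if |T| > 1 or |R| > 1."""
    object_types = set(node_type.values())
    relation_types = {relation for _, _, relation in edges}
    return len(object_types) > 1 or len(relation_types) > 1


signatures = relation_signatures(node_type, edges)
hin = is_heterogeneous(node_type, edges)
```

Here `signatures` plays the role of the edge set of the network schema: one entry per relation, defined over object types rather than individual objects.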
Network Schema17,18:
The network schema, denoted as $S_G = (T, R)$, is a meta-template for an information
network $G = (V, E)$ with the object-type mapping $\phi : V \to T$ and the link-type
mapping $\psi : E \to R$, which is a directed graph defined over object types $T$, with
edges as relations from $R$. The network schema of an HIN specifies type constraints
on the sets of objects and relationships among the objects.
Meta-Path17,18:
A meta-path $P$ is a path defined on a schema $S_G = (T, R)$, and is denoted in the form
$T_1 \xrightarrow{R_1} T_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} T_{l+1}$, which defines
a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ on object types
$T_1, T_2, \ldots, T_{l+1}$, where $\circ$ denotes the composition operator on relations.
Fig. 1. The workflow of the proposed method consists of five phases. The novel lncRNA–disease
associations and lncRNA–pathway associations are the end results.
Fig. 2. Cumulative frequency distribution of links between lncRNAs and diseases. For lncRNA links, the
cumulative frequency retained stability at a similarity score of 0.5 and for disease links at 0.4.
applied to regularize the number of links between lncRNAs and diseases. The cumulative
frequency distributions of the number of links for lncRNAs and diseases at
various similarity cutoffs are depicted in Figs. 2(a) and 2(b), respectively. It can be
observed that the number of links among lncRNAs and among diseases stabilizes at
similarity scores of 0.5 and 0.4, respectively. Therefore, in the LDHIN, an edge is
drawn between a pair of lncRNAs if the functional similarity score is equal to or above
0.5, and between a pair of diseases if the score is 0.4 or above.
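The edge rule stated above can be sketched as a simple filter. The cutoffs 0.5 and 0.4 come from the text; the similarity scores, object names, and the helper `similarity_edges` are hypothetical and only serve to show the thresholding step.

```python
LNCRNA_CUTOFF = 0.5   # cutoff for lncRNA-lncRNA links, as stated in the text
DISEASE_CUTOFF = 0.4  # cutoff for disease-disease links, as stated in the text


def similarity_edges(scores, cutoff):
    """Return the pairs whose similarity score is equal to or above the cutoff."""
    return sorted(pair for pair, score in scores.items() if score >= cutoff)


# Hypothetical similarity scores, for illustration only.
lncrna_similarity = {
    ("lnc1", "lnc2"): 0.72,
    ("lnc1", "lnc3"): 0.50,  # exactly at the cutoff, so the edge is drawn
    ("lnc2", "lnc3"): 0.31,
}
disease_similarity = {
    ("dis1", "dis2"): 0.40,  # exactly at the cutoff, so the edge is drawn
    ("dis1", "dis3"): 0.39,
}

ll_edges = similarity_edges(lncrna_similarity, LNCRNA_CUTOFF)
dd_edges = similarity_edges(disease_similarity, DISEASE_CUTOFF)
```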
The details of objects and links in the LDHIN under study are presented in
Table 1 and a portion of LDHIN is shown in supplementary Fig. 1.
Table 1. Objects and links in the LDHIN.

              #Objects                       #Links
  LncRNA (L)    Diseases (D)       LD        LL        DD
     598            155           1115      32026     1777
three-dimensional matrix:
$$C[i, j, k] = \text{number of } P_k \text{ meta-paths between lncRNA}_i \text{ and disease}_j \text{ in LDHIN}.$$
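One possible way to obtain such counts is to multiply the adjacency matrices of the relations along each meta-path. This is a sketch under assumptions rather than the procedure used in the paper: the toy adjacency matrices, the three example meta-paths, and the helper names are hypothetical, and matrix products count walks, which may revisit an object on longer meta-paths.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Hypothetical toy LDHIN with 3 lncRNAs and 2 diseases.
LL = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # lncRNA-lncRNA links
LD = [[1, 0], [0, 0], [0, 1]]           # lncRNA-disease links
DD = [[0, 1], [1, 0]]                   # disease-disease links

ADJ = {("L", "L"): LL, ("L", "D"): LD, ("D", "L"): transpose(LD), ("D", "D"): DD}


def metapath_counts(metapath):
    """Count instances of a meta-path such as "LLD" between every end pair."""
    types = list(metapath)
    result = ADJ[(types[0], types[1])]
    for source, target in zip(types[1:], types[2:]):
        result = matmul(result, ADJ[(source, target)])
    return result


metapaths = ["LD", "LLD", "LDD"]  # example meta-paths P_1, P_2, P_3
counts = {p: metapath_counts(p) for p in metapaths}

# C[i][j][k]: number of P_k instances between lncRNA i and disease j.
C = [[[counts[p][i][j] for p in metapaths] for j in range(2)] for i in range(3)]
```

In this toy network, the second lncRNA has no direct disease link but reaches both diseases through its lncRNA neighbours, so only the `LLD` entry of its count vectors is non-zero.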
Path Index (I): Path Index measures the ratio of meta-paths of a certain type to all
possible meta-paths in an lncRNA–disease pair. This parameter is defined to estimate
the influence of a particular meta-path on a specific lncRNA–disease pair.
Extending to LDHIN, the three-dimensional path index matrix is obtained as
$$I[i, j, k] = \frac{C[i, j, k]}{\min(\deg(\text{lncRNA}_i), \deg(\text{disease}_j))},$$
where $\deg(x)$ represents the degree of an object $x$ in LDHIN. Isolated
objects ($\deg(x) = 0$) are excluded from the computation.
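The Path Index computation can be sketched as follows. The count values and degrees are hypothetical, and skipping every pair that contains an isolated object is one reading of the exclusion rule stated above.

```python
def path_index(C, deg_lncrna, deg_disease):
    """Compute I[i, j, k] = C[i, j, k] / min(deg(lncRNA_i), deg(disease_j)).

    Pairs involving an isolated object (degree 0) are left out of the result.
    """
    index = {}
    for i, row in enumerate(C):
        for j, counts in enumerate(row):
            denominator = min(deg_lncrna[i], deg_disease[j])
            if denominator == 0:
                continue  # isolated object: excluded from the computation
            index[(i, j)] = [count / denominator for count in counts]
    return index


# Hypothetical counts for 2 lncRNAs x 2 diseases x 3 meta-paths.
C = [
    [[1, 0, 2], [0, 4, 0]],
    [[0, 0, 0], [0, 0, 0]],  # second lncRNA is isolated
]
deg_lncrna = [2, 0]
deg_disease = [4, 1]

I = path_index(C, deg_lncrna, deg_disease)
```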
Generally in HIN analysis, the path length plays a vital role in defining the
semantics of meta-paths. Semantics expressed by very long meta-paths are considered
to be less significant in HIN-based problems.17 Therefore, the definition of Path
Index is modified to Association Index in such a way that the effect of meta-path
length is also taken into consideration.
Association Index (A): The Association Index (A) modifies I for an lncRNA–disease
pair depending on the length of the meta-path for which it is defined. While computing
the strength of association, each meta-path must be assigned a weight
inversely proportional to its length. To ensure this, an inhibiting factor is
introduced while modifying I to A. The inhibiting factor suppresses the strength of
the association as the meta-path length increases.
The Association Index for an lncRNA–disease pair is defined as the inhibiting factor ($\beta$)
times the Path Index. For LDHIN, A is the three-dimensional matrix:
$$A[i, j, k] = \beta \, I[i, j, k],$$
where $\beta = (\text{length of meta-path } P_k)^{-1}$.
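As a minimal sketch of this scaling step, each Path Index value can be multiplied by the reciprocal of the length of its meta-path. The input values, the meta-path lengths, and the function name are hypothetical; the dictionary layout keyed by (lncRNA, disease) pair is an assumption made for this illustration.

```python
def association_index(path_index, metapath_lengths):
    """Scale each Path Index value by the inhibiting factor 1 / length(P_k)."""
    return {
        pair: [value / length for value, length in zip(values, metapath_lengths)]
        for pair, values in path_index.items()
    }


# Hypothetical Path Index values for one lncRNA-disease pair and three meta-paths.
I = {(0, 0): [0.5, 0.0, 1.0]}
metapath_lengths = [1, 2, 2]  # e.g. one meta-path of length 1 and two of length 2

A = association_index(I, metapath_lengths)
```

With these toy values, the length-1 meta-path keeps its full weight, while the contribution of each length-2 meta-path is halved.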
Algorithm 1 returns the Association Index matrix (A[ ]) for the LDHIN. A pictorial
representation of the Association Index matrix for m lncRNAs, n diseases, and k
meta-paths is given in Supplementary Fig. 2.
Algorithm 1. ComputeAssociationIndex
Input:
G(V, E): the LDHIN
Output:
Association Index Matrix A[ ] for LDHIN
Initialize:
T = {lncRNA, disease}
Define φ : V → T : object type mapping of G
Fig. 3. Details of novel lncRNA–disease associations. (a) Statistics of novel associations: the
pie chart shows the novel associations validated using the databases MNDR and lncRNADisease.
The unconfirmed region represents the associations that could not be validated using the
literature. (b) Confirmation percentage in various rank groups: the percentage of confirmed
novel associations in each rank group.
Thus, the experiment uses 1115 associations among 598 lncRNAs and 155 diseases
(Table 1), together with 14 meta-paths. First, the Association Index matrix is prepared from
these data. When vectorized along the tube (meta-path) direction, the matrix yields
the Association Index value for an lncRNA–disease pair corresponding to each meta-
path. This acts as one feature attribute for that pair. Combining association indexes
for all meta-paths provides the complete feature vector with 14 attributes for that
particular pair. The process is repeated for all possible pairs to get the complete
feature set, which is diagrammatically given in Supplementary Fig. 3.
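The vectorization described above can be sketched as flattening the three-dimensional matrix so that each lncRNA–disease pair gets one row with one attribute per meta-path. The toy matrix below uses 3 meta-paths instead of 14, and its values and the function name are hypothetical.

```python
def feature_set(A):
    """Flatten an m x n x k Association Index matrix into one row per pair.

    Returns the list of (lncRNA index, disease index) pairs and, aligned
    with it, the list of k-dimensional feature vectors.
    """
    pairs, features = [], []
    for i, row in enumerate(A):
        for j, tube in enumerate(row):
            pairs.append((i, j))
            features.append(list(tube))  # one attribute per meta-path
    return pairs, features


# Hypothetical 2 x 2 x 3 Association Index matrix.
A = [
    [[0.5, 0.0, 0.5], [0.0, 2.0, 0.0]],
    [[0.0, 0.1, 0.0], [1.0, 0.0, 0.3]],
]

pairs, features = feature_set(A)
```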
Based on this feature set, an SVM classifier is modeled to perform the prediction
of novel lncRNA–disease associations. The implementation is done with Caret23 in R
with the "e1071" library to obtain the prediction probabilities of lncRNA–disease
associations. The prediction probability of each association is used as its predictor
score, and the associations are ranked by this score.
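The ranking step can be sketched independently of the classifier. The sketch below does not train an SVM; it assumes that prediction probabilities are already available, and the scores, pair names, and set of known associations are hypothetical. Dropping pairs that are already known before ranking is also an assumption of this illustration.

```python
def rank_novel(scores, known):
    """Rank candidate pairs by predictor score, highest first.

    Pairs already present in the set of known associations are dropped,
    so the ranking contains only candidate novel associations.
    """
    novel = [(pair, score) for pair, score in scores.items() if pair not in known]
    novel.sort(key=lambda item: item[1], reverse=True)
    return [(rank, pair, score) for rank, (pair, score) in enumerate(novel, start=1)]


# Hypothetical prediction probabilities from a trained classifier.
scores = {
    ("lnc1", "dis1"): 0.91,
    ("lnc1", "dis2"): 0.35,
    ("lnc2", "dis1"): 0.78,
    ("lnc2", "dis2"): 0.66,
}
known = {("lnc1", "dis1")}  # association already used for training

ranked = rank_novel(scores, known)
```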
3. Results
The model predicted 668 novel lncRNA–disease associations. In order to validate the
prediction results, we used two well-known curated open repositories, MNDR
(Mammal NcRNA Disease Repository)27 and Lnc2Cancer,28 which contain experimentally
verified high-quality interaction data of lncRNAs and diseases. Using
MNDR, 390 associations were validated, and a further 40 associations were confirmed with
Lnc2Cancer. We could not validate 238 of the predicted associations due to the lack
of evidence in the current literature. The statistics of the predictions are provided in
Fig. 3(a).
The validation repository MNDR is an integration of ncRNA–disease associations
from 10 resources including LncRNADisease. While constructing the model, we
considered the associations from LncRNADisease alone. Hence, during the model
construction and prediction process, the additional associations in MNDR are never seen
by the model. Once the prediction process is over, the novel associations are
filtered out and verified using MNDR. We termed the predictions "novel" only if
they are not present in LncRNADisease. Thus, as far as the prediction model is
concerned, they are new, and MNDR is qualified to validate them, as it is a
superset of LncRNADisease.
In order to assess the effectiveness and utility of the model, the prediction results
are divided into four groups based on their ranks (associations having top 25 ranks,
top 50 ranks, top 100 ranks, and top 200 ranks) and the percentage of confirmed
novel associations in each group is measured. It is observed that a high percentage of
novel associations is validated in the groups with top-ranked associations. Even
for the group of associations in the top 200 ranks, the percentage of confirmation is
above 75%. The results are shown in Fig. 3(b), which demonstrates the usefulness
of the proposed method in predicting lncRNA–disease associations from
LDHIN. The detailed list of predictions including the rank and reference (wherever
available) of each association is provided in Supplementary Table 1.
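The rank-group analysis can be sketched as a cumulative confirmation rate over the ranked list. The cutoffs 25, 50, 100, and 200 come from the text; the short list of confirmation flags and the smaller cutoffs used in the example are hypothetical, and the function name is an assumption.

```python
def confirmation_percentages(confirmed_flags, cutoffs=(25, 50, 100, 200)):
    """Percentage of confirmed associations among the top-k ranked predictions.

    `confirmed_flags` is ordered by rank: entry 0 is the top-ranked
    prediction, and each entry is True if that prediction was confirmed.
    """
    percentages = {}
    for k in cutoffs:
        top = confirmed_flags[:k]
        if top:
            percentages[k] = 100.0 * sum(top) / len(top)
    return percentages


# Hypothetical confirmation flags for ten ranked predictions.
flags = [True, True, True, False, True, False, True, False, False, True]

percentages = confirmation_percentages(flags, cutoffs=(2, 5, 10))
```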
In order to enhance the credibility of the unconfirmed cases, an investigation is
conducted to check whether the lncRNAs are expressed in tissues related to the
corresponding diseases. The NONCODE46 repository is used for this purpose. NONCODE
contains expression values of lncRNA transcripts in various tissues, measured in
FPKM (Fragments Per Kilobase of transcript per Million mapped reads). Out of the
238 unconfirmed lncRNA–disease associations in Supplementary Table 1, expression
details of 164 lncRNAs are found in NONCODE. Among these 164 lncRNAs, 128 are
found to be expressed in tissues related to the corresponding diseases.
Fig. 4. Variation of the number of meta-paths and of accuracy with meta-path length. The increment in
accuracy justifies the number of paths to be processed up to a path length of 4. Afterwards, the increment
in accuracy is almost linear or settles to a constant value, whereas that of the number of meta-paths is
close to exponential.
Fig. 5. Performance evaluation of the prediction model. (a) Shows the ROC curve and AUC value.
(b) Shows the values obtained for different performance metrics for four different thresholds.
well-known lncRNAs (HOTAIR, TUG1, NEAT1, MALAT1) that have the largest
number of disease associations in the literature. The top ten pathway associations
based on p-value, taken from Enrichr for HOTAIR, TUG1, NEAT1, and MALAT1,
are provided in Supplementary Figs. 4–7, respectively.
The prominent associations of these four lncRNAs identified by the model are
summed up in Table 4. The table shows the p-values of the enriched genes associated
with the lncRNAs. The last column highlights the available literature confirmation.
4. Discussion
This section highlights the influence of Association Index on the prediction results
and the inferences from the identification of lncRNA–pathway associations.
5. Conclusion
Growing evidence from recent experiments suggests that lncRNAs are key modulators
in a variety of biological and pathological processes including metabolism,
gene regulation, and genomic imprinting. Clinical research identifies distinct types of
mutations, dysregulations, and aberrant expressions of lncRNAs as vital reasons
behind several human diseases.
In this work, we propose a new model using HIN principles to represent the
interactions of lncRNAs with diseases and pathways. The meta-path-based parameter,
Association Index, proves effective in determining lncRNA–disease
associations when provided as features for an SVM classifier. The results also
demonstrate the usefulness of the proposed model for the representation of biomolecular
associations and for unearthing novel interactions.
Another contribution of the proposed work is the prediction of lncRNA–pathway
associations, a field that requires further research attention. We identified novel
lncRNA–pathway associations through disease interactions and observed that
pathways associated with common lncRNAs hold important biological
interrelationships. Further investigation is required to confirm whether such lncRNAs
have any influence in determining the interdependence among pathways.
References
1. Uszczynska-Ratajczak B, Lagarde J, Frankish A, Guigó R, Johnson R, Towards a
complete map of the human long non-coding RNA transcriptome, Nat Rev Genet
19(9):535–548, 2018.