A Heterogeneous Information Network Model For Long Non-Coding RNA Function Prediction
Abstract—Exciting information on the functional roles played by long non-coding RNA (lncRNA) has drawn substantial research
attention in recent years. With the advent of techniques such as RNA-Seq, thousands of lncRNAs are identified in very short time spans.
However, due to the poor annotation rate, only a few of them are functionally characterised. Wet-lab experiments to elucidate
lncRNA functions are challenging, slow and sometimes prohibitively expensive. This work addresses the crucial
problem of developing computational methods to predict lncRNA functions. The model presented here predicts the functions of
lncRNAs by using a meta-path based measure, AvgSim, on a Heterogeneous Information Network (HIN). The network is
constructed from existing protein and function association data of lncRNAs, lncRNA co-expression data and protein-protein interaction
data. Out of the 2,758 lncRNAs considered for the experiment, the proposed method predicts possible functions for 2,695 lncRNAs with
an accuracy of 73.68 percent and is found to perform better than other state-of-the-art approaches on an independent test set. A case
study of two well-known lncRNAs (HOTAIR and H19) is conducted and the associated functions are identified. The results were
validated using experimental evidence from the literature. The script and data used for the implementation of the model are freely
available at: http://bdbl.nitc.ac.in/LncFunPred/index.html.
Index Terms—LncRNA, heterogeneous information network, meta-path, classification, random forest, AvgSim
Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY CALICUT. Downloaded on March 02,2022 at 11:22:49 UTC from IEEE Xplore. Restrictions apply.
256 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 19, NO. 1, JANUARY/FEBRUARY 2022
[...] protein interaction data into the co-expression network used in Liao et al. They annotated 1,625 lncRNAs with functional characteristics. However, none of these methods used Next Generation Sequencing (NGS) data for processing. Later on, Yun Xiao et al. [11] used a Bayesian network of lncRNAs and proteins built from the transcript profile made from RNA-Seq data. This method assigned functions to 762 lncRNAs by functional enrichment of highly interconnected proteins. To annotate ncRNA functions, Feng Chen and Yi-Ping Phoebe Chen [12] applied bridging rule mining. They used two different measures to explore the relationship between ncRNAs, one a linearity measure and the other a non-linearity measure. Then, based on the association rule, functions of ncRNAs are speculated. However, this method is not exclusively devoted to lncRNAs. Qinghua Jiang et al. [13] performed a hypergeometric test on lncRNA-protein co-expression data from RNA-Seq to predict lncRNA function. They mapped 9,625 lncRNAs to their functions as well as pathways.

All the network-based methods found in current literature focus on lncRNA-protein interaction as a crucial metric to devise a model that predicts the functions of lncRNAs. Hence these methods can predict functions only for lncRNAs that have known protein associations. In order to exploit the full advantage of a network-based model, the lncRNA-lncRNA links are also to be taken into account. The proposed work considers lncRNA-lncRNA links as well and predicts the functions of lncRNAs even in the absence of known protein associations.

Here, the problem of predicting functions of lncRNAs which do not have known protein associations is addressed by the incorporation of lncRNA co-expression data. A Heterogeneous Information Network (HIN) is built with lncRNA, protein and function as node types and (a) protein-protein interaction, (b) protein-function association, (c) lncRNA-protein interaction, (d) lncRNA co-expression, and (e) known lncRNA-function association as edge types. The application of the method AvgSim to quantify the degree of relatedness of a pair of nodes in the HIN is inspired by the work of J. Yang et al. [14], who adopted AvgSim from the proposal by D. Xiao et al. [15].

LncRNA-function pairs can be connected through various paths, termed meta-paths in HIN terminology. Each meta-path carries a semantic meaning, which needs to be interpreted properly. If an lncRNA is connected to a function through a protein node, the path is 'lncRNA-protein-function' and its semantic meaning is that the function is performed by the lncRNA through protein interaction. Similarly, the meta-path 'lncRNA-lncRNA-function' is interpreted as the function being performed by lncRNA molecules in combination. Meta-paths and their semantic meanings are discussed in more detail in Section 2.2.

After the HIN construction, the functionally relevant meta-paths are extracted through a correlation analysis. The relatedness measure, AvgSim, is computed along such relevant meta-paths. The AvgSim scores along the various meta-paths are combined to form the features for a Random Forest classifier, which performs the prediction of lncRNA functions.

In contrast to the existing methods, which use only lncRNA-protein interaction, the proposed work exploits meta-path based information of the HIN for prediction. Possible functions are assigned to a total of 2,695 lncRNAs by the method. The correctness is verified statistically by cross-validation and reconfirmed through mining recent literature. A case study of two well-studied lncRNAs is also conducted and the results are validated.

The rest of the paper is organised as follows: Section 2 deals with the input data, methodologies and algorithms used for the construction of the prediction model. Various statistical parameters used to validate the prediction performance of the model and the details of the classifier used are also outlined. Section 3 explains the results obtained. A detailed discussion on the implications of the results is provided in Section 4. The outcomes of a case study form the subject of Section 5. Section 6 provides the concluding remarks and future directions.

2 MATERIALS AND METHODS
This section describes the formulation of the prediction model and the derivation of the data set used to conduct the experiments. To begin with, the major HIN concepts used further in the discussion are formally defined.

Heterogeneous Information Network
An Information Network [17], [18] is defined as a directed graph $G = (V, E)$ with an object type mapping function $\phi : V \to A$ and a link type mapping function $\psi : E \to R$, where each object $v \in V$ belongs to one particular object type $\phi(v) \in A$, each link $e \in E$ belongs to a particular relation $\psi(e) \in R$, and if two links belong to the same relation type, the two links share the same starting object type as well as the same ending object type.

The information network is called a heterogeneous information network if the number of types of objects $|A| > 1$ or of relations $|R| > 1$.

Network Schema
The network schema [17], [18], denoted as $T_G = (A, R)$, is a meta-template for an information network $G = (V, E)$ with the object type mapping $\phi : V \to A$ and the link type mapping $\psi : E \to R$, which is a directed graph defined over object types $A$, with edges as relations from $R$. The network schema of a heterogeneous information network specifies type constraints on the sets of objects and relationships among the objects.

Meta-Path
A meta-path [17], [18], $P$, is a path defined on a schema $T_G = (A, R)$, and is denoted in the form

$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1},$$

which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between objects $A_1, A_2, \ldots, A_{l+1}$, where $\circ$ denotes the composition operator on relations.

2.1 Heterogeneous LncRNA-Protein-Function Network (HLPFN)
The Heterogeneous LncRNA-Protein-Function Network (HLPFN) is a heterogeneous interconnection of five different interaction networks constructed from protein interactions, lncRNA co-expression, lncRNA functional associations, lncRNA-protein interactions and protein-function associations. All relationships are kept as separate adjacency matrices to maintain the heterogeneity of nodes and edges. The various adjacency
V ET AL.: HETEROGENEOUS INFORMATION NETWORK MODEL FOR LONG NON-CODING RNA FUNCTION PREDICTION 257
matrices used to construct the HIN are shown in Fig. 1a. The construction of the adjacency matrices of the individual networks is described below.

2.1.1 LncRNA-Protein Interaction Network
LncRNA-protein interaction data is collected from NPInter 3.0 [19]. It contains interactions of ncRNAs with different kinds of biomolecules. The lncRNAs are filtered from ncRNA using NONCODE ID [20] and the interaction level is restricted to 'RNA-protein' to retrieve proteins. The interactions are restricted to 'Homo Sapiens'. The lncRNA-protein interaction network is represented as an adjacency matrix with row names as lncRNAs and column names as proteins. The edges are added based on the interaction data collected. The construction of the adjacency matrix $M_{LP}$ is as follows:

$$M_{LP}(i, j) = \begin{cases} 1, & \text{if LncRNA}_i \text{ interacts with Protein}_j \\ 0, & \text{otherwise.} \end{cases}$$

[...] contains ten functions for 1,961 lncRNA. This data is considered as the known associations between lncRNAs and functions. The matrix for known lncRNA-function association, $M_{LF}$, is defined as

$$M_{LF}(i, j) = \begin{cases} 1, & \text{if LncRNA}_i \text{ has Function}_j \\ 0, & \text{otherwise.} \end{cases}$$

2.1.5 Protein Function Association Network
The functions associated with proteins are obtained from the UniProt database [24], which consists of the mapping of protein molecules to 17,073 unique functional GO Terms. The protein-function association matrix is constructed from this data by the following equation:

$$M_{PF}(i, j) = \begin{cases} 1, & \text{if Protein}_i \text{ has Function}_j \\ 0, & \text{otherwise.} \end{cases}$$
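The indicator definitions of the adjacency matrices above all follow the same pattern: an entry is 1 when the corresponding pair appears in the source data and 0 otherwise. A minimal Python sketch of that pattern is given below. It is illustrative only and is not the authors' implementation (the paper reports using R); the function name `build_adjacency` and all identifiers in the toy data are hypothetical.

```python
def build_adjacency(row_ids, col_ids, pairs):
    """Return a binary matrix M where M[i][j] = 1 iff (row_ids[i], col_ids[j]) is in pairs."""
    row_index = {r: i for i, r in enumerate(row_ids)}
    col_index = {c: j for j, c in enumerate(col_ids)}
    matrix = [[0] * len(col_ids) for _ in row_ids]
    for r, c in pairs:
        # Pairs that mention an identifier outside the node sets are skipped.
        if r in row_index and c in col_index:
            matrix[row_index[r]][col_index[c]] = 1
    return matrix


# Toy data (hypothetical identifiers, not taken from NPInter or NONCODE).
lncrnas = ["lnc1", "lnc2"]
proteins = ["protA", "protB", "protC"]
interactions = [("lnc1", "protA"), ("lnc2", "protC"), ("lnc9", "protA")]

M_LP = build_adjacency(lncrnas, proteins, interactions)
print(M_LP)  # [[1, 0, 0], [0, 0, 1]]
```

The same helper could be reused for the lncRNA-function and protein-function matrices by changing the row and column identifier lists. Keeping one matrix per relation mirrors the paper's choice of storing each relationship separately.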
Fig. 1. Work flow of predicting lncRNA function. (a) Heterogeneous lncRNA-protein-function network construction. First, the homogeneous networks of
protein-protein interaction, lncRNA co-expression and lncRNA-function association are built from adjacency matrices of protein-protein interaction
data (STRING), lncRNA co-expression data (NONCODE v4.0) and lncRNA-function association data (Gene Ontology). The lncRNA-protein interaction
matrix (NPInter 3.0) connects the lncRNA co-expression and protein-protein interaction networks. They are connected to functions using known
lncRNA-function association data (NONCODE 4.0) and protein-function association data (UniProt). (b) The network schema of the heterogeneous
lncRNA-protein-function network, consisting of three node types (lncRNA, protein, and function) and the edges connecting them. Meta-paths are generated by
a depth-first search in the network schema with the given path length. (c) Computation of the AvgSim score along all the meta-paths. (d) The scores for
every lncRNA-function pair along the different meta-paths are arranged as a row vector, and a two-dimensional matrix is constructed by combining these
row vectors. This acts as the feature set to train the random forest classifier.
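The depth-first meta-path generation mentioned in the caption of Fig. 1b can be sketched as follows. This is an assumption-laden illustration rather than the authors' script. Path length is taken to be the number of node types in the path (so lpf has length 3). Relations are treated as traversable in both directions between lncRNA (l) and protein (p), and function (f) is treated as a terminal type that cannot appear mid-path. The names `SCHEMA` and `enumerate_meta_paths` are hypothetical.

```python
# Schema adjacency over node types: l = lncRNA, p = protein, f = function.
# Assumption: f has no outgoing edges, so it can only end a meta-path.
SCHEMA = {
    "l": ["l", "p", "f"],
    "p": ["l", "p", "f"],
    "f": [],
}


def enumerate_meta_paths(schema, source, target, length):
    """Depth-first search for type sequences with exactly `length` node types."""
    found = []

    def dfs(path):
        if len(path) == length:
            if path[-1] == target:
                found.append("".join(path))
            return
        for next_type in schema[path[-1]]:
            dfs(path + [next_type])

    dfs([source])
    return found


for n in (3, 4, 5):
    paths = enumerate_meta_paths(SCHEMA, "l", "f", n)
    print(n, len(paths), paths)
```

Under these assumptions the search yields two paths of length three (llf, lpf), four of length four (lllf, llpf, lplf, lppf) and eight of length five. This agrees with the path names and the count of eight length-five meta-paths quoted in the text.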
TABLE 1
Details of LncRNA-Protein-Function Network

          Type               Count
Objects   lncRNA             2,758
          protein            8,501
          function           1,416
Links     lncRNA-lncRNA      1,947,266
          lncRNA-function    15,689
          lncRNA-protein     28,317
          protein-function   13,658
          protein-protein    58,372
Fig. 3. Relevant meta-path selection: Correlation analysis for relevant meta-path selection with meta-path lengths 3 (dotted curve), 4 (dashed curve) and 5 (solid curve). The meta-paths lpf and llf have positive correlation with the known data set. The meta-paths of path length four, except lplf, have positive correlation. Three meta-paths of length five are negatively correlated with the known lf associations.

The correlation analysis of the meta-path vectors of lengths three, four, and five is shown in Fig. 3. It is clear that in the third iteration, there exist three negatively correlated paths among the eight possible meta-paths of length five. Hence the path length is fixed to be four. Thus, the relevant paths obtained are: lpf, llf, lllf, llpf and lppf.

2.3 AvgSim
To find the relatedness between two objects in an HIN, D. Xiao et al. [15] proposed a measure called AvgSim. The AvgSim value of two objects is the average of the reachable probability under the given meta-path and the reverse path. Given a meta-path $P$, the AvgSim between source object $s$ and target $t$ is given by

$$\mathrm{AvgSim}(s, t \mid P) = \frac{1}{2}\left[ RW(s, t \mid P) + RW(s, t \mid P^{-1}) \right],$$

where $RW(s, t \mid P)$ is the random walk from $s$ to $t$ along the path $P$. The equation can be expanded by decomposing the meta-path $P$. Assume $P$ is a composition of relations $R_1, R_2, \ldots, R_l$, then

$$RW(s, t \mid R_1, R_2, \ldots, R_l) = \frac{1}{|O(s \mid R_1)|} \sum_{i=1}^{|O(s \mid R_1)|} RW\big(O_i(s \mid R_1), t \mid R_2, \ldots, R_l\big)$$

and

$$RW(s, t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ are the same} \\ 0, & \text{otherwise,} \end{cases}$$

where $|O(s \mid R_1)|$ is the number of out-neighbours of $s$ based on relation $R_1$. If there is no out-neighbour for $s$ under $R_1$, then the relatedness value of $s$ and $t$ is 0.

[...] from Table 1). The AvgSim score along each relevant meta-path for every lncRNA-function pair is computed and represented as an $l \times f$ matrix. There is a separate matrix for each meta-path, as shown in Fig. 1c. For any meta-path $k$, the entry $(i, j)$ of its corresponding matrix represents the AvgSim score along the meta-path $k$ between lncRNA $i$ and function $j$. As explained in Section 2.2, there are five relevant meta-paths in the experiment. The feature set for the classifier is constructed by deriving another matrix from the individual meta-path matrices shown in Fig. 1c. The header column of this new matrix lists the lncRNA-function pairs and the header row lists the meta-paths. Each entry is the AvgSim score of the lncRNA-function pair in that row along the meta-path in that column. This matrix is used as the feature set to train the classifier and has the structure provided in Fig. 1d. There are two class labels, one representing the positive class, which indicates that the lncRNA and function are associated, and the other representing the negative class, which indicates that the lncRNA and function are not associated.

The lncRNA-function association data, taken from the NONCODE repository, provided 15,689 positive examples. Ideally, all the non-associated pairs must form negative examples. In such a scenario, the number of negative examples would be far larger than that of positive examples, which can lead to a skewed training set. To avoid that, we applied random sampling on the non-associated pairs and selected 11,835 negative examples, maintaining a 57:43 ratio of positive to negative examples. The classifier is implemented using the Caret [16] package in the R language.

2.5 Performance Evaluation Metrics
The prediction performance is validated by k-fold cross-validation. True Positive (TP) and True Negative (TN) represent the samples which are correctly predicted as positives and negatives respectively. False Positive (FP) and False Negative (FN) represent the samples which are wrongly predicted as positives and negatives respectively. They are obtained as follows:

The probability assigned by the classifier to each association is taken as the predictor score for that association. All pairs are sorted based on the predictor score. The number of known associations with a predictor score higher than a specific threshold is the number of True Positives, and the number of unknown associations with a predictor score lower than the threshold is the number of True Negatives. False Positives are the unknown associations with a predictor score above the threshold, whereas False Negatives are the known associations with a score below the threshold.

By varying the threshold value, Receiver Operating Characteristics (ROC) curves are drawn. The performance is further evaluated using statistical metrics such as accuracy, precision, recall and f-score. Another metric called coverage is used to compare performance with existing approaches. It is the proportion of lncRNAs annotated with functions to the total number of lncRNAs considered.
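To make the recursive random-walk definition of Section 2.3 concrete, the following Python sketch evaluates AvgSim on a toy network. It is an illustration under stated assumptions, not the authors' code. Each relation is stored as a hypothetical dictionary mapping a node to its list of out-neighbours. The reverse-path term is read here as the walk from t back to s along the inverted relations in reverse order, which is this sketch's interpretation of "the reverse path". All node names are invented.

```python
def random_walk(s, t, relations):
    """RW(s, t | R1..Rl): probability of reaching t from s along the relation sequence."""
    if not relations:
        return 1.0 if s == t else 0.0          # base case: empty path
    neighbours = relations[0].get(s, [])
    if not neighbours:
        return 0.0                             # no out-neighbour under R1
    total = sum(random_walk(n, t, relations[1:]) for n in neighbours)
    return total / len(neighbours)


def invert(relation):
    """Return the inverse relation (every edge reversed)."""
    inverse = {}
    for src, targets in relation.items():
        for dst in targets:
            inverse.setdefault(dst, []).append(src)
    return inverse


def avgsim(s, t, relations):
    """Average of the forward walk and the walk along the reversed meta-path."""
    forward = random_walk(s, t, relations)
    reverse_path = [invert(r) for r in reversed(relations)]
    backward = random_walk(t, s, reverse_path)
    return 0.5 * (forward + backward)


# Toy meta-path lpf: lncRNA -> protein -> function (hypothetical data).
LP = {"lnc1": ["pA", "pB"], "lnc2": ["pB"]}
PF = {"pA": ["f1"], "pB": ["f1", "f2"]}

print(avgsim("lnc1", "f1", [LP, PF]))  # 0.75
print(avgsim("lnc1", "f2", [LP, PF]))  # 0.375
```

For the pair (lnc1, f2) the forward walk gives 0.25 and the reverse walk gives 0.5, so the two directions need not agree; averaging them is what makes the measure symmetric in the two endpoints. The plain recursion is exponential in the worst case, so a real implementation on a network of the size in Table 1 would more likely use matrix products.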
TABLE 2
Performance Measures

Fig. 4. Comparison of different classification models.

2.6 Choice of the Classifier
Before performing the actual prediction process, the model that best suits the input data and yields the best result has been determined. This is achieved by comparing the performances of different classification models for the selected data set and experimental set-up. The candidate models were: (a) Artificial Neural Network (ANN), (b) Gradient-Boosting Machine (GBM), (c) Generalised Linear Models (GLM), (d) Random Forest (RF) and (e) Support Vector Machine (SVM). Each model is implemented in the R language with the default parameters and is evaluated using 10-fold cross-validation. The results are summarised in Fig. 4. The [...]

3 RESULTS
The model predicted new functions for 2,695 lncRNAs with 73.68 percent accuracy. The values of the other performance measures are given in Table 2. Some of the lncRNAs were predicted to have multiple functions. The function prediction results show that lncRNAs are mostly involved in biological process rather than cellular functions or molecular functions. The method was able to predict functions of many lncRNAs which were previously unknown.

The functions are taken from the GO consortium. The GO ontology follows parent-child relationships among them based on certain functional categories called GOSlim, GOBasic, etc. Here the functional GO Terms are classified based on their GOSlim category to understand the various kinds of functions performed by lncRNAs. The category-wise list is shown in Table 3. The table shows a list of functional categories with the count of GO terms belonging to each category. It shows that the important functions [...]

TABLE 3
Important Functions of LncRNAs
3.1 Comparison With Other Approaches
A comparison of our model with two state-of-the-art models, LncRNA2Function [13] and NeuraNetL2GO [42], is done using a separate test set, lncRNA2GO-55, provided for free download by Zhang et al. in NeuraNetL2GO [42]. The performance is compared using the metrics of F-score (F), precision (PRE) and recall (REC), and our model is found to perform the best. Fig. 5 shows the results of the performance comparison. In terms of coverage (the ratio of annotated lncRNAs to total lncRNAs) as well, the proposed method has shown satisfactory performance (Table 4).

4 DISCUSSION
This section is subdivided into two. In the first part, the various measures used in selecting meta-paths are described. In the second part, the impact of the lncRNA co-expression sub-network on the final result is explained.

[...] llpf, lllf (from Section 2.2). The RF classifier ranks the features based on their importance and leverage in obtaining the final prediction result. This ranking is graphically displayed in Fig. 6. The paths lllf, llpf and llf are ranked high. In an experimental set-up where the number of direct lncRNA-protein interactions is limited, the incorporation of the meta-path based approach helped to reveal functions of lncRNAs through the paths lllf and llf by incorporating co-expression networks of lncRNAs.

4.1.2 Justification of Relevant Path Selection
This section justifies the correctness of the relevant meta-path selection process described in Section 2.2. The entire explanation here is based on Fig. 7. It may be recalled from Section 2.2 that the meta-paths lpf, llf, lllf, llpf and lppf were determined to be relevant and lplf to be irrelevant.

Fig. 7. Verification of relevant path selection with accuracy. The triangles represent the accuracy when all paths are considered. Circles represent the accuracy when the individual paths are removed. Diamonds represent the accuracy when a particular path is removed along with lplf. The accuracy increased when path lplf was removed, indicating that it is irrelevant. The accuracy is improved when the less correlated lpf, llf and lppf are removed individually. However, when they are removed along with the irrelevant lplf, the accuracy decreases. This means that the prediction performance is enhanced when only lplf is removed.

TABLE 4
Comparison With Existing Methods for an Independent Test of 55 lncRNAs

Model              Annotated   Coverage (%)
LncRNA2Function    18          32.7
NeuraNetL2GO       50          90.9
Proposed Model     48          87.3
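The coverage values in Table 4 follow directly from the definition of coverage as the share of test lncRNAs that receive at least one annotation. The short Python check below recomputes them from the annotated counts and the test-set size of 55. The function name `coverage_percent` and the dictionary `TABLE4_ANNOTATED` are hypothetical names introduced only for this illustration.

```python
def coverage_percent(annotated, total):
    """Coverage = annotated lncRNAs / total lncRNAs considered, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * annotated / total


# Annotated counts as listed in Table 4; the test set has 55 lncRNAs.
TABLE4_ANNOTATED = {
    "LncRNA2Function": 18,
    "NeuraNetL2GO": 50,
    "Proposed Model": 48,
}

for model, annotated in TABLE4_ANNOTATED.items():
    print(f"{model}: {coverage_percent(annotated, 55):.1f}%")
```

Rounded to one decimal place, this reproduces the 32.7, 90.9 and 87.3 percent figures in the table.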
TABLE 5
Predicted Functions of HOTAIR

TABLE 6
Predicted Functions of H19

5 CASE STUDY
In order to further demonstrate the predictive performance of the proposed model, a case study of two well-known lncRNAs is conducted. The lncRNAs selected are HOTAIR and H19, with respective NONCODE identifiers NONHSAG011264 and NONHSAG007409. The case study primarily focuses on the predicted functional associations of these two lncRNAs that are clinically established. However, it also covers a few of the most prominent functional associations predicted by the proposed model, even though they are not experimentally validated. These are tagged 'not reported' in the respective tables. The list of 'not reported' cases is not exhaustive.

It is the primary example of an RNA expressed on one chromosome that has been found to influence transcription of another chromosome. The HOTAIR gene contains 6,232 bp and encodes a 2.2 kb lncRNA molecule. HOTAIR is associated with many diseases and its aberrant expression causes the progression of various cancers. It is classified as an oncogenic lncRNA.

Twenty-five major functions of HOTAIR predicted by the model are shown in Table 5. The study successfully predicted almost all the function associations of HOTAIR for which evidence exists in current literature.
[...] Syndrome and acts as a tumour suppressor in some cancers. It is involved in all stages of tumorigenesis. It has a highly conserved structure and its function depends on the structure [33]. H19 is associated with hypertension, coronary artery disease, atherosclerosis, ischemia, and heart failure [34]. Twenty-five important functions of H19 predicted by the model are summarised in Table 6. It can be observed that the most important and experimentally proven function associations of the lncRNA H19 are predicted successfully by the model.

6 CONCLUSION AND FUTURE WORK
Growing evidence for the functional roles played by lncRNAs in biological and cellular activities has shaped the contemporary research issue of fast and efficient functional annotation of lncRNAs. Since the wet-lab process to functionally annotate lncRNAs is expensive and tedious, computational alternatives have drawn tremendous research attention. The work proposed here predicts lncRNA functions from their protein interaction data and co-expression details. While the existing methods mostly concentrate on protein interaction of lncRNAs for functional annotation, this method considers lncRNA co-expression similarity and association with existing functions, in addition to protein interaction. More importantly, the method associates functions with lncRNAs even if they lack protein interaction. The study demonstrates that AvgSim applied along meta-paths can effectively evaluate the relevance of lncRNA-function pairs in an HIN. The model yielded an overall prediction accuracy of 74 percent.

One possible future research direction is the incorporation of more information about lncRNAs into the network. LncRNAs are proven to interact with different kinds of biomolecules such as RNA, miRNA and siRNA. Integration of such interaction details into the model may enhance the prediction performance. Accuracy of the results may be further enhanced by incorporating more biological characteristics of lncRNAs into the model. That apart, the present work uses a correlation-based procedure to set the upper threshold of meta-path length. Replacement of this approach by a generic and well-formulated algorithm can make the methodology more effective. This will indeed benefit all HIN mining tasks using meta-paths, irrespective of their application domains.

REFERENCES
[1] M. Sun and W. L. Kraus, “From discovery to function: The expanding roles of long noncoding RNAs in physiology and disease,” Endocrine Rev., vol. 36, no. 1, pp. 25–64, 2015.
[2] P. Johnsson, L. Lipovich, D. Grander, and K. V. Morris, “Evolutionary conservation of long non-coding RNAs; sequence, structure, function,” Biochimica et Biophysica Acta (BBA)-General Subjects, vol. 1840, no. 3, pp. 1063–1071, 2014.
[3] J. E. Wilusz, H. Sunwoo, and D. L. Spector, “Long noncoding RNAs: Functional surprises from the RNA world,” Genes Develop., vol. 23, no. 13, pp. 1494–1504, 2009.
[4] J. S. Mattick and I. V. Makunin, “Non-coding RNA,” Hum. Mol. Genetics, vol. 15, no. suppl_1, pp. R17–R29, 2006.
[5] J. E. Wilusz, H. Sunwoo, and D. L. Spector, “Long non coding RNAs: Functional surprises from the RNA world,” Genes Develop., vol. 23, no. 13, pp. 1494–1504, 2009.
[6] Q. Guo et al., “Comprehensive analysis of lncRNA-mRNA co-expression patterns identifies immune-associated lncRNA biomarkers in ovarian cancer malignant progression,” Sci. Rep., vol. 5, 2015, Art. no. 17683.
[7] H. Ma, Y. Hao, X. Dong, Q. G. J. Chen, J. Zhang, and W. Tian, “Molecular mechanisms and function prediction of long noncoding RNA,” The Sci. World J., vol. 2012, no. 3, Nov. 2012, Art. no. 11.
[8] J. L. Rinn and H. Y. Chang, “Genome regulation by long noncoding RNAs,” Annu. Rev. Biochem., vol. 81, no. 1, pp. 145–166, 2012.
[9] Q. Liao et al., “Large-scale prediction of long non-coding RNA functions in a coding–non-coding gene co-expression network,” Nucleic Acids Res., vol. 39, no. 9, pp. 3864–3878, 2011.
[10] X. Guo et al., “Long non-coding RNAs function annotation: A global prediction method based on bi-colored networks,” Nucleic Acids Res., vol. 41, no. 2, 2013, Art. no. e35.
[11] Y. Xiao et al., “Predicting the functions of long noncoding RNAs using RNA-seq based on Bayesian network,” BioMed Res. Int., vol. 2015, pp. 1–14, Mar. 2015.
[12] F. Chen and Y.-P. P. Chen, “Exploring the ncRNA-ncRNA patterns based on bridging rules,” J. Biomed. Informat., vol. 43, no. 3, pp. 569–577, 2010.
[13] Q. Jiang et al., “LncRNA2Function: A comprehensive resource for functional investigation of human lncRNAs based on RNA-seq data,” BMC Genomics, vol. 16, no. 3, 2015, Art. no. S2.
[14] J. Yang, A. Li, M. Ge, and M. Wang, “Relevance search for predicting lncRNA-protein interactions based on heterogeneous network,” Neurocomputing, vol. 206, no. 3, pp. 81–88, 2016.
[15] D. Xiao, X. Meng, Y. Li, C. Shi, and B. Wu, “AVGSIM: Relevance measurement on massive data in heterogeneous networks,” J. Theor. Appl. Inf. Technol., vol. 84, pp. 101–110, Feb. 2016.
[16] M. Kuhn et al., “Caret package,” J. Statist. Softw., vol. 28, no. 5, pp. 1–26, 2008.
[17] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu, “A survey of heterogeneous information network analysis,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 1, pp. 17–37, Jan. 2017.
[18] S. Yizhou and H. Jiawei, Mining Heterogeneous Information Networks: Principles and Methodologies. San Rafael, CA, USA: Morgan and Claypool, 2012.
[19] Y. Hao et al., “NPInter v3.0: An upgraded database of noncoding RNA-associated interactions,” Database, vol. 2016, 2016, Art. no. baw057.
[20] Y. Zhao et al., “NONCODE 2016: An informative and valuable data source of long non-coding RNAs,” Nucleic Acids Res., vol. 44, no. D1, pp. D203–D208, 2016.
[21] Y. Xiao, J. Zhang, and L. Deng, “Prediction of lncRNA-protein interactions using HeteSim scores based on heterogeneous networks,” Sci. Rep., vol. 7, no. 1, 2017, Art. no. 3664.
[22] A. Li, M. Ge, Y. Zhang, C. Peng, and M. Wang, “Predicting long noncoding RNA and protein interactions using heterogeneous network model,” BioMed Res. Int., vol. 2015, 2015, Art. no. 671950.
[23] A. Franceschini et al., “STRING v9.1: Protein-protein interaction networks, with increased coverage and integration,” Nucleic Acids Res., vol. 41, no. D1, pp. D808–15, 2013.
[24] The UniProt Consortium, “UniProt: The universal protein knowledgebase,” Nucleic Acids Res., vol. 45, no. D1, pp. D158–D169, 2017.
[25] J. B. Zhi-Liang Hu and J. M. Reecy, “CateGOrizer: A web-based program to batch analyze gene ontology classification categories,” Online J. Bioinf., vol. 9, no. 2, pp. 108–112, 2008.
[26] J. L. Rinn et al., “Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs,” Cell, vol. 129, no. 7, pp. 1311–1323, 2007.
[27] R. Kogo et al., “Long noncoding RNA HOTAIR regulates polycomb-dependent chromatin modification and is associated with poor prognosis in colorectal cancers,” Cancer Res., vol. 71, no. 20, pp. 6320–6326, 2011.
[28] T. Gutschner and S. Diederichs, “The hallmarks of cancer: A long non-coding RNA point of view,” RNA Biol., vol. 9, pp. 703–19, Jun. 2012.
[29] M.-C. Tsai, R. C. Spitale, and H. Y. Chang, “Long intergenic non-coding RNAs: New links in cancer progression,” Cancer Res., vol. 71, no. 1, pp. 3–7, 2011.
[30] Y. Li et al., “LncRNA ontology: Inferring lncRNA functions based on chromatin states and expression patterns,” Oncotarget, vol. 6, no. 37, 2015, Art. no. 39793.
[31] T. Yiwei, H. Hua, G. Hui, M. Mao, and L. Xiang, “HOTAIR interacting with MAPK1 regulates ovarian cancer skov3 cell proliferation, migration, and invasion,” Med. Sci. Monitor: Int. Med. J. Exp. Clin. Res., vol. 21, 2015, Art. no. 1856.
[32] T.-L. Cheng and Z. Qiu, “Long non-coding RNA tagging and expression manipulation via CRISPR/Cas9-mediated targeted insertion,” Protein Cell, vol. 9, pp. 820–825, 2018.
[33] E. Raveh, I. J. Matouk, M. Gilon, and A. Hochberg, “The H19 long non-coding RNA in cancer initiation, progression and metastasis – A proposed unifying theory,” Mol. Cancer, vol. 14, no. 1, 2015, Art. no. 184.
[34] C. P. Gomes et al., “The function and therapeutic potential of long non-coding RNAs in cardiovascular development and disease,” Mol. Ther. - Nucleic Acids, vol. 8, pp. 494–507, 2017.
[35] C.-X. Li et al., “H19 lncRNA regulates keratinocyte differentiation by targeting miR-130b-3p,” Cell Death Disease, vol. 8, no. 11, 2017, Art. no. e3174.
[36] J. Zhou et al., “H19 lncRNA alters DNA methylation genome wide by regulating S-adenosylhomocysteine hydrolase,” Nature Commun., vol. 6, 2015, Art. no. 10221.
[37] W. Yang, N. Ning, and X. Jin, “The lncRNA H19 promotes cell proliferation by competitively binding to miR-200a and derepressing β-catenin expression in colorectal cancer,” BioMed Res. Int., vol. 2017, 2017, Art. no. 2767484.
[38] J. Zhou et al., “H19 lncRNA alters DNA methylation genome wide by regulating S-adenosylhomocysteine hydrolase,” Nature Commun., vol. 6, 2015, Art. no. 10221.
[39] Y. Huang, Y. Zheng, C. Jin, X. Li, L. Jia, and W. Li, “Long non-coding RNA H19 inhibits adipocyte differentiation of bone marrow mesenchymal stem cells through epigenetic modulation of histone deacetylases,” Sci. Rep., vol. 6, 2016, Art. no. 28897.
[40] S.-C. Tao, B.-Y. Rui, Q.-Y. Wang, D. Zhou, Y. Zhang, and S.-C. Guo, “Extracellular vesicle-mimetic nanovesicles transport LncRNA-H19 as competing endogenous RNA for the treatment of diabetic wounds,” Drug Delivery, vol. 25, no. 1, pp. 241–255, 2018.
[41] J. Pan, “LncRNA H19 promotes atherosclerosis by regulating MAPK and NF-κB signaling pathway,” Eur. Rev. Med. Pharmacol. Sci., vol. 21, no. 2, pp. 322–328, 2017.
[42] J. Zhang, Z. Zhang, Z. Wang, Y. Liu, and L. Deng, “Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification,” Bioinformatics, vol. 34, pp. 1750–1757, Dec. 2017.

Adheeba Thahsin received the PG degree in computer science and engineering from the Department of Computer Science and Engineering, National Institute of Technology Calicut, India in 2018. She currently works for Nokia Networks Chennai, India. She is interested in bioinformatics, data mining, and networks.

Manju M received the PhD degree in life sciences from the University of Kerala, India, in 2011. Currently she is working as an assistant professor with the Department of Zoology, KSM DB College Sasthamcotta, Kerala, India since July 2012. Her research interests include molecular biology, histopathology, phytochemistry, and bioinformatics.

Gopakumar G (Member, IEEE) received the PhD degree in bioinformatics from the University of Kerala, India, in 2013. Currently he is working as an assistant professor with the Department of Computer Science and Engineering, National Institute of Technology Calicut, India since July 2010. His research interests include RNA bioinformatics, biological network analysis, and data mining. He is a member of the ACM.