
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 19, NO. 1, JANUARY/FEBRUARY 2022 255

A Heterogeneous Information Network Model for Long Non-Coding RNA Function Prediction
Sunil Kumar P V , Adheeba Thahsin , Manju M, and Gopakumar G

Abstract—Exciting information on the functional roles played by long non-coding RNA (lncRNA) has drawn substantial research
attention these days. With the advent of techniques such as RNA-Seq, thousands of lncRNAs are identified in very short time spans.
However, due to the poor annotation rate, only a few of them are functionally characterised. The wet lab experiments to elucidate
lncRNA functions are challenging, slow progressing and sometimes prohibitively expensive. This work attempts to solve the crucial
problem of developing computational methods to predict lncRNA functions. The model presented here predicts the functions of
lncRNAs by making use of a meta-path based measure, AvgSim, on a Heterogeneous Information Network (HIN). The network is
constructed from existing protein and function association data of lncRNAs, lncRNA co-expression data and protein-protein interaction
data. Out of the 2,758 lncRNAs considered for the experiment, the proposed method predicts possible functions for 2,695 lncRNAs with
an accuracy of 73.68 percent and is found to perform better than other state-of-the-art approaches on an independent test set. A case
study of two well-known lncRNAs (HOTAIR and H19) is conducted and the associated functions are identified. The results were
validated using experimental evidence from the literature. The script and data used for the implementation of the model are freely
available at: http://bdbl.nitc.ac.in/LncFunPred/index.html.

Index Terms—LncRNA, heterogeneous information network, meta-path, classification, random forest, AvgSim

• S.K. P V is with the Department of Computer Science and Engineering, National Institute of Technology Calicut, NIT Campus (PO), Kozhikkode, Kerala 673601, India, and also with the Department of CSE, CMR Institute of Technology Bangalore, AECS Layout (PO), Bangalore, Karnataka 560037, India. E-mail: sunilkumarpv_p130073cs@nitc.ac.in.
• A. Thahsin and G. G are with the Department of Computer Science and Engineering, National Institute of Technology Calicut, NIT Campus (PO), Kozhikkode, Kerala 673601, India. E-mail: adeeba.thahsin@gmail.com, gopakumarg@nitc.ac.in.
• M. M is with the Department of Zoology, KSM Devaswom Board College Sasthamkotta, Kollam, Kerala 690521, India. E-mail: manjugopkumar@gmail.com.

Manuscript received 14 Aug. 2018; revised 27 Feb. 2020; accepted 3 June 2020. Date of publication 8 June 2020; date of current version 3 Feb. 2022.
(Corresponding author: Sunil Kumar P V.)
Recommended for acceptance by R. Backofen.
Digital Object Identifier no. 10.1109/TCBB.2020.3000518
1545-5963 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

1 INTRODUCTION

Long non-coding RNAs (lncRNAs) are RNA transcripts longer than 200 nucleotides. They lack key properties such as protein-coding potential and sequence conservation, which are considered indispensable for functional roles [1], [2]. Until a decade ago, genomics considered most of the non-coding sections of RNA as ‘junk’, containing no useful genetic information. However, recent studies suggest that lncRNAs are involved in numerous biological activities such as cell-type specific expression, localization to sub-cellular components, association with human diseases and many more [3]. These biological and cellular properties implicate lncRNAs’ functionality, and they are no longer considered ‘junk’ [4], [5].

Although a large number of lncRNAs have been identified so far, only a small number of them are properly annotated. The gap between the rates of discovery and annotation of lncRNAs accounts for the limited functional knowledge about lncRNAs. Wet lab experiments to annotate lncRNA functions are expensive, time-consuming and tiresome [6]. Thus the development of computational alternatives that predict lncRNA functions is a pressing requirement in lncRNA research.

The naive approaches to predict functions of biomolecules rely heavily on their sequence and structural information. However, this strategy is ineffective for lncRNAs due to their low sequence conservation and the poor knowledge about their structure-function association. Consequently, a new realm of research emerged wherein network science is applied to lncRNA research. The network-based approaches rely more on the implicit information contained in the network than on biological features of lncRNAs. Hence, such efforts became popular in a short time span.

Network-based methods have their foundation on two principles. The first is the “guilt by association” principle, which states that genes regulating a biological process may be co-expressed with genes involved in the same process [7]. The second is that biomolecules interact with each other while performing a function [8]. Therefore, while designing a model that predicts the functions of lncRNAs, the interactions of lncRNAs, including their co-expression, deserve predominant consideration.

The majority of the existing approaches for computational prediction of lncRNA functions use the interactive relationships between lncRNA and protein as the primary input to construct the networks. Liao et al. [9] constructed a coding-non-coding gene co-expression (CNC) network and assigned functions to around 340 lncRNAs based on the functions of their neighbouring proteins. This work was one of the initial attempts towards lncRNA function prediction. Here the possibility of protein interactions was not considered. A global lncRNA function prediction tool (lnc-GFP) was developed by Xingli Guo et al. [10] by incorporating


protein interaction data into the co-expression network used in Liao et al. They annotated 1,625 lncRNAs with functional characteristics. But none of these methods used Next Generation Sequencing (NGS) data for processing. Later on, Yun Xiao et al. [11] used a Bayesian network of lncRNA and protein built from transcript profiles derived from RNA-Seq data. This method assigned functions to 762 lncRNAs by functional enrichment of highly interconnected proteins. To annotate ncRNA functions, Feng Chen and Yi-Ping Phoebe Chen [12] applied bridging rule mining. They used two different measures to explore the relationship between ncRNAs, one a linearity measure and the other a non-linearity measure. Then, based on the association rule, functions of ncRNAs are speculated. But this method is not exclusively devoted to lncRNAs. Qinghua Jiang et al. [13] performed a hypergeometric test on lncRNA-protein co-expression data from RNA-Seq to predict lncRNA function. They mapped 9,625 lncRNAs to their functions as well as pathways.

All the network-based methods found in current literature focus on lncRNA-protein interaction as a crucial metric to devise a model that predicts the functions of lncRNAs. Hence these methods can predict the functions only of lncRNAs that have known protein associations. In order to take full advantage of a network-based model, the lncRNA-lncRNA links are also to be taken into account. The proposed work considers lncRNA-lncRNA links as well and predicts the functions of lncRNAs even in the absence of known protein associations.

Here, the problem of predicting functions of lncRNAs which do not have known protein associations is addressed by the incorporation of lncRNA co-expression data. A Heterogeneous Information Network (HIN) is built with lncRNA, protein and function as node types and (a) protein-protein interaction, (b) protein-function association, (c) lncRNA-protein interaction, (d) lncRNA co-expression, and (e) known lncRNA-function association as edge types. The application of the method AvgSim to quantify the degree of relatedness of a pair of nodes in the HIN is inspired by the work of J. Yang et al. [14], who adopted AvgSim from the proposal by D. Xiao et al. [15].

LncRNA-function pairs can be connected through various paths, termed meta-paths in HIN terminology. Each meta-path carries a semantic meaning, which needs to be interpreted properly. If an lncRNA is connected to a function through a protein node, the path is ‘lncRNA-protein-function’ and its semantic meaning is that the function is performed by the lncRNA through protein interaction. Similarly, the meta-path ‘lncRNA-lncRNA-function’ is interpreted as the function being performed by lncRNA molecules in combination. Meta-paths and their semantic meanings are discussed in more detail in Section 2.2.

After the HIN construction, the functionally relevant meta-paths are extracted through a correlation analysis. The relatedness measure, AvgSim, is computed along such relevant meta-paths. The AvgSim scores along the various meta-paths are combined to form the features for a Random Forest classifier which performs the prediction of lncRNA functions.

In contrast to the existing methods which use only lncRNA-protein interaction, the proposed work exploits meta-path based information of the HIN for prediction. Possible functions are assigned to a total of 2,695 lncRNAs by the method. The correctness is verified statistically by cross-validation and reconfirmed through mining recent literature. A case study of two well-studied lncRNAs is also conducted and the results are validated.

The rest of the paper is organised as follows: Section 2 deals with the input data, methodologies and algorithms used for the construction of the prediction model. Various statistical parameters used to validate the prediction performance of the model and the details of the classifier used are also outlined. Section 3 explains the results obtained. A detailed discussion on the implications of the results is provided in Section 4. The outcomes of a case study form the matter for Section 5. Section 6 provides the concluding remarks and future directions.

2 MATERIALS AND METHODS

This section describes the formulation of the prediction model and the derivation of the data set used to conduct the experiments. To begin with, the major HIN concepts used further in the discussion are formally defined.

Heterogeneous Information Network
An Information Network [17], [18] is defined as a directed graph $G = (V, E)$ with an object type mapping function $\varphi : V \to A$ and a link type mapping function $\psi : E \to R$, where each object $v \in V$ belongs to one particular object type $\varphi(v) \in A$, each link $e \in E$ belongs to a particular relation $\psi(e) \in R$, and if two links belong to the same relation type, the two links share the same starting object type as well as the ending object type.
The information network is called a heterogeneous information network if the number of types of objects $|A| > 1$ or the number of types of relations $|R| > 1$.

Network Schema
The network schema [17], [18], denoted as $T_G = (A, R)$, is a meta-template for an information network $G = (V, E)$ with the object type mapping $\varphi : V \to A$ and the link type mapping $\psi : E \to R$, which is a directed graph defined over object types $A$, with edges as relations from $R$. The network schema of a heterogeneous information network specifies type constraints on the sets of objects and relationships among the objects.

Meta-Path
A meta-path [17], [18] $P$ is a path defined on a schema $T_G = (A, R)$, and is denoted in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between objects $A_1, A_2, \ldots, A_{l+1}$, where $\circ$ denotes the composition operator on relations.

2.1 Heterogeneous LncRNA-Protein-Function Network (HLPFN)

The Heterogeneous LncRNA-Protein-Function Network (HLPFN) is a heterogeneous interconnection of five different interaction networks constructed from protein interactions, lncRNA co-expression, lncRNA functional associations, lncRNA-protein interactions and protein-function associations. All relationships are kept as separate adjacency matrices to maintain the heterogeneity of nodes and edges. The various adjacency


matrices used to construct the HIN are shown in Fig. 1a. The construction of the adjacency matrices of the individual networks is described below.

2.1.1 LncRNA-Protein Interaction Network

LncRNA-protein interaction data is collected from NPInter 3.0 [19]. It contains interactions of ncRNAs with different kinds of biomolecules. The lncRNAs are filtered from the ncRNAs using NONCODE ID [20] and the interaction level is restricted to ‘RNA-protein’ to retrieve proteins. The interactions are restricted to ‘Homo sapiens’. The lncRNA-protein interaction network is represented as an adjacency matrix with lncRNAs as row names and proteins as column names. The edges are added based on the interaction data collected. The adjacency matrix $M_{LP}$ is constructed as follows:

$$M_{LP}(i, j) = \begin{cases} 1, & \text{if LncRNA}_i \text{ interacts with Protein}_j \\ 0, & \text{otherwise.} \end{cases}$$

2.1.2 Protein-Protein Interaction Network

The protein interaction details are taken from the STRING v10 database [23]. STRING provides interaction details of proteins with biomolecules, with a confidence score in the interval [0,1] corresponding to every interaction. In this work, a protein pair is considered to be interacting if their confidence score is 0.3 or above. The organism is restricted to Homo sapiens. The protein interaction matrix $M_{PP}$ is constructed as follows:

$$M_{PP}(i, j) = \begin{cases} 1, & \text{if } S_{ij} > 0.3 \\ 0, & \text{otherwise,} \end{cases}$$

where $S_{ij}$ is the confidence score between Protein$_i$ and Protein$_j$.

2.1.3 LncRNA Co-Expression Network

LncRNA expression data is downloaded from the NONCODE 4.0 database [20]. It contains the expression profiles of 90,062 lncRNA transcripts in 24 different cell types. The Pearson Correlation Coefficient (PCC) between every pair of lncRNAs is computed to determine the co-expressing lncRNAs. Let $X = \{x_1, \ldots, x_{24}\}$ and $Y = \{y_1, \ldots, y_{24}\}$ be the expression profiles of lncRNAs $i$ and $j$ respectively. Then the lncRNA correlation matrix $P_{LL}$ is computed as follows:

$$P_{LL}(i, j) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}.$$

Subsequently, the lncRNA adjacency matrix $M_{LL}$ is constructed as

$$M_{LL}(i, j) = \begin{cases} 1, & \text{if } P_{LL}(i, j) > 0 \\ 0, & \text{otherwise,} \end{cases}$$

followed by $M_{LL}(i, i) = 0$ to remove self loops.

2.1.4 LncRNA Function Association Network

The functions of lncRNAs predicted using the lnc-GFP method are available in NONCODE v4.0. The lnc-GFP set already includes the functions predicted by Liao et al. [9]. It contains ten functions for 1,961 lncRNAs. This data is considered as the known associations between lncRNAs and functions. The matrix for known lncRNA-function association, $M_{LF}$, is defined as

$$M_{LF}(i, j) = \begin{cases} 1, & \text{if LncRNA}_i \text{ has Function}_j \\ 0, & \text{otherwise.} \end{cases}$$

2.1.5 Protein Function Association Network

The functions associated with proteins are obtained from the UniProt database [24], which consists of the mapping of protein molecules to 17,073 unique functional GO terms. The protein-function association matrix is constructed from this data by the following equation:

$$M_{PF}(i, j) = \begin{cases} 1, & \text{if Protein}_i \text{ has Function}_j \\ 0, & \text{otherwise.} \end{cases}$$

HLPFN Construction
The HLPFN is constructed by integrating these sub-networks. It has been inferred from recent studies that, most of the time, an lncRNA functions through interactions with proteins. The inclusion of the lncRNA-protein interaction network helps to capture this information. If an lncRNA $l_i$ interacts with a protein $p_i$ and if $p_i$ interacts with another protein $p_j$, then there is a probability for $l_i$ to interact with protein $p_j$. The protein-protein interaction sub-network identifies such relationships. The lncRNA co-expression network, in turn, imparts vital information about functions shared by similar lncRNAs. The functional association networks of lncRNAs and proteins are used to enrich the HIN with already known functions.

The final network (HLPFN), obtained by integrating the sub-networks, consists of thousands of nodes with dense interconnections. The network schema of HLPFN is shown in Fig. 1b. It contains three node types: lncRNA (l), function (f) and protein (p). The link types between protein-protein and lncRNA-protein are interacts with, the link type between lncRNA and function is performs or performed by, and the link type between protein and function is also performs or performed by. A portion of the HLPFN developed is shown in Fig. 2. The numbers of nodes and links of different types are summarised in Table 1.

2.2 Selection of Relevant Meta-Paths

In any HIN based study, the meta-path is an influential parameter which plays a key role in extracting useful information from the HIN. In this work, to quantify the relatedness between lncRNA objects and function objects of HLPFN, a meta-path based measure known as AvgSim [15] is used. Since AvgSim is a path constrained measure, the selection of relevant meta-paths is vital. In homogeneous networks all paths between any pair of objects are semantically alike and are relatively easy to handle. In contrast, an HIN contains several kinds of paths between any given pair of objects. Each of these differs significantly in semantics, even though they connect the same pair of objects, and demands meticulous manipulation.
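As an illustration of the matrix definitions in Section 2.1, the following Python sketch builds binary adjacency matrices from toy inputs. It is not the authors' implementation: the helper names (`interaction_matrix`, `ppi_matrix`, `coexpression_matrix`) and all toy values are hypothetical, and NumPy is assumed to be available. The strict cut-off follows the formula for M_PP; it is exposed as a parameter because the accompanying text speaks of a score of "0.3 or above".

```python
import numpy as np

def interaction_matrix(pairs, rows, cols):
    """Binary matrix with entry (i, j) = 1 if (rows[i], cols[j]) is a listed pair."""
    r = {name: i for i, name in enumerate(rows)}
    c = {name: j for j, name in enumerate(cols)}
    m = np.zeros((len(rows), len(cols)), dtype=int)
    for a, b in pairs:
        if a in r and b in c:          # ignore pairs outside the node sets
            m[r[a], c[b]] = 1
    return m

def ppi_matrix(scores, cutoff=0.3):
    """M_PP from a square matrix of confidence scores S_ij in [0, 1]."""
    return (np.asarray(scores) > cutoff).astype(int)

def coexpression_matrix(expr):
    """M_LL from an (n_lncRNA x n_cell_types) expression array."""
    pcc = np.corrcoef(expr)            # P_LL: Pearson correlation between rows
    m = (pcc > 0).astype(int)          # link when P_LL(i, j) > 0
    np.fill_diagonal(m, 0)             # M_LL(i, i) = 0 removes self loops
    return m

# Toy data (hypothetical identifiers and values).
lncs, prots = ["l1", "l2", "l3"], ["p1", "p2"]
m_lp = interaction_matrix([("l1", "p1"), ("l3", "p2")], lncs, prots)
m_pp = ppi_matrix([[0.0, 0.45], [0.45, 0.0]])
expr = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.5],
                 [4.0, 3.0, 2.0, 1.0]])
m_ll = coexpression_matrix(expr)
```

Keeping one matrix per relation, as the paper does, means that later path computations reduce to products of these matrices.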


Fig. 1. Work flow of predicting lncRNA function. (a) Heterogeneous lncRNA-protein-function network construction. First the homogeneous networks of protein-protein interaction, lncRNA co-expression and lncRNA-function association are built from adjacency matrices of protein-protein interaction data (STRING), lncRNA co-expression data (NONCODE v4.0) and lncRNA-function association data (Gene Ontology). The lncRNA-protein interaction (NPInter 3.0) matrix connects the lncRNA co-expression and protein-protein interaction networks. They are connected to functions using known lncRNA-function association data (NONCODE 4.0) and protein-function association data (UniProt). (b) The network schema of the heterogeneous lncRNA-protein-function network, consisting of three node types (lncRNA, protein, and function) and the edges connecting them. Meta-paths are generated by a depth-first search in the network schema with the given path length. (c) Computation of the AvgSim score along all the meta-paths. (d) The score for every lncRNA-function pair along different meta-paths is arranged as a row vector and a two-dimensional matrix is constructed by combining these row vectors. This acts as the feature set to train the random forest classifier.
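The depth-first generation of meta-paths mentioned in panel (b) can be sketched in Python as follows. This is a minimal sketch, not the authors' script: the function name `enumerate_meta_paths` and the `SCHEMA` dictionary are hypothetical, and the schema treats the function type as a terminal node, an assumption chosen because it reproduces the eight candidate meta-paths of length five reported in Section 2.2.

```python
def enumerate_meta_paths(schema, start, end, length):
    """Depth-first search over the schema for type sequences of `length` objects."""
    found = []

    def dfs(path):
        if len(path) == length:
            if path[-1] == end:
                found.append("".join(path))
            return
        for nxt in schema[path[-1]]:   # follow every link type leaving the last node type
            dfs(path + [nxt])

    dfs([start])
    return found

# Assumed schema: l = lncRNA, p = protein, f = function (terminal by assumption).
SCHEMA = {"l": ["l", "p", "f"], "p": ["l", "p", "f"], "f": []}

for n in (3, 4, 5):
    print(n, enumerate_meta_paths(SCHEMA, "l", "f", n))
```

Under this assumption the number of lncRNA-to-function meta-paths doubles with each extra object, which illustrates why an upper bound on the path length is needed.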


TABLE 1
Details of LncRNA-Protein-Function Network

Category   Type               Count
Objects    lncRNA             2758
           protein            8501
           function           1416
Links      lncRNA-lncRNA      1947266
           lncRNA-function    15689
           lncRNA-protein     28317
           protein-function   13658
           protein-protein    58372

Fig. 2. A section of the constructed HLPFN. The rectangles, ellipses and triangles represent lncRNAs, proteins and functions respectively. The edges between them show associations between the nodes.

To demonstrate, consider the meta-paths of length three in HLPFN: lpf and llf, both connecting an lncRNA object and a function object (for simplicity, the relationship names between the objects are omitted). The semantics of meta-paths can be determined by considering the relationships connecting them. Meta-path lpf means $\text{lncRNA} \xrightarrow{\text{interacts with}} \text{protein} \xrightarrow{\text{performs}} \text{function}$, whereas llf means $\text{lncRNA} \xrightarrow{\text{co-expresses with}} \text{lncRNA} \xrightarrow{\text{performs}} \text{function}$. The former meta-path corresponds to the functions performed by an lncRNA by interacting with proteins, while the latter represents a meta-path with functions shared by co-expressing or similar lncRNAs.

A relevant meta-path is a meta-path which conveys useful information to solve the HIN based problem at hand. In most cases, relevant meta-paths are suggested by domain experts based on the peculiarities of the problem and the input data set. In our problem, the exact mechanism by which each lncRNA functions is not known. Therefore, every meta-path in the HLPFN needs to be considered equally relevant. This assumption necessitates that the selection of relevant meta-paths be automated.

The brute-force approach is to consider all the meta-paths which connect lncRNAs and functions and check for their validity. This approach is infeasible because the number of possible meta-paths increases exponentially with the meta-path length (the number of objects in the meta-path), making the problem intractable. One solution is to follow the typical trends found in HIN studies. Generally in an HIN, it is meaningless to increment the meta-path length indefinitely. In lengthy meta-paths, the chain of relations connecting the starting and ending objects can be very long. Hence indiscriminate incrementation of meta-path length often proves counterproductive and yields results with much lower accuracy [18]. In the wake of this knowledge, most HIN based techniques set an upper threshold for meta-path length in advance. They consider only the meta-paths shorter than the threshold. While setting such a threshold, it is important to make sure that the results are not badly affected by the removal of longer paths. If the results are affected, the threshold value must be reconsidered, and this practice rationalises the fixing of the threshold limit. Generally, the threshold limit can be set randomly, empirically or heuristically.

In this work, an empirical approach is employed. Starting from length three, the correlation of all lf meta-paths (those that start with lncRNA and end with function) with the actually existing lf associations is estimated iteratively. The iteration stops at the point where the count of negatively correlated meta-paths equals half the count of positively correlated meta-paths. The path length at this point is taken as the length threshold. The following two paragraphs explain this process in detail.

In the first step, an iterative process over the meta-path length is performed in which the candidate meta-paths to be included in the experiment are identified. As the problem is to find the relatedness between lncRNA and function, only those meta-paths which start with lncRNA (l) and end with function (f) are considered. The iteration would start with the shortest possible meta-path: lf, of length two. However, this path represents the existing functional association of lncRNAs and corresponds to direct lf links in the HIN. Thus in practice, the iteration starts with a meta-path length of three. In every iteration, a vector called the meta-path vector is formed for each meta-path, holding the total number of paths of that type between every pair of lncRNAs and functions.

The second step is choosing the relevant meta-paths. If the meta-path under consideration is relevant, then it follows from network properties that its count between l and f would be high. This implies that the lncRNA is quite likely to perform that function. Conversely, if the number of meta-paths happens to be low, then the probability of the lncRNA performing that function is small. In order to quantify this trend, a correlation analysis of the meta-path vector is performed against the known lf association vector. All meta-paths showing positive correlation with known lf associations are concluded to be relevant and the others irrelevant. The terminating criterion is formulated in such a way that the iteration stops when the count of negatively correlated meta-paths equals half the count of positively correlated meta-paths. As stated already, the iteration starts with a meta-path length of three and is repeated until the terminating condition is met. In this work, the iteration terminates with a length threshold of four.
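The two-step selection described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' procedure: path instances are counted by multiplying the adjacency matrices along the meta-path, the correlation is taken to be Pearson's (the text does not name the coefficient), and the stopping rule is read as "negatives reach at least half of the positives", which is an assumption. All function names and the toy matrices are hypothetical.

```python
import itertools
import numpy as np

def path_count_matrix(meta_path, adj):
    """Number of instances of `meta_path` between each start and end object."""
    counts = adj[meta_path[0] + meta_path[1]]
    for a, b in zip(meta_path[1:], meta_path[2:]):
        counts = counts @ adj[a + b]           # extend every path by one relation
    return counts

def correlation_with_known(meta_path, adj, known):
    """Pearson correlation between the meta-path vector and the known lf vector."""
    vec = path_count_matrix(meta_path, adj).ravel()
    return np.corrcoef(vec, known.ravel())[0, 1]

def select_relevant(adj, known, max_length=6):
    """Collect positively correlated lf meta-paths until the stopping rule fires."""
    relevant = []
    for length in range(3, max_length + 1):
        candidates = ["l" + "".join(mid) + "f"
                      for mid in itertools.product("lp", repeat=length - 2)]
        scores = {mp: correlation_with_known(mp, adj, known) for mp in candidates}
        positive = [mp for mp, r in scores.items() if r > 0]
        negative = [mp for mp, r in scores.items() if r < 0]
        if 2 * len(negative) >= len(positive):  # assumed reading of the stopping rule
            return relevant, length - 1
        relevant.extend(positive)
    return relevant, max_length

# Toy network: 3 lncRNAs, 2 proteins, 2 functions (hypothetical values).
m_lp = np.array([[1, 0], [0, 1], [0, 0]])
m_pf = np.array([[1, 0], [0, 1]])
m_ll = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]])
m_pp = np.array([[0, 1], [1, 0]])
m_lf = np.array([[1, 0], [0, 1], [1, 0]])
adj = {"lp": m_lp, "pl": m_lp.T, "pf": m_pf, "ll": m_ll, "pp": m_pp, "lf": m_lf}

print(path_count_matrix("lpf", adj))
print(select_relevant(adj, m_lf, max_length=3))
```

A meta-path whose count vector is constant has an undefined correlation; in this sketch it is then counted as neither positive nor negative.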


Fig. 3. Relevant meta-path selection: Correlation analysis for relevant meta-path selection with meta-path lengths 3 (dotted curve), 4 (dashed curve) and 5 (solid curve). The meta-paths lpf and llf have positive correlation with the known data set. The meta-paths of path length four, except lplf, have positive correlation. Three meta-paths of length five are negatively correlated with known lf associations.

The correlation analysis of the meta-path vectors of lengths three, four, and five is shown in Fig. 3. It is clear that in the third iteration, there exist three negatively correlated paths among the eight possible meta-paths of length five. Hence the path length is fixed to be four. Thus, the relevant paths obtained are: lpf, llf, lllf, llpf and lppf.

2.3 AvgSim

To find the relatedness between two objects in an HIN, D. Xiao et al. [15] proposed a measure called AvgSim. The AvgSim value of two objects is the average of the reachable probability under the given meta-path and under the reverse path. Given a meta-path $P$, the AvgSim between source object $s$ and target object $t$ is given by

$$\mathrm{AvgSim}(s, t \mid P) = \frac{1}{2}\left[ RW(s, t \mid P) + RW(t, s \mid P^{-1}) \right],$$

where $RW(s, t \mid P)$ is the random walk from $s$ to $t$ along the path $P$. The equation can be expanded by decomposing the meta-path $P$. Assume $P$ is a composition of relations $R_1, R_2, \ldots, R_l$, then

$$RW(s, t \mid R_1 \circ R_2 \circ \cdots \circ R_l) = \frac{1}{|O(s \mid R_1)|} \sum_{i=1}^{|O(s \mid R_1)|} RW\big(O_i(s \mid R_1), t \mid R_2 \circ \cdots \circ R_l\big)$$

and

$$RW(s, t) = \begin{cases} 1, & \text{if } s \text{ and } t \text{ are the same} \\ 0, & \text{otherwise,} \end{cases}$$

where $|O(s \mid R_1)|$ is the number of out-neighbours of $s$ based on relation $R_1$ and $O_i(s \mid R_1)$ is the $i$-th such out-neighbour. If there is no out-neighbour for $s$ under $R_1$, then the relatedness value of $s$ and $t$ is 0.

2.4 Details of the Classifier

Let l be the total number of lncRNAs and f be the total number of functions in the experiment (l = 2,758 and f = 1,416 from Table 1). The AvgSim score along each relevant meta-path for every lncRNA-function pair is computed and represented as an $l \times f$ matrix. There is a separate matrix for each meta-path, as shown in Fig. 1c. For any meta-path $k$, the entry $(i, j)$ of its corresponding matrix represents the AvgSim score along the meta-path $k$ between lncRNA $i$ and function $j$. As explained in Section 2.2, there are five relevant meta-paths in the experiment. The feature set for the classifier is constructed by deriving another matrix from the individual meta-path matrices shown in Fig. 1c. Each row of this new matrix corresponds to an lncRNA-function pair and each column to a meta-path; an entry is the AvgSim score of the row's lncRNA-function pair along the column's meta-path. This matrix is used as the feature set to train the classifier and has the structure provided in Fig. 1d. There are two class labels, one representing the positive class, which indicates that the lncRNA and function are associated, and the other representing the negative class, which indicates that the lncRNA and function are not associated.

The lncRNA-function association data, taken from the NONCODE repository, provided 15,689 positive examples. Ideally, all the non-associated pairs must form negative examples. In such a scenario, the number of negative examples would be far more than that of positive examples and can lead to a skewed training set. To avoid that, we applied random sampling on the non-associated pairs and selected 11,835 negative examples, maintaining a 57:43 ratio of positive to negative examples. The classifier is implemented using the Caret [16] package in the R language.

2.5 Performance Evaluation Metrics

The prediction performance is validated by k-fold cross-validation. True Positive (TP) and True Negative (TN) represent the samples which are correctly predicted as positives and negatives respectively. False Positive (FP) and False Negative (FN) represent the samples which are wrongly predicted as positives and negatives respectively. They are obtained as follows.

The probability assigned by the classifier to each association is taken as the predictor score for that association. All pairs are sorted based on the predictor score. The known associations with a predictor score higher than a specific threshold are True Positives, and the unknown associations with a predictor score lower than the threshold are True Negatives. False Positives are the unknown associations with a predictor score above the threshold, whereas False Negatives are known associations with a score below the threshold.

By varying the threshold value, Receiver Operating Characteristic (ROC) curves are drawn. The performance is further evaluated using statistical metrics such as accuracy, precision, recall and f-score. Another metric called coverage is used to compare performance with existing approaches. It is the proportion of lncRNAs annotated with functions to the total number of lncRNAs considered.

$$\text{Coverage} = \frac{\text{Correctly annotated lncRNAs}}{\text{Total lncRNAs}}.$$
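The threshold-based counts and the derived metrics of Section 2.5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a pair counts as predicted positive only when its score is strictly above the threshold (ties are treated as negative, which the text does not specify), the standard definitions of accuracy, precision, recall and f-score are used, and the function names and toy scores are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN for predictor scores against known (1) / unknown (0) labels."""
    tp = fp = tn = fn = 0
    for score, known in zip(scores, labels):
        predicted = score > threshold          # assumption: ties count as negative
        if predicted and known:
            tp += 1
        elif predicted and not known:
            fp += 1
        elif not predicted and known:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def summarise(tp, fp, tn, fn):
    """Accuracy, precision, recall and f-score from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / (tp + fp + tn + fn),
            "precision": precision, "recall": recall, "f_score": f_score}

def coverage(annotated, total):
    """Proportion of lncRNAs annotated with functions among all lncRNAs considered."""
    return annotated / total

# Toy predictor scores for four lncRNA-function pairs (hypothetical).
counts = confusion_counts([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0], threshold=0.5)
metrics = summarise(*counts)
```

Repeating `confusion_counts` over a range of thresholds yields the true and false positive rates from which an ROC curve is drawn.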

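The AvgSim definition of Section 2.3, which supplies the classifier features, can be sketched in matrix form in Python. This is an illustrative sketch, not the released implementation: it relies on the observation that the recursive random walk equals a product of row-normalised adjacency matrices, takes the reverse walk from the target back to the source along the reversed meta-path, and uses hypothetical names (`row_normalise`, `random_walk`, `avgsim`) and toy matrices. NumPy is assumed to be available.

```python
import numpy as np

def row_normalise(m):
    """Divide each row by its out-degree; rows without out-neighbours stay zero."""
    m = np.asarray(m, dtype=float)
    sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, sums, out=np.zeros_like(m), where=sums > 0)

def random_walk(mats):
    """RW matrix along a meta-path given the adjacency matrices of its relations, in order."""
    rw = row_normalise(mats[0])
    for m in mats[1:]:
        rw = rw @ row_normalise(m)
    return rw

def avgsim(mats):
    """Average of the forward walk and the reverse walk along the reversed meta-path."""
    forward = random_walk(mats)
    backward = random_walk([np.asarray(m).T for m in reversed(mats)])
    return 0.5 * (forward + backward.T)

# Toy lpf example: 2 lncRNAs, 2 proteins, 2 functions (hypothetical values).
m_lp = np.array([[1, 1], [0, 1]])
m_pf = np.array([[1, 0], [1, 1]])
scores = avgsim([m_lp, m_pf])      # entry (i, j): AvgSim of lncRNA i and function j
print(scores)
```

Flattening one such matrix per relevant meta-path and stacking the results column-wise gives one feature row per lncRNA-function pair.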

Fig. 4. Comparison of different classification models.

TABLE 2
Performance Measures

Measure     Value
Accuracy    73.68%
Precision   71.75%
Recall      72.73%
F-score     72.23%

2.6 Choice of the Classifier

Before performing the actual prediction process, the model that best suits the input data and yields the best result was determined. This is achieved by comparing the performances of different classification models for the selected data set and experimental set up. The candidate models were: (a) Artificial Neural Network (ANN), (b) Gradient-Boosting Machine (GBM), (c) Generalised Linear Models (GLM), (d) Random Forest (RF) and (e) Support Vector Machine (SVM). Each model is implemented in the R language with the default parameters and is evaluated using 10-fold cross-validation. The results are summarised in Fig. 4. The RF model showed the best accuracy among the five candidate models and was hence selected for the experiment.

3 RESULTS

The model predicted new functions for 2,695 lncRNAs with 73.68 percent accuracy. The values of other performance measures are given in Table 2. Some of the lncRNAs were predicted to have multiple functions. The function prediction results show that lncRNAs are mostly involved in biological processes rather than cellular functions or molecular functions. The method was able to predict functions of many lncRNAs which were previously unknown.

The functions are taken from the GO consortium. The GO terms are related through parent-child relationships and are grouped into functional categories such as GOSlim, GOBasic, etc. Here the functional GO terms are classified based on their GOSlim category to understand the various kinds of functions performed by lncRNAs. The category-wise list is shown in Table 3. The table shows a list of functional categories with the count of GO terms belonging to each category. It shows that the important functions

TABLE 3
Important Functions of LncRNAs

GO Class ID Definition Count GO Class ID Definition Count


GO:0008150 biological_process 2536 GO:0016301 kinase activity 23
GO:0008152 metabolism 757 GO:0005886 plasma membrane 22
GO:0003674 molecular_function 511 GO:0004871 signal transducer activity 21
GO:0007275 development 362 GO:0007610 behavior 21
GO:0005575 cellular_component 359 GO:0005654 nucleoplasm 19
GO:0016043 cell organization and biogenesis 344 GO:0005102 receptor binding 19
GO:0005623 cell 315 GO:0016032 viral life cycle 19
GO:0007154 cell communication 299 GO:0004672 protein kinase activity 18
GO:0007165 signal transduction 263 GO:0003677 DNA binding 18
GO:0019538 protein metabolism 247 GO:0040029 regulation of gene expression& epigenetic 18
GO:0006810 transport 246 GO:0009607 response to biotic stimulus 17
GO:0009058 biosynthesis 227 GO:0003723 RNA binding 17
GO:0005488 binding 227 GO:0005739 mitochondrion 16
GO:0030154 cell differentiation 194 GO:0030234 enzyme regulator activity 15
GO:0003824 catalytic activity 191 GO:0008233 peptidase activity 14
GO:0006950 response to stress 176 GO:0016049 cell growth 14
GO:0009653 morphogenesis 156 GO:0000166 nucleotide binding 14
GO:0006464 protein modification 155 GO:0008092 cytoskeletal protein binding 13
GO:0006996 organelle organization and biogenesis 149 GO:0008289 lipid binding 11
GO:0005737 cytoplasm 122 GO:0005576 extracellular region 10
GO:0005515 protein binding 122 GO:0005783 endoplasmic reticulum 10
GO:0006629 lipid metabolism 96 GO:0005768 endosome 10
GO:0007049 cell cycle 81 GO:0000228 nuclear chromosome 10
GO:0006811 ion transport 79 GO:0005794 Golgi apparatus 9
GO:0009719 response to endogenous stimulus 75 GO:0005773 vacuole 8
GO:0009605 response to external stimulus 74 GO:0003700 transcription factor activity 7
GO:0016787 hydrolase activity 71 GO:0005764 lysosome 6
GO:0009056 catabolism 71 GO:0004518 nuclease activity 6
GO:0006259 DNA metabolism 69 GO:0005198 structural molecule activity 6
GO:0016740 transferase activity 68 GO:0005730 nucleolus 5
GO:0000003 reproduction 66 GO:0019748 secondary metabolism 4

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY CALICUT. Downloaded on March 02,2022 at 11:22:49 UTC from IEEE Xplore. Restrictions apply.
262 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 19, NO. 1, JANUARY/FEBRUARY 2022

Fig. 5. Performance comparison with the methods of LncRNA2Function


and NeuranetL2GO.

performed by lncRNAs are biological process, metabolism,


development, cellular component and molecular function. Fig. 6. Relative importance of meta-paths in RF.
These results are obtained using the CateGOrizer tool [25].
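As a quick consistency check on Table 2, the F-score can be recomputed from the reported precision and recall, assuming the usual definition as their harmonic mean (the text does not state the formula explicitly). The helper name `f_score` is illustrative only.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)

# Values reported in Table 2, in percent.
precision, recall = 71.75, 72.73
f = f_score(precision, recall)
print(f"F-score recomputed from Table 2: {f:.2f}%")
```

The recomputed value agrees with the reported 72.23 percent to within the rounding of the inputs.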

3.1 Comparison With Other Approaches
Our model is compared with two state-of-the-art models, LncRNA2Function [13] and NeuraNetL2GO [42], using a separate test set, lncRNA2GO-55, provided for free download by Zhang et al. with NeuraNetL2GO [42]. The performance is compared using the metrics of F-score (F), precision (PRE) and recall (REC), and our model is found to perform the best. Fig. 5 shows the results of the performance comparison. In terms of coverage (the ratio of annotated lncRNAs to total lncRNAs) as well, the proposed method has shown satisfactory performance (Table 4).

4 DISCUSSION
This section is subdivided into two. In the first part, the various measures used in selecting meta-paths are described. In the second part, the impact of the lncRNA co-expression sub-network on the final result is explained.

4.1 Measures Taken in Meta-Paths Selection
Here, a discussion on the relative importance of meta-paths as features for the RF classifier is provided. Further, an analytic confirmation that the paths identified by the relevant path selection process are truly relevant is furnished. The final part establishes the correctness of setting the upper threshold for meta-path length as four.

4.1.1 Relative Importance of Meta-Paths
The meta-paths within the length threshold that constituted the feature set for the classifier were lpf, llf, lplf, lppf, llpf and lllf (from Section 2.2). The RF classifier ranks the features based on their importance and leverage in obtaining the final prediction result. This ranking is graphically displayed in Fig. 6. The paths lllf, llpf and llf are ranked highest. In an experimental setup where the number of direct lncRNA-protein interactions is limited, the meta-path based approach helped to reveal functions of lncRNAs through the paths lllf and llf by incorporating co-expression networks of lncRNAs.

4.1.2 Justification of Relevant Path Selection
This section justifies the correctness of the relevant meta-path selection process described in Section 2.2. The entire explanation here is based on Fig. 7. It may be recalled from Section 2.2 that the meta-paths lpf, llf, lllf, llpf and lppf were determined to be relevant and lplf to be irrelevant.
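The candidate meta-paths within a given length threshold can be enumerated mechanically. The sketch below is an illustration under explicit assumptions that are not spelled out in this part of the text: length is counted as the number of node types in the path, every path starts at an lncRNA (l) and ends at a function (f), intermediate nodes may be lncRNAs or proteins (p) in any order, and the direct path lf is excluded. The function name `enumerate_meta_paths` is hypothetical.

```python
from itertools import product

def enumerate_meta_paths(max_length, intermediates="lp", start="l", end="f"):
    """List all meta-paths start -> ... -> end with at most max_length node types.

    Assumes any sequence of intermediate node types is allowed by the network
    schema and that at least one intermediate node is required.
    """
    paths = []
    for n_mid in range(1, max_length - 1):           # number of intermediate nodes
        for mid in product(intermediates, repeat=n_mid):
            paths.append(start + "".join(mid) + end)
    return paths

paths_4 = enumerate_meta_paths(4)   # threshold used in the paper
paths_5 = enumerate_meta_paths(5)   # one step longer, for comparison
print(paths_4)
print(len(paths_4), len(paths_5))
```

Under these assumptions a threshold of four yields the six meta-paths lpf, llf, lllf, llpf, lplf and lppf, while a threshold of five more than doubles the number of candidates.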

Fig. 7. Verification of relevant path selection with accuracy. The triangles represent the accuracy when all paths are considered. Circles represent the accuracy when individual paths are removed. Diamonds represent the accuracy when a particular path is removed along with lplf. The accuracy increased when path lplf was removed, indicating that it is irrelevant. The accuracy improved when the less correlated paths lpf, llf and lppf were removed individually. However, when they were removed along with the irrelevant lplf, the accuracy decreased. This means that the prediction performance is enhanced when only lplf is removed.

TABLE 4
Comparison With Existing Methods for an Independent Test of 55 lncRNAs

Model            Annotated  Coverage (%)
LncRNA2Function         18          32.7
NeuraNetL2GO            50          90.9
Proposed Model          48          87.3
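The coverage column of Table 4 follows directly from the definition given in Section 3.1 (annotated lncRNAs divided by total lncRNAs). A minimal sketch reproducing it from the annotated counts is shown below; the variable names are illustrative.

```python
# Annotated counts from Table 4; the independent test set has 55 lncRNAs.
TOTAL_LNCRNAS = 55
annotated = {
    "LncRNA2Function": 18,
    "NeuraNetL2GO": 50,
    "Proposed Model": 48,
}

# Coverage in percent, rounded to one decimal place as in the table.
coverage = {model: round(100 * n / TOTAL_LNCRNAS, 1) for model, n in annotated.items()}

for model, pct in coverage.items():
    print(f"{model}: {pct}%")
```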


The experiment is conducted by including all the paths up to a length of four and measuring the accuracy. At first, the irrelevant path suggested by the methodology (lplf) is removed. This improved the accuracy, affirming that the path is truly irrelevant, as is clear from Fig. 7. Although the paths lpf, llf and lppf are positively correlated, they have low correlation coefficients compared to lllf and llpf (refer to Fig. 3 in Section 2.2 for the correlation coefficients). Removing the paths with high correlation coefficients, lllf and llpf, drops the accuracy, indicating that they are truly relevant. In contrast, the accuracy is not reduced (in fact, it improves) when the positively correlated meta-paths with low correlation coefficients (lpf, llf and lppf) are removed individually. Still, they are not concluded to be truly irrelevant, because when each of them is removed along with the irrelevant (and already removed) lplf, the accuracy decreases. This means that these paths have opposite effects on accuracy when removed individually and when removed along with lplf. Conclusively, lpf, llf, lllf, llpf and lppf are truly relevant and cannot be removed once lplf has been removed. Hence, this confirms that the automated selection process yielded truly relevant and irrelevant meta-paths.
Additionally, the selected relevant meta-paths are significant in terms of biological evidence as well. The irrelevant meta-path lplf has the semantics that an lncRNA interacts with a protein which in turn interacts with an lncRNA to perform a function. As per current evidence, proteins perform functions directly, and their functions are very rarely derived through lncRNAs [1]. In contrast, the semantics of all the meta-paths determined as relevant by the automated selection process are found to be closely related to biologically verified lncRNA-function mechanisms.

4.1.3 Justification for Upper Threshold on Meta-Path Length
The upper threshold for meta-path length is fixed to be four, as described in Section 2.2. The overall complexity of the model is $O(n^2 + mg + lfm \log m)$, where n is the largest of the numbers of lncRNAs, proteins and functions present in the experiment, m and g represent the number and length of meta-paths respectively, and l and f represent the numbers of lncRNAs and functions in the experiment. It may be noted that the number and length of meta-paths included in the experiment (m and g) play a major role in determining the overall complexity. Moreover, these two are the parameters which we have the flexibility to curtail, by selecting or discarding meta-paths according to length thresholds fixed in advance.
An ROC curve analysis reveals that this selection is optimum, as the improvement in result for longer meta-paths is disproportionate to the computational time invested in processing a larger number of longer meta-paths. The ROC curves for experiments with path length thresholds three, four and five are given in Fig. 8. The length-four experiment shows significantly better performance than that with three. The length-five experiment, however, does not improve the result to an extent that compensates for the additional computational overhead required to process meta-paths of length five. Conclusively, the selection of four as the upper threshold for meta-path length is optimal for the selected data set.

Fig. 8. ROC curves for prediction results for various meta-path length thresholds. The dotted line shows the ROC curve obtained with a meta-path length threshold of three. It spans a smaller area than thresholds of four and five, drawn as dashed and solid lines respectively. The ROC curves for thresholds four and five almost overlap and span nearly equal areas.

4.2 Effect of Inclusion of LncRNA Co-Expression Data
One of the novelties of this paper is the incorporation of lncRNA co-expression data into the network. This information helps to identify functionally similar lncRNAs. The co-expression profiles of lncRNAs in 24 tissues were used to find the similarity between lncRNAs. Clinical evidence ascertains that lncRNAs have tissue-specific expression and that their location greatly affects the function they perform. Therefore, tissue-specific co-expression details of lncRNAs can identify functionally related lncRNAs with much better clarity. Considering this fact, it can be concluded that the incorporation of lncRNA co-expression data has indeed helped to improve the prediction accuracy. This is supported by an ROC curve analysis of the function prediction result in a network with and without lncRNA co-expression data. The analysis is pictorially represented in Fig. 9. It shows that a consistent drop in AUC occurred when the lncRNA co-expression network was discarded from the HIN.

Fig. 9. Effect of lncRNA co-expression data on prediction result. The solid line is the ROC curve for the experiment when all interactions are considered. The dashed line represents the experiment in which lncRNA co-expression data is removed.
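The ROC comparisons of Figs. 8 and 9 reduce each curve to the area under it (AUC). A minimal sketch of that computation is given below, using the trapezoidal rule over the ROC points. The labels and scores are hypothetical toy values chosen only to illustrate a drop in AUC between two settings; they are not the paper's data, and the function names are illustrative.

```python
def roc_curve(scores, labels):
    """Return ROC points (FPR, TPR) obtained by sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical ground truth and classifier scores for two network variants.
labels = [1, 1, 0, 1, 0, 0]
scores_full = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]      # all interactions included
scores_ablated = [0.9, 0.3, 0.7, 0.6, 0.4, 0.2]   # one data source removed

auc_full = auc(roc_curve(scores_full, labels))
auc_ablated = auc(roc_curve(scores_ablated, labels))
print(f"AUC full: {auc_full:.3f}, AUC ablated: {auc_ablated:.3f}")
```

Comparing the two areas gives a single number per experiment, which is how a statement such as "a consistent drop in AUC" can be quantified.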


TABLE 5
Predicted Functions of HOTAIR

TABLE 6
Predicted Functions of H19

5 CASE STUDY
In order to further demonstrate the predictive performance of the proposed model, a case study of two well-known lncRNAs is conducted. The lncRNAs selected are HOTAIR and H19, with respective NONCODE identifiers NONHSAG011264 and NONHSAG007409. The case study primarily focuses on the predicted functional associations of these two lncRNAs that are clinically established. However, it also covers a few of the most prominent functional associations predicted by the proposed model, even though they are not experimentally validated. These are tagged ‘not reported’ in the respective tables. The list of ‘not reported’ cases is not exhaustive.

5.1 HOTAIR
HOX transcript antisense RNA (HOTAIR) [26] is an lncRNA located on chromosome 12. It is co-expressed with HOXC genes. It is the primary example of an RNA expressed on one chromosome that has been found to influence transcription of another chromosome. The HOTAIR gene contains 6,232 bp and encodes a 2.2 kb lncRNA molecule. HOTAIR is associated with many diseases and its aberrant expression causes the progression of various cancers. It is classified as an oncogenic lncRNA.
Twenty-five major functions of HOTAIR predicted by the model are shown in Table 5. The study successfully predicted almost all the function associations of HOTAIR for which evidence exists in current literature.

5.2 H19
H19 is located on chromosome 11. It plays crucial roles in diseases like Wilms Tumour 2 and Beckwith-Wiedemann


Syndrome and acts as a tumour suppressor in some cancers. It is involved in all stages of tumorigenesis. It has a highly conserved structure and its function depends on the structure [33]. H19 is associated with hypertension, coronary artery disease, atherosclerosis, ischemia, and heart failure [34]. Twenty-five important functions of H19 predicted by the model are summarised in Table 6. It can be observed that the most important and experimentally proven function associations of the lncRNA H19 are predicted successfully by the model.

6 CONCLUSION AND FUTURE WORK
Growing evidence for the functional roles played by lncRNAs in biological and cellular activities has shaped the contemporary research issue of fast and efficient functional annotation of lncRNAs. Since the wet-lab process to functionally annotate lncRNAs is expensive and tedious, computational alternatives have drawn tremendous research attention. The work proposed here predicts lncRNA functions from their protein interaction data and co-expression details. While the existing methods mostly concentrate on protein interaction of lncRNAs for functional annotation, this method considers lncRNA co-expression similarity and association with existing functions, in addition to protein interaction. More importantly, the method associates functions with lncRNAs even if they lack protein interaction. The study demonstrates that AvgSim applied along meta-paths can effectively evaluate the relevance of lncRNA-function pairs in an HIN. The model yielded an overall prediction accuracy of 73.68 percent.
One possible future research direction is the incorporation of more information about lncRNAs into the network. LncRNAs are proven to interact with different kinds of biomolecules such as RNA, miRNA and siRNA. Integration of such interaction details into the model may enhance the prediction performance. Accuracy of the results may be further enhanced by incorporating more biological characteristics of lncRNAs into the model. That apart, the present work uses a correlation-based procedure to set the upper threshold of meta-path length. Replacing this approach with a generic and well-formulated algorithm can make the methodology more effective. This would benefit all HIN mining tasks using meta-paths, irrespective of their application domains.

REFERENCES
[1] M. Sun and W. L. Kraus, “From discovery to function: The expanding roles of long noncoding RNAs in physiology and disease,” Endocrine Rev., vol. 36, no. 1, pp. 25–64, 2015.
[2] P. Johnsson, L. Lipovich, D. Grander, and K. V. Morris, “Evolutionary conservation of long non-coding RNAs; sequence, structure, function,” Biochimica et Biophysica Acta (BBA)-General Subjects, vol. 1840, no. 3, pp. 1063–1071, 2014.
[3] J. E. Wilusz, H. Sunwoo, and D. L. Spector, “Long noncoding RNAs: Functional surprises from the RNA world,” Genes Develop., vol. 23, no. 13, pp. 1494–1504, 2009.
[4] J. S. Mattick and I. V. Makunin, “Non-coding RNA,” Hum. Mol. Genetics, vol. 15, no. suppl_1, pp. R17–R29, 2006.
[5] J. E. Wilusz, H. Sunwoo, and D. L. Spector, “Long non coding RNAs: Functional surprises from the RNA world,” Genes Develop., vol. 23, no. 13, pp. 1494–1504, 2009.
[6] Q. Guo et al., “Comprehensive analysis of lncRNA-mRNA co-expression patterns identifies immune-associated lncRNA biomarkers in ovarian cancer malignant progression,” Sci. Rep., vol. 5, 2015, Art. no. 17683.
[7] H. Ma, Y. Hao, X. Dong, Q. G. J. Chen, J. Zhang, and W. Tian, “Molecular mechanisms and function prediction of long noncoding RNA,” The Sci. World J., vol. 2012, no. 3, Nov. 2012, Art. no. 11.
[8] J. L. Rinn and H. Y. Chang, “Genome regulation by long noncoding RNAs,” Annu. Rev. Biochem., vol. 81, no. 1, pp. 145–166, 2012.
[9] Q. Liao et al., “Large-scale prediction of long non-coding RNA functions in a coding–non-coding gene co-expression network,” Nucleic Acids Res., vol. 39, no. 9, pp. 3864–3878, 2011.
[10] X. Guo et al., “Long non-coding RNAs function annotation: A global prediction method based on bi-colored networks,” Nucleic Acids Res., vol. 41, no. 2, 2013, Art. no. e35.
[11] Y. Xiao et al., “Predicting the functions of long noncoding RNAs using RNA-seq based on Bayesian network,” BioMed Res. Int., vol. 2015, pp. 1–14, Mar. 2015.
[12] F. Chen and Y.-P. P. Chen, “Exploring the ncRNA-ncRNA patterns based on bridging rules,” J. Biomed. Informat., vol. 43, no. 3, pp. 569–577, 2010.
[13] Q. Jiang et al., “LncRNA2Function: A comprehensive resource for functional investigation of human lncRNAs based on RNA-seq data,” BMC Genomics, vol. 16, no. 3, 2015, Art. no. S2.
[14] J. Yang, A. Li, M. Ge, and M. Wang, “Relevance search for predicting lncRNA-protein interactions based on heterogeneous network,” Neurocomputing, vol. 206, no. 3, pp. 81–88, 2016.
[15] D. Xiao, X. Meng, Y. Li, C. Shi, and B. Wu, “AVGSIM: Relevance measurement on massive data in heterogeneous networks,” J. Theor. Appl. Inf. Technol., vol. 84, pp. 101–110, Feb. 2016.
[16] M. Kuhn et al., “Caret package,” J. Statist. Softw., vol. 28, no. 5, pp. 1–26, 2008.
[17] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu, “A survey of heterogeneous information network analysis,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 1, pp. 17–37, Jan. 2017.
[18] S. Yizhou and H. Jiawei, Mining Heterogeneous Information Networks- Principles and Methodologies. San Rafael, CA, USA: Morgan and Claypool, 2012.
[19] Y. Hao et al., “NPInter v3.0: An upgraded database of noncoding RNA-associated interactions,” Database, vol. 2016, 2016, Art. no. baw057.
[20] Y. Zhao et al., “NONCODE 2016: An informative and valuable data source of long non-coding RNAs,” Nucleic Acids Res., vol. 44, no. D1, pp. D203–D208, 2016.
[21] Y. Xiao, J. Zhang, and L. Deng, “Prediction of lncRNA-protein interactions using HeteSim scores based on heterogeneous networks,” Sci. Rep., vol. 7, no. 1, 2017, Art. no. 3664.
[22] A. Li, M. Ge, Y. Zhang, C. Peng, and M. Wang, “Predicting long noncoding RNA and protein interactions using heterogeneous network model,” BioMed Res. Int., vol. 2015, 2015, Art. no. 671950.
[23] A. Franceschini et al., “STRING v9.1: Protein-protein interaction networks, with increased coverage and integration,” Nucleic Acids Res., vol. 41, no. D1, pp. D808–15, 2013.
[24] The UniProt Consortium, “UniProt: The universal protein knowledgebase,” Nucleic Acids Res., vol. 45, no. D1, pp. D158–D169, 2017.
[25] J. B. Zhi-Liang Hu and J. M. Reecy, “CateGOrizer: A web-based program to batch analyze gene ontology classification categories,” Online J. Bioinf., vol. 9, no. 2, pp. 108–112, 2008.
[26] J. L. Rinn et al., “Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs,” Cell, vol. 129, no. 7, pp. 1311–1323, 2007.
[27] R. Kogo et al., “Long noncoding RNA HOTAIR regulates polycomb-dependent chromatin modification and is associated with poor prognosis in colorectal cancers,” Cancer Res., vol. 71, no. 20, pp. 6320–6326, 2011.
[28] T. Gutschner and S. Diederichs, “The hallmarks of cancer: A long non-coding RNA point of view,” RNA Biol., vol. 9, pp. 703–19, Jun. 2012.
[29] M.-C. Tsai, R. C. Spitale, and H. Y. Chang, “Long intergenic non-coding RNAs: New links in cancer progression,” Cancer Res., vol. 71, no. 1, pp. 3–7, 2011.
[30] Y. Li et al., “LncRNA ontology: Inferring lncRNA functions based on chromatin states and expression patterns,” Oncotarget, vol. 6, no. 37, 2015, Art. no. 39793.
[31] T. Yiwei, H. Hua, G. Hui, M. Mao, and L. Xiang, “HOTAIR interacting with MAPK1 regulates ovarian cancer skov3 cell proliferation, migration, and invasion,” Med. Sci. Monitor: Int. Med. J. Exp. Clin. Res., vol. 21, 2015, Art. no. 1856.
[32] T.-L. Cheng and Z. Qiu, “Long non-coding RNA tagging and expression manipulation via CRISPR/Cas9-mediated targeted insertion,” Protein Cell, vol. 9, pp. 820–825, 2018.


[33] E. Raveh, I. J. Matouk, M. Gilon, and A. Hochberg, “The H19 long non-coding RNA in cancer initiation, progression and metastasis–A proposed unifying theory,” Mol. Cancer, vol. 14, no. 1, 2015, Art. no. 184.
[34] C. P. Gomes et al., “The function and therapeutic potential of long non-coding RNAs in cardiovascular development and disease,” Mol. Ther. - Nucleic Acids, vol. 8, pp. 494–507, 2017.
[35] C.-X. Li et al., “H19 lncRNA regulates keratinocyte differentiation by targeting miR-130b-3p,” Cell Death Disease, vol. 8, no. 11, 2017, Art. no. e3174.
[36] J. Zhou et al., “H19 lncRNA alters DNA methylation genome wide by regulating S-adenosylhomocysteine hydrolase,” Nature Commun., vol. 6, 2015, Art. no. 10221.
[37] W. Yang, N. Ning, and X. Jin, “The lncRNA H19 promotes cell proliferation by competitively binding to miR-200a and derepressing β-catenin expression in colorectal cancer,” BioMed Res. Int., vol. 2017, 2017, Art. no. 2767484.
[38] J. Zhou et al., “H19 lncRNA alters DNA methylation genome wide by regulating S-adenosylhomocysteine hydrolase,” Nature Commun., vol. 6, 2015, Art. no. 10221.
[39] Y. Huang, Y. Zheng, C. Jin, X. Li, L. Jia, and W. Li, “Long non-coding RNA H19 inhibits adipocyte differentiation of bone marrow mesenchymal stem cells through epigenetic modulation of histone deacetylases,” Sci. Rep., vol. 6, 2016, Art. no. 28897.
[40] S.-C. Tao, B.-Y. Rui, Q.-Y. Wang, D. Zhou, Y. Zhang, and S.-C. Guo, “Extracellular vesicle-mimetic nanovesicles transport LncRNA-H19 as competing endogenous RNA for the treatment of diabetic wounds,” Drug Delivery, vol. 25, no. 1, pp. 241–255, 2018.
[41] J. Pan, “LncRNA H19 promotes atherosclerosis by regulating MAPK and NF-κB signaling pathway,” Eur. Rev. Med. Pharmacol. Sci., vol. 21, no. 2, pp. 322–328, 2017.
[42] J. Zhang, Z. Zhang, Z. Wang, Y. Liu, and L. Deng, “Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification,” Bioinformatics, vol. 34, pp. 1750–1757, Dec. 2017.

Sunil Kumar P V received the PhD degree in computer science and engineering from the Department of Computer Science and Engineering, National Institute of Technology Calicut, India. He is currently working as an associate professor with the Computer Science Department of CMR Institute of Technology, Bangalore, India. His areas of interest include computational biology, bioinformatics, data mining, and biological networks.

Adheeba Thahsin received the PG degree in computer science and engineering from the Department of Computer Science and Engineering, National Institute of Technology Calicut, India, in 2018. She currently works for Nokia Networks, Chennai, India. She is interested in bioinformatics, data mining, and networks.

Manju M received the PhD degree in life sciences from the University of Kerala, India, in 2011. She has been working as an assistant professor with the Department of Zoology, KSM DB College Sasthamcotta, Kerala, India, since July 2012. Her research interests include molecular biology, histopathology, phytochemistry, and bioinformatics.

Gopakumar G (Member, IEEE) received the PhD degree in bioinformatics from the University of Kerala, India, in 2013. He has been working as an assistant professor with the Department of Computer Science and Engineering, National Institute of Technology Calicut, India, since July 2010. His research interests include RNA bioinformatics, biological network analysis, and data mining. He is a member of the ACM.
