
Comparison tools for lncRNA identification: analysis among plants and humans


Tatianne da Costa Negri
Department of Computer Science
UNINOVE - Universidade Nove de Julho
São Paulo, Brazil
tatinegri@gmail.com

Alexandre Rossi Paschoal
Department of Computer Science
Universidade Tecnológica Federal do Paraná
Cornélio Procópio, Brazil
paschoal@utfpr.edu.br

Wonder Alexandre Luz Alves
Department of Computer Science
UNINOVE - Universidade Nove de Julho
São Paulo, Brazil
wonder@uni9.pro.br

Abstract—The main objective of this article is to evaluate the differences between the long non-coding RNAs (lncRNAs) of plants and humans. lncRNAs belong to the class of RNAs that do not encode proteins and are related to several biological functions, such as chromatin modification and post-transcriptional regulation, and in particular to the development of diseases such as cancer. In this work, we verify whether lncRNAs differ between plants and humans using state-of-the-art approaches to lncRNA identification. The main motivation is that the miRNAs (small ncRNAs) of plants and humans are known to differ in both biological and computational characteristics, whereas for lncRNAs this is still an open question. To answer it, this paper shows the results of two ncRNA prediction tools that were trained with human data and are widely used for lncRNA prediction: CPC2 and CPAT. We also show results from tools trained with plant data to predict lncRNAs in plants: RNAplonc and PLncPRO, which has two versions, one for monocots and one for dicots; and from LGC, which was trained with plant and human data. Results of further tools trained with human data are also reported: PLEK, CPPred and PredLnc-GFStack. These eight tools were applied to two test sets, one composed of eight plant species (Amborella trichopoda, Brachypodium distachyon, Citrus sinensis, Manihot esculenta, Ricinus communis, Solanum tuberosum, Sorghum bicolor, Zea mays) and the other composed of human lncRNAs.

Index Terms—tool, long RNAs, predict, non-coding, bioinformatics

I. INTRODUCTION

Non-coding RNAs (ncRNAs) constitute a heterogeneous group of RNA molecules, which can be classified in different ways according to their location, length, and biological function [1]–[3]. They belong to the class of RNAs that do not encode proteins, and they are important players in the regulation of gene transcription, splicing, and translation [4]. In addition, the recent wide application of high-throughput approaches has facilitated the identification of thousands of novel non-coding RNAs in many organisms, such as humans, animals, and plants [3], [5]. Specifically, long non-coding RNAs (lncRNAs) are ncRNAs longer than 200 nucleotides (nt) [6]. Research in this field is still in its infancy, especially in the case of plants. Thus far, only a few lncRNAs have been sufficiently described [7], [8]. In particular, research in this area is less advanced in plants than in humans and animals [3], [9].

Thus, the identification of lncRNAs is probably one of the major challenges of RNA research for the next 20 years [10]. To contribute to this task, bioinformatics approaches represent a powerful strategy to identify the best lncRNA candidates for functional characterization. Several tools have emerged in recent years, for example: CPC2 [11], CPAT [12], RNAplonc [13], PLncPRO [14], LGC [15], PLEK [16], CPPred [17], PredLnc-GFStack [18], and others [19]–[22]. These prediction tools were designed using training data sets composed of plant, human, or mixed data. This leads us to investigate in this work whether the patterns that characterize lncRNAs in plants and humans are similar. The microRNAs (i.e., ncRNAs of about 22 nt) of plants and animals are known to differ in biological and computational characteristics [23], [24], but little is known about lncRNAs in this respect.

Given this consideration, the goal of this article is to perform a cross-validation with different tools for lncRNA identification, using plant and human data sets, in order to understand how similar their predictions are. For this, we present nine tools used to identify lncRNAs. These tools were classified into three groups of tests: (1) group of plant tools; (2) group of human tools; (3) group of mixed (plant and human) tools. The tools compared in this article are: (1) three tools trained with plant data: RNAplonc [13], PLncPRO mono [14] and PLncPRO dico [14]; (2) four tools trained with human data: CPAT [12], PredLnc-GFStack [18], PLEK [16] and CPPred [17]; (3) two tools trained with a mix of plant and human data: CPC2 [11] and LGC [15]. Comparing the results, we found that RNAplonc [13] performed better on plant data, whereas CPPred [17] stood out on human data; CPC2 [11] had the best result in the mixed group, showing intermediate results on both the plant and human data sets.

The rest of the paper is organized as follows. Section II describes the method used (Subsection II-A), the metrics used to evaluate the tests (Subsection II-B) and the data sets used in the tests and comparisons (Subsection II-C). In Section III, we start with Subsection III-A, presenting

978-1-7281-9468-4/20/ $ 31,00 ©2020 IEEE


the computers used to run the tests. The results of the tests are then divided into three subsections: Subsection III-B (group 1: plant tools), Subsection III-C (group 2: human tools) and Subsection III-D (group 3: mixed (plant and human) tools). In Section IV we discuss the results, pointing out the best tools for each group on each of the data sets, human and plant. Finally, we conclude our research in Section V.

A. Tools to identify lncRNAs

The field of ncRNA research is growing rapidly. However, it is still a challenge to distinguish lncRNAs from protein-coding genes, as lncRNAs share many characteristics with mRNAs. In addition, many transcripts or incomplete genes are poorly annotated or contain sequence errors, which also causes major discrimination errors. Over the past ten years, many efforts have been made to identify lncRNAs, and many approaches have been developed to make the discrimination more precise. Several studies [25], [26] have summarized and reviewed approaches to the identification and analysis of ncRNAs, and discussed methods for predicting lncRNAs.

Different approaches aiming to detect different types of ncRNAs exist in the literature, such as: CONC [27], CPC [19], PhyloCSF [28], CPAT [12], CNCI [29], LncRNA-MFDL [30], LncTar [31], COME [32], LncRNAnet [33], CPC2 [11], PLEK [16], RNAplonc [13], PLncPRO [14], CPPred [17], PredLnc-GFStack [18], PLAIDOH [22]. In conclusion, many computational methods have proved effective in predicting non-coding RNAs (ncRNAs). They are essential to reduce the large number of potential ncRNAs and to point out the highly reliable candidates that are then used for experimental validation.

We run and compare some of these popular machine-learning-based tools, which take ".fasta" files as input, separated into three groups:

• 1- group of plant tools: this group is composed of RNAplonc [13] and the PLncPRO [14] mono and dico tools, all trained with plant sequences. RNAplonc [13]: uses a REPTree algorithm with multiple features of the open reading frame (ORF) and k-mers, trained and tested with plants. PLncPRO [14]: uses a random forest algorithm with ORF size and coverage features, together with BLASTX alignments against SwissProt to assess whether a transcript is coding, trained and tested with plants, with two models, one for monocots and one for dicots.
• 2- group of human tools: this group consists of the tools CPAT [12], PredLnc-GFStack [18], PLEK [16] and CPPred [17], which are trained with human sequences. PLEK [16]: uses a Support Vector Machine trained with an improved k-mer scheme to distinguish lncRNAs from messenger RNAs (mRNAs) in the absence of genomic sequences or annotations, trained and tested with human sequences. CPPred [17]: uses a Support Vector Machine trained with an improved k-mer scheme, features of the ORF (length and coverage) and the Fickett score, trained and tested with human sequences. PredLnc-GFStack [18]: uses a random forest algorithm with multiple features of the ORF, k-mers and the Fickett score, trained and tested with human sequences. CPAT [12]: computes the coding potential with a logistic regression based on open reading frame and nucleotide arrangement metrics, trained and tested with human sequences.
• 3- group of mixed (plant and human) tools: this group is made up of the LGC [15] and CPC2 [11] tools, which were trained with human and plant sequences. LGC [15]: uses a random forest algorithm with ORF size and GC content features, trained and tested with plant and human sequences. CPC2 [11]: evaluates the coding potential using a random forest trained with the Fickett score, ORF length, ORF integrity and isoelectric point based on the largest ORF, trained and tested with plant (Arabidopsis thaliana) and human sequences.

These tools can also be divided by purpose. Some are based on a classifier trained with a set of features extracted from transcript sequences; the classifier is then used to predict new potential lncRNAs. The most relevant tools in this category are: PLEK [16], RNAplonc [13], LGC [15], CPPred [17] and PredLnc-GFStack [18]. Others focus on detecting the coding potential of a transcript and are generally used to discard coding transcripts in lncRNA identification pipelines. However, it has recently been demonstrated that some transcripts previously classified as lncRNAs are in fact coding and represent a source of new peptides [34], [35]. The most prominent tools in this category are: CPAT [12], CPC2 [11] and PLncPRO [14].

II. MATERIALS AND METHODS

A. Applied method

In this section, we show the methodology and techniques applied to carry out the tests and obtain the results (see fig. 1).

Fig. 1. Overview of the proposed methodology to validate the tests. First, we separated two data sets (DS plants and DS human). Second, we separated the tools into 3 groups: Group 1, tools trained with plants; Group 2, tools trained with humans; and Group 3, mixed tools, trained with humans and plants.
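The design summarized in Fig. 1 (two data sets crossed with three groups of tools) can be written down as a small evaluation grid. The following is a minimal Python sketch, not part of the original pipeline: the dictionary layout and the `evaluation_grid` helper are hypothetical names introduced here for illustration, and only the tool names, their group assignment and the data set labels come from the text.

```python
# Hypothetical layout of the evaluation design in Fig. 1.
# Tool names and groups are taken from the paper; everything else is illustrative.
TOOL_GROUPS = {
    "plant": ["RNAplonc", "PLncPRO mono", "PLncPRO dico"],
    "human": ["CPAT", "PredLnc-GFStack", "PLEK", "CPPred"],
    "mixed": ["CPC2", "LGC"],
}
DATA_SETS = ["DS plants", "DS human"]


def evaluation_grid(groups, data_sets):
    """List every (group, tool, data set) combination to be evaluated."""
    return [
        (group, tool, data_set)
        for group, tools in groups.items()
        for tool in tools
        for data_set in data_sets
    ]


grid = evaluation_grid(TOOL_GROUPS, DATA_SETS)
for group, tool, data_set in grid:
    print(f"{group:>5} | {tool:<16} | {data_set}")
```

Each of the nine tools is paired with both data sets, so every tool is evaluated once on data resembling its training material and once on data from the other kingdom.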
First, we created two data sets, one for plants (DS plants) and another for humans (DS humans), which are detailed in Section II-C (test data sets). We consider three groups of tools, for which we tried to select the tools most used in the literature to identify lncRNAs in plants and humans: 1- group of plant tools (RNAplonc [13], PLncPRO [14] mono and dico), 2- group of human tools (CPAT [12], PredLnc-GFStack [18], PLEK [16] and CPPred [17]) and 3- group of mixed (plant and human) tools (LGC [15] and CPC2 [11]), detailed in Section I-A (tools to identify lncRNAs).

Having defined the two data sets and the groups of tools, we apply the cross-validation technique, which is used to assess how well a model generalizes from a set of data [36]. This technique is widely used in problems where the purpose of modeling is prediction. We then seek to estimate how accurate each model is in practice, that is, its performance on each data set. We ran the nine tools on the two data sets. To quantify the classification performance under a unified standard, we first defined, in each data set, the lncRNAs as the positive class and the protein-coding transcripts as the negative class; the performance of the tools can then be assessed with four metrics: sensitivity, specificity, accuracy and F1-score.

To better analyze and visualize the results, we created different graphs with the obtained metrics, which are presented and discussed in the next section.

B. Evaluation criteria

The trained models are evaluated according to statistical criteria [37] based on four counts: the True Positive (TP) estimator measures the lncRNAs correctly predicted; the True Negative (TN) estimator represents the coding RNAs from the negative data set that are correctly classified; the False Positive (FP) estimator describes all those negative entities that are incorrectly classified as lncRNAs; and the False Negative (FN) estimator represents those true lncRNAs that are incorrectly classified as non-lncRNAs. From these counts we compute four metrics: sensitivity (SE), specificity (SPC), accuracy (ACC) and F1-score (see Equations 1, 2, 3 and 4).

$$\mathrm{SE} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{SPC} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{ACC} = \frac{TP + TN}{TN + FP + TP + FN} \tag{3}$$

$$\text{F1-Score} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{4}$$

C. Test data sets

We defined two data sets to assess the performance of the tools and to make comparisons between plant and human lncRNAs. For the plant data set, we used eight species. For lncRNAs (positive data) we used the GreeNC [38] data set, and for negative data (mRNA) we used Phytozome [39] (see Table I). Species were chosen based on their phylogenetic diversity and on publicly available annotation. In both the lncRNA and mRNA data sets, we used only sequences longer than 200 nt, and removed sequence redundancy at 80% identity with CD-HIT-EST (v4.6.1) [40]. The mRNAs were chosen randomly, in the same number as each set of lncRNAs, to keep the tests balanced.

TABLE I
PLANT DATA SETS USED FOR TESTING. TYPE OF DATA SET: LNCRNA (POSITIVE) AND MRNA (NEGATIVE). "TOTAL" REFERS TO THE ORIGINAL NUMBER OF SEQUENCES AVAILABLE FOR EACH DATA SET, WHEREAS "USED" REFERS TO THE TOTAL NUMBER OF SEQUENCES OBTAINED AFTER THE FILTERING PROCESS WAS IMPLEMENTED.

Species                    # total lncRNA   # total mRNA   # used
Amborella trichopoda       5698             26846          3823
Brachypodium distachyon    5584             52972          4868
Citrus sinensis            2562             46147          2292
Manihot esculenta          3468             41318          3017
Ricinus communis           4198             31221          4080
Solanum tuberosum          6680             51472          5607
Sorghum bicolor            5305             47205          4541
Zea mays                   18154            63540          12071

For the human data set, we used 35324 lncRNA sequences (positive data set) identified in the FANTOM 5 [41] project, and for mRNAs we used 35324 human transcript sequences (negative data set) from the Ensembl [42] data set. The mRNAs were chosen randomly, in the same number as the lncRNAs, to keep the tests balanced.

III. RESULTS

We start the results section by briefly describing the computers used to run the tests, which are detailed in Subsection III-A.

To make the data easier to follow, we separated the results into three subsections. The first shows the results of the tools of group one, the plant tools, applied to the plant and human data sets (Subsection III-B). The second shows the results of the tools of group two, the human tools, applied to the plant and human data sets (Subsection III-C). The third shows the results of the tools of group three, the mixed (plant and human) tools, applied to the plant and human data sets (Subsection III-D).

A. Used computers

Two computers were used to perform the tests: a personal laptop and a computer available at the Informatics Laboratory (LABINFO) of the Federal Technological University of Paraná. The laptop is a Samsung with an Intel Core i7-3630QM processor (2.40 GHz), 8 GB of RAM and an NVIDIA GF108M [GeForce GT 630M] video card, running Debian Linux 10.2. The second is a server with an Intel Xeon E5-2620 v3 processor (15 MB cache, 2.40 GHz), 32 GB of RAM and an NVIDIA Titan V graphics card, running Linux Mint 19.2 Tina.
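The four metrics defined in Section II-B (Equations 1 to 4) can be computed directly from the confusion counts. The sketch below is illustrative only: the `ConfusionCounts` class, the function names and the example counts are assumptions made here, not the authors' evaluation code, and lncRNA is taken as the positive class as stated in Section II-A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion counts with lncRNA as the positive class (hypothetical container)."""
    tp: int  # lncRNAs predicted as lncRNAs
    tn: int  # mRNAs predicted as coding
    fp: int  # mRNAs predicted as lncRNAs
    fn: int  # lncRNAs predicted as coding


def sensitivity(c: ConfusionCounts) -> float:
    return c.tp / (c.tp + c.fn)                            # Equation (1)


def specificity(c: ConfusionCounts) -> float:
    return c.tn / (c.tn + c.fp)                            # Equation (2)


def accuracy(c: ConfusionCounts) -> float:
    return (c.tp + c.tn) / (c.tn + c.fp + c.tp + c.fn)     # Equation (3)


def f1_score(c: ConfusionCounts) -> float:
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)             # Equation (4)


# Made-up counts for a balanced test set of 100 lncRNAs and 100 mRNAs.
counts = ConfusionCounts(tp=90, tn=80, fp=20, fn=10)
for name, metric in [("SE", sensitivity), ("SPC", specificity),
                     ("ACC", accuracy), ("F1", f1_score)]:
    print(f"{name}: {100 * metric(counts):.2f}%")
```

Equation (4) is algebraically the harmonic mean of precision, TP / (TP + FP), and recall (which equals SE), so a low F1-score signals that at least one of the two is low.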
B. 1-Group of plant tools

This first test was applied to the data set of eight plant species listed in Table I. Tests were carried out with RNAplonc [13], a tool developed by our group and trained with plants. The other tool used for the tests is PLncPRO [14], which is also trained with plants and has two models, one for monocotyledonous plants and another for dicotyledonous plants (see Table II).

TABLE II
RESULTS ON THE PLANT DATA SET WITH PLANT-TRAINED LNCRNA TOOLS (VALUES IN %)

Data set                   Tool            SE       SPC     ACC     F1-score
Amborella trichopoda       PLncPRO mono    86.19    96.29   91.24   90.77
                           PLncPRO dico    78.11    94.85   86.48   85.24
                           RNAplonc        100.00   77.01   88.50   89.69
Brachypodium distachyon    PLncPRO mono    76.60    83.01   79.81   79.14
                           PLncPRO dico    70.52    87.80   79.16   77.19
                           RNAplonc        97.64    86.09   91.87   92.31
Citrus sinensis            PLncPRO mono    57.33    78.01   67.67   63.94
                           PLncPRO dico    47.69    94.94   71.31   62.44
                           RNAplonc        99.48    88.35   93.91   94.23
Manihot esculenta          PLncPRO mono    75.11    76.53   75.82   75.65
                           PLncPRO dico    66.85    87.60   77.23   74.59
                           RNAplonc        99.90    86.64   93.27   93.69
Ricinus communis           PLncPRO mono    85.56    84.92   85.24   85.29
                           PLncPRO dico    73.82    94.21   84.02   82.21
                           RNAplonc        99.98    81.47   90.72   91.51
Solanum tuberosum          PLncPRO mono    67.72    60.37   64.04   65.32
                           PLncPRO dico    63.47    81.31   72.39   69.69
                           RNAplonc        99.68    76.07   87.87   89.15
Sorghum bicolor            PLncPRO mono    80.03    87.91   83.97   83.31
                           PLncPRO dico    75.17    87.51   81.34   80.11
                           RNAplonc        96.34    86.15   91.25   91.67
Zea mays                   PLncPRO mono    95.88    86.96   91.42   91.79
                           PLncPRO dico    83.00    86.74   84.87   84.58
                           RNAplonc        99.36    84.94   91.56   91.54

As we can see, RNAplonc [13] performed best overall across the eight species (see Table II), being the best of the programs at predicting lncRNAs in plants (see fig. 2). The graph shows an average sensitivity (SE) of 99.05% for RNAplonc [13], with PLncPRO mono [14] second at 78.05%. In specificity (SPC), PLncPRO dico [14] has the best average, 89.37%, followed by RNAplonc [13] with 83.34%. RNAplonc [13] again has the best accuracy (ACC), with an average of 91.12%, followed by PLncPRO mono [14] with 79.90%. For the last metric, F1-score, RNAplonc [13] has the best average, 91.72%, with PLncPRO mono [14] at 79.40%.

Looking at the results of PLncPRO [14], we can see that the monocotyledon model obtains better results on dicotyledonous plants, as in the species Amborella trichopoda, Citrus sinensis, Manihot esculenta and Ricinus communis. Likewise, the dicotyledon model obtains better results on a monocotyledon, Brachypodium distachyon.

Fig. 2. Average of the sum of all plant species by metrics.

In the second test, we have the result of the human data set (Subsection II-C) tested with tools trained on plants: PLncPRO mono [14], PLncPRO dico [14] and RNAplonc [13].

Fig. 3 shows the resulting metrics for each tool. PLncPRO mono [14] has the best sensitivity (SE) with 88.52%, followed by RNAplonc [13] with 76.85%. For specificity (SPC), PLncPRO dico [14] leads with 94.15%, followed by RNAplonc [13] with 92.92%. The accuracy of RNAplonc [13] is 84.90%, while PLncPRO dico [14] reaches 83.01%. For the F1-score, RNAplonc [13] again has the best average, 83.56%, followed by PLncPRO mono [14] with 80.90%.

Fig. 3. Results of the metrics of the test performed on the human data set with tools trained with plant data.

C. 2-Group of human tools

For this test, we use the human data set (Subsection II-C) and apply CPAT [12], an ncRNA tool trained with human data. We also apply tools trained with human lncRNAs: PLEK [16], CPPred [17] and PredLnc-GFStack [18].

In the metrics graph (fig. 4), we can see that CPPred [17] has the best average sensitivity (SE), 92.55%,
followed by PLEK [16] with 83.83%. For specificity (SPC), the best tool is PredLnc-GFStack [18] with 98.60%, with CPAT [12] second at 95.57%. For the average accuracy (ACC), the best tool is CPPred [17] with 94.10%, followed by PLEK [16] with 89.46%. For the last metric, F1-score, CPPred [17] leads with 93.97%, followed by PLEK [16] with 88.19%.

Fig. 4. Results of the test performed on the human data set with tools trained with human data.

In the second test of group two, we apply the tools trained with human data to our plant test set (Subsection II-C). Note that PredLnc-GFStack [18] (PredLnc) produced no results for two plant species: Brachypodium distachyon and Sorghum bicolor (Table III).

TABLE III
RESULTS ON THE PLANT DATA SET WITH TOOLS TRAINED WITH HUMAN DATA (VALUES IN %)

Data set                   Tool       SE       SPC     ACC     F1-score
Amborella trichopoda       PLEK       100.00   74.73   87.37   88.78
                           CPPred     60.06    54.56   57.31   58.45
                           PredLnc    12.98    96.91   54.95   22.36
Brachypodium distachyon    PLEK       83.85    70.23   77.04   78.51
                           CPPred     83.11    87.18   85.15   84.84
                           PredLnc    -        -       -       -
Citrus sinensis            PLEK       94.76    63.92   79.34   82.10
                           CPPred     61.87    67.93   64.90   63.80
                           PredLnc    28.42    97.77   63.09   43.50
Manihot esculenta          PLEK       94.43    61.78   78.11   81.18
                           CPPred     64.27    68.71   66.49   65.73
                           PredLnc    50.83    93.93   72.38   64.79
Ricinus communis           PLEK       99.58    63.73   81.65   84.44
                           CPPred     58.24    52.48   55.36   56.61
                           PredLnc    10.59    98.16   54.38   18.84
Solanum tuberosum          PLEK       88.10    73.02   81.36   83.93
                           CPPred     71.07    83.02   76.42   76.91
                           PredLnc    40.21    91.92   66.06   54.23
Sorghum bicolor            PLEK       79.85    50.56   63.67   66.29
                           CPPred     82.18    76.49   79.04   77.82
                           PredLnc    -        -       -       -
Zea mays                   PLEK       92.84    62.75   77.79   80.70
                           CPPred     74.55    90.93   82.74   81.20
                           PredLnc    40.65    93.58   67.11   55.28

To compare the results, fig. 5 shows the average of the metrics. PLEK [16] has the best sensitivity (SE), 89.18%, followed by CPPred [17] with 65.68%. The best average specificity belongs to PredLnc-GFStack [18] with 84.81%, followed by CPPred [17] with 75.11%. For accuracy, PLEK [16] reaches 79.31% and CPPred [17] 70.39%. For the average F1-score, PLEK [16] is first with 80.99% and CPPred [17] second with 68.71%.

Fig. 5. Average of the metrics of the test performed on the plant data set with tools trained with human data.

D. 3- Group of mixed (plant and human) tools

In this test we have two tools that were trained with plant and human data: LGC [15] and CPC2 [11]. The first test shows the results of the two mixed tools on the plant data set (Table IV).

TABLE IV
RESULTS ON THE PLANT DATA SET WITH TOOLS TRAINED WITH HUMAN AND PLANT DATA (VALUES IN %)

Data set                   Tool    SE      SPC     ACC     F1-score
Amborella trichopoda       CPC2    65.73   83.83   74.78   72.27
                           LGC     61.97   81.18   71.58   68.56
Brachypodium distachyon    CPC2    88.58   86.32   87.45   87.59
                           LGC     89.93   81.83   85.88   86.43
Citrus sinensis            CPC2    82.42   89.92   86.17   85.63
                           LGC     74.89   86.72   80.81   79.60
Manihot esculenta          CPC2    87.67   89.00   88.33   88.25
                           LGC     77.81   87.96   82.89   81.97
Ricinus communis           CPC2    76.40   82.13   79.26   78.65
                           LGC     69.64   80.38   75.01   73.59
Solanum tuberosum          CPC2    74.84   86.85   80.21   80.69
                           LGC     68.76   83.80   76.28   74.35
Sorghum bicolor            CPC2    88.66   84.38   86.29   85.27
                           LGC     89.98   82.11   86.04   86.57
Zea mays                   CPC2    87.71   82.98   85.35   85.68
                           LGC     89.87   77.11   83.49   84.48

Looking at the tool averages in fig. 6, we can see that CPC2 [11] clearly outperforms LGC [15] on the plant data set, although none of its average metrics reaches 90%.
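The per-tool averages discussed for the mixed group can be recomputed from the Table IV values. The sketch below assumes a plain unweighted mean over the eight species, which the text does not state explicitly; the `TABLE_IV` dictionary and the `average_metrics` helper are names introduced here for illustration. Under that assumption the CPC2 accuracy average comes out at the 83.48% quoted in Section IV.

```python
from statistics import mean

METRICS = ("SE", "SPC", "ACC", "F1")

# Rows copied from Table IV, one tuple (SE, SPC, ACC, F1) per species, in table order.
TABLE_IV = {
    "CPC2": [
        (65.73, 83.83, 74.78, 72.27), (88.58, 86.32, 87.45, 87.59),
        (82.42, 89.92, 86.17, 85.63), (87.67, 89.00, 88.33, 88.25),
        (76.40, 82.13, 79.26, 78.65), (74.84, 86.85, 80.21, 80.69),
        (88.66, 84.38, 86.29, 85.27), (87.71, 82.98, 85.35, 85.68),
    ],
    "LGC": [
        (61.97, 81.18, 71.58, 68.56), (89.93, 81.83, 85.88, 86.43),
        (74.89, 86.72, 80.81, 79.60), (77.81, 87.96, 82.89, 81.97),
        (69.64, 80.38, 75.01, 73.59), (68.76, 83.80, 76.28, 74.35),
        (89.98, 82.11, 86.04, 86.57), (89.87, 77.11, 83.49, 84.48),
    ],
}


def average_metrics(rows):
    """Unweighted mean of each metric column over all species (an assumption)."""
    return {name: mean(column) for name, column in zip(METRICS, zip(*rows))}


averages = {tool: average_metrics(rows) for tool, rows in TABLE_IV.items()}
for tool, avg in averages.items():
    print(tool, {name: round(value, 2) for name, value in avg.items()})
```

A species-size-weighted mean would give more influence to Zea mays, which has by far the largest test set in Table I, so the choice of averaging matters when comparing tools.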
Fig. 6. Average of the metrics of the test performed on the plant data set with mixed (plant and human) tools.

In the second test of the mixed group, we have the results of the two tools, LGC [15] and CPC2 [11], on the human data set, as shown in fig. 7. Again CPC2 [11] outperforms LGC [15] in all metrics, with a notable difference in sensitivity, where CPC2 [11] reaches 77.21% and LGC [15] only 59.03%.

Fig. 7. Average of the metrics of the test performed on the human data set with mixed (plant and human) tools.

IV. DISCUSSION

To better compare the results, we start by analyzing two figures, fig. 8 and fig. 9, where we have grouped the three sets of tools (plant, mixed and human) on the two data sets (plants and human), using the metrics of specificity (SPC) and sensitivity (SE).

Sensitivity (SE) and specificity (SPC) are two basic concepts. SE is the tool's ability to identify lncRNAs, that is, the probability of correctly classifying an lncRNA. SPC is the tool's ability to identify an mRNA (transcript), that is, the probability of correctly classifying a coding RNA.

In these figures (fig. 8 and fig. 9), we can clearly see that the tools in the human group have better results on the human data set (fig. 8), while the plant group tools do better on the plant data set (fig. 9).

Still analyzing the two figures (fig. 8 and fig. 9), we notice that the tools of the mixed group are distributed between the best and the worst tools on the two data sets (plants and human).

Fig. 8. Result of specificity and sensitivity metrics in human data set

Fig. 9. Result of specificity and sensitivity metrics in plants data set

In the next two figures (fig. 10 and fig. 11), we have the best tool in each group (plant, mixed and human), showing accuracy (ACC) and F1-score, ordered by sensitivity (SE). ACC indicates overall model performance: among all classifications, the fraction that the model classified correctly. The F1-score, on the other hand, is the harmonic mean of precision and recall. It is simply a way of looking at one metric instead of two, and a harmonic mean lies much closer to the lower of the two values than a simple arithmetic mean does. That is, a low F1-score indicates that either the precision or the recall is low.

In fig. 10, RNAplonc [13] is the tool of the plant group that obtained the best result on the human data set. Fig. 10 also shows the tool CPPred
[17], representing the human group with the best result on the human data set. For the mixed group, the tool that obtained the best result on the human data set is CPC2 [11] (see fig. 10).

Comparing the figures (fig. 8 and fig. 10) and the test results on the human data set, we can see that the tools trained with human data obtained better results than the tools trained with plant data or the tools of the mixed group. As can be seen in fig. 10, CPPred [17] obtained an accuracy of 94.10%, while RNAplonc [13] obtained an accuracy of 82.85% on the same data set and CPC2 [11] achieved 86.24%.

In fig. 9, which shows the results on the plant data set, we can see that the tools trained with plants have better results. In fig. 11, the plant group tool RNAplonc [13] obtained an accuracy of 91.12%, while the human group tool PLEK [16] obtained an accuracy of 79.31% and the mixed group tool CPC2 [11] an accuracy of 83.48%.

Fig. 10. Accuracy result by F1-score of the human data set

Fig. 11. Accuracy result by F1-score of the plants data set

These figures make it clear that tools trained with human data perform better on human data, and tools trained with plants perform better on plant data.

V. CONCLUSIONS AND FUTURE PERSPECTIVES

Knowledge about lncRNAs, their characteristics, functions and performance within the cell is still scarce, but with this work we can show, through the results of the tests, that there are differences between plant and human lncRNAs. Tools developed and trained with plant data cannot match the results of tools developed and trained with human data when applied to a human data set, and the same is true for human-trained tools when tested on plants.

This result justifies the methodology used and allows us to answer our initial question: lncRNAs in plants and humans do differ. It also makes us think about how to quantify and describe such differences.

ACKNOWLEDGMENT

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research. We thank CAPES for the doctoral scholarship. This project has been supported by a PROBAL Grant (CAPES/DAAD - 88887.144045/2017-00).

REFERENCES

[1] B. B. Amor, S. Wirth, F. Merchan, P. Laporte, Y. d'Aubenton Carafa, J. Hirsch, A. Maizel, A. Mallory, A. Lucas, J. M. Deragon et al., “Novel long non-protein coding RNAs involved in Arabidopsis differentiation and stress responses,” Genome Research, vol. 19, no. 1, pp. 57–69, 2009.
[2] J. Liu, C. Jung, J. Xu, H. Wang, S. Deng, L. Bernad, C. Arenas-Huertero, and N.-H. Chua, “Genome-wide analysis uncovers regulation of long intergenic noncoding RNAs in Arabidopsis,” The Plant Cell, vol. 24, no. 11, pp. 4333–4345, 2012.
[3] Q.-H. Zhu and M.-B. Wang, “Molecular functions of long non-coding RNAs in plants,” Genes, vol. 3, no. 1, pp. 176–190, 2012.
[4] J. J. Quinn and H. Y. Chang, “Unique features of long non-coding RNA biogenesis and function,” Nature Reviews Genetics, vol. 17, no. 1, p. 47, 2016.
[5] M. Q. Hassan, C. E. Tye, G. S. Stein, and J. B. Lian, “Non-coding RNAs: Epigenetic regulators of bone development and homeostasis,” Bone, vol. 81, pp. 746–756, 2015.
[6] S.-Y. Ng, L. Lin, B. S. Soh, and L. W. Stanton, “Long noncoding RNAs in development and disease of the central nervous system,” Trends in Genetics, vol. 29, no. 8, pp. 461–468, 2013.
[7] J. L. Rinn, M. Kertesz, J. K. Wang, S. L. Squazzo, X. Xu, S. A. Brugmann, L. H. Goodnough, J. A. Helms, P. J. Farnham, E. Segal et al., “Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs,” Cell, vol. 129, no. 7, pp. 1311–1323, 2007.
[8] Y. Wang, X. Fan, F. Lin, G. He, W. Terzaghi, D. Zhu, and X. W. Deng, “Arabidopsis noncoding RNA mediates control of photomorphogenesis by red light,” Proceedings of the National Academy of Sciences, vol. 111, no. 28, pp. 10359–10364, 2014.
[9] Y. Bai, X. Dai, A. P. Harrison, and M. Chen, “RNA regulatory networks in animals and plants: a long noncoding RNA perspective,” Briefings in Functional Genomics, vol. 14, no. 2, pp. 91–101, 2015.
[10] T. R. Cech, “RNA World research-still evolving,” RNA, vol. 21, no. 4, pp. 474–475, Apr 2015.
[11] Y.-J. Kang, D.-C. Yang, L. Kong, M. Hou, Y.-Q. Meng, L. Wei, and G. Gao, “CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features,” Nucleic Acids Research, vol. 45, no. W1, pp. W12–W16, 2017.
[12] L. Wang, H. J. Park, S. Dasari, S. Wang, J.-P. Kocher, and W. Li, “CPAT: Coding-potential assessment tool using an alignment-free logistic regression model,” Nucleic Acids Research, vol. 41, no. 6, pp. e74–e74, 2013.
[13] T. d. C. Negri, W. A. L. Alves, P. H. Bugatti, P. T. M. Saito, D. S. Domingues, and A. R. Paschoal, “Pattern recognition analysis on long noncoding RNAs: a tool for prediction in plants,” Briefings in Bioinformatics, vol. 20, no. 2, pp. 682–689, 2019.
[14] U. Singh, N. Khemka, M. S. Rajkumar, R. Garg, and M. Jain, “PLncPRO
for prediction of long non-coding RNAs (lncRNAs) in plants and its
application for discovery of abiotic stress-responsive lncRNAs in rice and
chickpea,” Nucleic acids research, vol. 45, no. 22, pp. e183–e183, 2017.
[15] G. Wang, H. Yin, B. Li, C. Yu, F. Wang, X. Xu, J. Cao, Y. Bao,
L. Wang, A. A. Abbasi et al., “Characterization and identification of long
non-coding RNAs based on feature relationship,” Bioinformatics, vol. 35,
no. 17, pp. 2949–2956, 2019.
[16] A. Li, J. Zhang, and Z. Zhou, “PLEK: a tool for predicting long
non-coding RNAs and messenger RNAs based on an improved k-mer scheme,”
BMC bioinformatics, vol. 15, no. 1, p. 311, 2014.
[17] X. Tong and S. Liu, “CPPred: coding potential prediction based on the
global description of RNA sequence,” Nucleic acids research, vol. 47,
no. 8, pp. e43–e43, 2019.
[18] S. Liu, X. Zhao, G. Zhang, W. Li, F. Liu, S. Liu, and W. Zhang,
“PredLnc-GFStack: A global sequence feature based on a stacked ensemble
learning method for predicting lncRNAs from transcripts,” Genes, vol. 10,
no. 9, p. 672, 2019.
[19] L. Kong, Y. Zhang, Z.-Q. Ye, X.-Q. Liu, S.-Q. Zhao, L. Wei, and G. Gao,
“CPC: assess the protein-coding potential of transcripts using sequence
features and support vector machine,” Nucleic acids research, vol. 35,
no. suppl 2, pp. W345–W349, 2007.
[20] C. Pian, G. Zhang, Z. Chen et al., “LncRNApred: Classification of Long
Non-Coding RNAs and Protein-Coding Transcripts by the Ensemble
Algorithm with a New Hybrid Feature,” PLoS ONE, vol. 11, no. 5, p.
e0154567, 2016.
[21] C. M. A. Simopoulos, E. A. Weretilnyk, and G. B. Golding, “Prediction
of plant lncRNA by ensemble machine learning classifiers,” BMC
Genomics, vol. 19, no. 1, p. 316, May 2018.
[22] L. H. Pyfrom, S.C. and J. Payton, “PLAIDOH: a novel method for
functional prediction of long non-coding RNAs identifies cancer-specific
LncRNA activities,” BMC Genomics, vol. 20, no. 137, 02 2019.
[23] Y. Moran, M. Agron, D. Praher, and U. Technau, “The evolutionary
origin of plant and animal microRNAs,” Nature ecology & evolution,
vol. 1, no. 3, pp. 1–8, 2017.
[24] Z. Li, R. Xu, and N. Li, “MicroRNAs from plants to animals, do they
define a new messenger for communication?” Nutrition & metabolism,
vol. 15, no. 1, p. 68, 2018.
[25] Y. Wang, Y. Li, Q. Wang, Y. Lv, S. Wang, X. Chen, X. Yu, W. Jiang,
and X. Li, “Computational identification of human long intergenic
non-coding RNAs using a GA–SVM algorithm,” Gene, vol. 533, no. 1, pp. 94–99,
2014.
[26] G. M. Ventola, T. M. Noviello, S. D’Aniello, A. Spagnuolo,
M. Ceccarelli, and L. Cerulo, “Identification of long non-coding transcripts
with feature selection: a comparative study,” BMC bioinformatics, vol. 18,
no. 1, p. 187, 2017.
[27] J. Liu, J. Gough, and B. Rost, “Distinguishing protein-coding from
non-coding RNAs through support vector machines,” PLoS Genetics, vol. 2,
no. 4, p. e29, Apr 2006.
[28] M. F. Lin, I. Jungreis, and M. Kellis, “PhyloCSF: a comparative
genomics method to distinguish protein coding and non-coding regions,”
Bioinformatics, vol. 27, no. 13, pp. i275–282, Jul 2011.
[29] L. Sun, H. Luo, D. Bu et al., “Utilizing sequence intrinsic composition to
classify protein-coding and long non-coding transcripts,” Nucleic acids
research, vol. 41, no. 17, p. e166, Sep 2013.
[30] X.-N. Fan and S.-W. Zhang, “lncRNA-MFDL: identification of human long
non-coding RNAs by fusing multiple features and using deep learning,”
Molecular BioSystems, vol. 11, no. 3, pp. 892–897, 2015.
[31] J. Li, W. Ma, P. Zeng, J. Wang, B. Geng, J. Yang, and Q. Cui,
“LncTar: a tool for predicting the RNA targets of long noncoding
RNAs,” Briefings in Bioinformatics, vol. 16, no. 5, pp. 806–812, 12
2014. [Online]. Available: https://doi.org/10.1093/bib/bbu048
[32] L. Hu, Z. Xu, B. Hu, and Z. J. Lu, “COME: a robust coding potential
calculation tool for lncRNA identification and characterization based on
multiple features,” Nucleic Acids Research, vol. 45, no. 1, pp. e2–e2,
09 2016. [Online]. Available: https://doi.org/10.1093/nar/gkw798
[33] J. Baek, B. Lee, S. Kwon, and S. Yoon, “lncRNAnet: Long non-coding RNA
identification using deep learning,” Bioinformatics, vol. 1, p. 9, 2018.
[34] Z. Ji, R. Song, A. Regev, and K. Struhl, “Many lncRNAs, 5’UTRs, and
pseudogenes are translated and some are likely to express functional
proteins,” eLife, vol. 4, p. e08890, 2015.
[35] J. Ruiz-Orera, X. Messeguer, J. A. Subirana, and M. M. Alba, “Long
non-coding RNAs as a source of new peptides,” eLife, vol. 3, p. e03523,
2014.
[36] R. Kohavi et al., “A study of cross-validation and bootstrap for accuracy
estimation and model selection,” in IJCAI, vol. 14, no. 2. Montreal,
Canada, 1995, pp. 1137–1145.
[37] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification. John
Wiley & Sons, 2012.
[38] A. P. Gallart, A. H. Pulido, I. A. M. De Lagrán, W. Sanseverino, and
R. A. Cigliano, “GreeNC: a wiki-based database of plant lncRNAs,” Nucleic
Acids Research, vol. 44, no. Database issue, p. D1161, 2016.
[39] D. M. Goodstein, S. Shu, R. Howson, R. Neupane, R. D. Hayes, J. Fazo,
T. Mitros, W. Dirks, U. Hellsten, N. Putnam et al., “Phytozome: a
comparative platform for green plant genomics,” Nucleic acids research,
vol. 40, no. D1, pp. D1178–D1186, 2012.
[40] W. Li and A. Godzik, “CD-HIT: a fast program for clustering and
comparing large sets of protein or nucleotide sequences,” Bioinformatics,
vol. 22, no. 13, pp. 1658–1659, 2006.
[41] C.-C. Hon, J. A. Ramilowski, J. Harshbarger, N. Bertin, O. J. Rackham,
J. Gough, E. Denisenko, S. Schmeier, T. M. Poulsen, J. Severin et al.,
“An atlas of human long non-coding RNAs with accurate 5′ ends,” Nature,
vol. 543, no. 7644, pp. 199–204, 2017.
[42] T. Hubbard, D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark,
T. Cox, J. Cuff, V. Curwen, T. Down et al., “The Ensembl genome
database project,” Nucleic acids research, vol. 30, no. 1, pp. 38–41,
2002.
