
BRIEFINGS IN BIOINFORMATICS. VOL 14. NO 1. 36–45
doi:10.1093/bib/bbs010
Advance Access published on 24 March 2012

Detecting miRNAs in deep-sequencing data: a software performance comparison and evaluation
Vernell Williamson, Albert Kim, Bin Xie, G. Omari McMichael, Yuan Gao and Vladimir Vladimirov
Submitted: 9th December 2011; Received (in revised form) : 21st February 2012

Abstract
Deep sequencing has become a popular tool for novel miRNA detection, but its data must be viewed carefully as the state of the field is still undeveloped. Using three different programs, miRDeep (v1, v2), miRanalyzer and DSAP, we have analyzed seven data sets (six biological and one simulated) to provide a critical evaluation of the programs' performance. We selected this software based on its popularity and overall approach toward the detection of novel and known miRNAs using deep-sequencing data. The program comparisons suggest that, despite differing stringency levels, they all identify a similar set of known and novel predictions. Comparisons between the first and second versions of miRDeep suggest that the stringency level of each of these programs may, in fact, be a result of the algorithm used to map the reads to the target. Different stringency levels are likely to affect the number of possible novel candidates for functional verification, causing undue strain on resources and time. With that in mind, we propose that an intersection across multiple programs be taken, especially if considering novel candidates that will be targeted for additional analysis. Using this approach, we identified and performed initial validation of 12 novel predictions in our in-house data with real-time PCR, six of which have been previously unreported.
Keywords: deep sequencing; software; miRNA detection; comparison

INTRODUCTION
Without question, the discovery of miRNA has reshaped our appreciation of gene regulation. This class of non-coding RNA (ncRNA) is no longer viewed as 'junk', but rather as vital and active participants in human disease and physiology [1–5]. Previous approaches to identify novel miRNAs through computational prediction and experimental analysis have focused largely on the classic biogenesis pathway where the precursor and the mature sequences play the largest role [6–12]. These methods vary in terms of throughput, the amount of resources needed and the false-positive rate. Computational approaches, in particular, are characterized by a high false-positive rate, largely due to a heavy reliance on machine learning techniques [1, 13–16]. Deep sequencing presents a viable alternative to previous attempts, but it can be problematic in terms of data

Corresponding authors: Vernell Williamson and Vladimir Vladimirov, Department of Psychiatry, Virginia Institute for
Psychiatric and Behavioral Genetics, Medical College of Virginia of Virginia Commonwealth University, Richmond, VA, USA.
Tel: +1 804 628 7607; Fax: +1 804 828 1471; E-mail: vswilliamson@vcu.edu; vivladimirov@vcu.edu
Vernell Williamson is a PhD student in Integrative Life Sciences at Virginia Commonwealth University. Her research interests
include detection of novel miRNA and integration of large data sets.
Albert Kim received his PhD in June, 2011 from Virginia Commonwealth University. He is currently finishing his medical training
with the same university.
Bin Xie is affiliated with the division of Genomics, Epigenomics and Bioinformatics, Lieber Institute for Brain Development,
Baltimore, Maryland, USA.
Omari McMichael is a laboratory specialist at the Virginia Institute of Psychiatric and Behavioral Genetics.
Yuan Gao is affiliated with Division of Genomics, Epigenomics, and Bioinformatics, Lieber Institute for Brain Development,
Baltimore, Maryland, USA.
Vladimir Vladimirov is an assistant professor affiliated with Virginia Institute for Psychiatric and Behavioral Genetics and School of
Pharmacy at Virginia Commonwealth University. His research interests include studying miRNA, gene expression and epigenetics of
psychiatric disorders.

© The Author 2012. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com

Table 1: Other programs that may be used to predict miRNAs from deep-sequencing data

Software     Format                  File format            Location
Seqbuster    Web based, executable   fasta, tab-delimited   http://davinci.crg.es/estivill_lab/seqbuster/
miRExpress   Executable              sequence tag count     http://mirexpress.mbc.nctu.edu.tw/
miRNAKey     Executable              fasta, fastq           http://ibis.tau.ac.il/miRNAkey/
MirTools     Web based               sequence tag count     http://59.79.168.90/mirtools
miReNA       Executable              fasta                  http://www.ihes.fr/carbone/data8
miRTrap      Executable              fasta                  http://davinci.crg.es

generation and specialized skills required to adequately analyze and interpret data. Researchers interested in using deep-sequencing techniques for miRNA discovery are often confronted with a collection of bioinformatic algorithms and approaches with very little information on software comparability and performance. Here, we review three programs, which use three different approaches for miRNA detection in deep-sequencing data. We use experimental and simulated data to highlight some of their features and characteristics. The programs profiled in this article were chosen on the basis of their popularity, as evidenced by number of citations, and uniqueness of approach. In addition, we compare two versions of the miRDeep software, which is the most popular program for miRNA discovery in use today; in fact, two other detection programs (miRTools and miReNA) have also incorporated miRDeep as a component of their process [33]. A list of additional available software can be found in Table 1. The cell lines used in this study were randomly chosen and represent cells that might otherwise be regularly employed in any lab studying miRNA expression and disease. All data sets were generated by the same platform (Illumina), following the same sequencing format. The only difference between the data sets profiled in this article is their cell type and lab of origin. These data sets were used as an illustration of how the programs might perform given differing experimental conditions and cell types. The output of each of these programs was compared to determine consistency across algorithms, whereas inclusion of the simulated data set was used to assess each program's overall specificity and sensitivity.

METHODS
Different types of RNAseq software
The basic steps behind the analysis of any deep-sequencing data with regard to miRNA prediction can be summarized in three stages: (i) initial mapping of the read; (ii) expansion of the mapped locus to include flanking sequences; and (iii) evaluation of the expanded sequence on the basis of negative free energy and structure. Each stage in the process ultimately affects the final result, with perhaps the first being the most crucial of all. The programs available today vary in terms of the amount of user control over parameters and input and how they address each of these three stages [17, 20–24] (Table 2).
miRDeep (v1, v2) is a program that predicts the presence of miRNA from deep-sequencing data using Bayesian probabilities framed on the classic steps of miRNA biogenesis [17–19]. The pipeline first compares the reads to a target genome, and then evaluates the read's suitability on a thermodynamic scale. The algorithm assumes that if a read is related to miRNA, then it must either be a portion of a star, a loop sequence or a mature sequence. The read must demonstrate characteristics similar to already annotated examples, e.g. definite evidence of a 2 nt 3′ overhang. Also, miRDeep makes the assumption that because mature sequences tend to be more abundant in the cell than any other miRNA-related sequence, reads which conform structurally to 'mature sequences' will likewise be the most abundant in the data file. If a read meets structural criteria for being considered a mature sequence, and is found to be frequently represented in the data file, it receives a higher score than those that are less frequently found. miRDeep employs a flexible format, accommodating data generated by a 454 Life Sciences/Roche or an Illumina/Solexa sequencer [17]. Version 1 of miRDeep allows the user to control the mapping algorithm and the program choice for evaluation of free energy. Version 2 of miRDeep incorporates Bowtie and Randfold for these tasks [18, 19]. A key addition to version 2 of miRDeep has been the consideration of species conservation, e.g. a second set of miRNA
from a closely related species is required to be included in the prediction process [20, 21].
miRanalyzer is based on a random forest classifier and uses support vector machine (SVM) mechanics derived from experimental data to make its predictions [22, 23]. Version one of this software functioned as a web-based tool; version two is now available in a web-based and executable form. One benefit of using web-based applications is that they allow the user to analyze their results without having access to a large amount of computer resources. The first version of the software targeted seven model species (human, mouse, rat, fruit-fly, round worm, zebra fish and dog); newer versions of the program have incorporated plant genomes and predictions based on plant models [22, 23]. Like miRDeep2, miRanalyzer uses the program Bowtie to map input reads to the target genome. Apart from specifying the number of allowable mismatches and the acceptable P level for a credible prediction, however, the user is restricted from any other major changes in the algorithm.
The current version of deep-sequencing small RNA analysis pipeline (DSAP) differs from miRDeep or miRanalyzer in that it does not require a target genome; reads are, instead, clustered into unique groups and mapped onto the existing RNA families database (e.g. RFAM) and miRNA databases to determine status [24]. By eliminating the target genome, the program improves processing speeds considerably when compared with miRanalyzer and miRDeep; it is, however, restricted in its use by only being able to predict known miRNA signatures. Also, DSAP uses a different mapping algorithm, Supermatcher from the EMBOSS tool kit, to increase processing speed [25].

Table 2: Basic features of popular software used to predict miRNA from deep-sequencing data

Program: miRDeep/miRDeep2
  Accessible: Executable; requires in-house computational resources.
  Read pre-processing: Provides script that eliminates redundancy. Tag removal/processing must be done by user prior to analysis.
  Target genomes: Flexible, Human (GRCh37).
  Mapping algorithm: Flexible, Oligomap (v1); Bowtie (v2).
  Functions: Novel, known miRNA prediction. Status of predictions (novel/known) must be determined by the user.
  Predictions based on: Bayesian probability, focus on traditional steps of biogenesis.
  Location: http://www.mdc-berlin.de/en/research/research_teams/

Program: miRanalyzer
  Accessible: Web based
  Read pre-processing: Accepts two formats, multifasta and a file with read and counts. Tag must be removed by user.
  Target genomes: Seven genomes (human, fruit fly, rat, mouse, dog, nematode, and zebra fish), fixed choice over version.
  Mapping algorithm: Fixed, Bowtie. User can set the number of acceptable mismatches (<2).
  Functions: Novel, known miRNA prediction.
  Predictions based on: Reads are mapped against target genome, miRBase, and other non-coding databases. Posterior probability (threshold > 0.95).
  Location: http://web.bioinformatics.cicbiogune.es/microRNA/miRanalyser.php

Program: DSAP
  Accessible: Web based
  Read pre-processing: Accepts read/counts format like miRanalyzer. Adapter sequences can be left intact.
  Target genomes: Multiple genomes, fixed choice over version.
  Mapping algorithm: Fixed, cluster approach. Uses SuperMatcher to increase speed.
  Functions: Known miRNA prediction, species distribution, expression level.
  Predictions based on: Degree to which reads match known examples. Known miRNAs are compared to miRBase.
  Location: http://dsap.cgu.edu.tw/

Data sets used
Two types of data sets (experimental and simulated) were used in comparing the software performance of miRDeep, miRanalyzer and DSAP. The first experimental data set, derived from an in-house deep-sequencing experiment, profiled a neuroblastoma cell line (NB; ATCC: crl-2271). The remaining experimental data sets, representing a peripheral mononuclear blood cell line (PMBC), a chronic myelogenous leukemia cell line (K562), an acute promyelogenous leukemia cell line (HL60) and a breast cancer cell line, respectively [26, 34], were downloaded from the Gene Expression Omnibus (GSM 494809, 494810, 494811, 494812, 715665) and pre-cleaned of
universal adapters and redundant sequences. The neuroblastoma data set was prepared for analysis with Perl scripts written in-house. The simulated data set was created using Flux Simulator (http://flux.sammeth.net/) [35]. The parameters used by Flux Simulator to create the simulation can be found in Table 3. In addition, 100 known miRNAs (miRBase v16) were selected to 'spike in' the simulation at a prevalence of 0.1% in order to provide a metric against which a ROC curve could be built [27]. Only miRNAs which did not cluster together were selected in order to minimize the statistics inflation resulting from detecting miRNAs with similar sequence characteristics, e.g. from the same family. The ROC curve was based on the ability of each program to correctly identify these 'spiked in' examples as miRNA candidates.

Table 3: Parameters used by Flux Simulator to create the simulated NGS RNAseq data set

READ_LENGTH       35
TSS_MEAN          25
READ_NUMBER       5 000 000
NB_MOLECULES      5 000 000
GC_SD             0.1
GC_MEAN           0.5
SIZE_SAMPLING     AC
FRAG_SUBSTRATE    RNA
FRAG_METHOD       UR
FRAG_EZ_MOTIF     NlaIII
PAIRED_END        FALSE

In comparing the results from each prediction data set, we chose to work with only those reads that mapped perfectly to a specific locus (PM) and those reads with only one base mismatch (MM), with the aim of reducing the potential noise created by sequencing error. Known miRNAs were assessed in miRDeep (v1, v2), miRanalyzer and DSAP only. DSAP does not generate novel predictions and so could only be compared with miRDeep and miRanalyzer in terms of the known miRNA prediction [24]. Novel predictions which overlapped between miRDeep v1, v2 and miRanalyzer were experimentally validated using real-time PCR.

RESULTS
Uniquely mapped reads used to make predictions
When the percentage of reads used by the most current version of each program in all data sets was compared, DSAP and miRanalyzer appeared to retain the highest percentage (Figure 1). In contrast, surprisingly, miRDeep v2 appeared on average to utilize only 20% of its reads. This difference in the numbers of reads used was undoubtedly a result of the mapping algorithms applied by the respective programs and the lack of a target genome used by DSAP. The difference between miRanalyzer and miRDeep v2 may have been due to differences in parameters used to drive Bowtie. Under certain circumstances, a higher percentage of mapped reads may be preferable to the user as it indicates a larger portion of the available information utilized by the program.

Numbers of known miRNAs and novel candidate predictions
After adjusting for the prediction size, software comparison between miRDeep, miRDeep2, miRanalyzer and DSAP showed a >80% similarity of known miRNAs in each of the six biological data sets (Figure 2). In all cases, except the neuroblastoma data set and the simulated data set, miRDeep2 generated slightly higher numbers of known miRNAs, and the additional miRNAs identified were most often a miRNA from the same family and/or precursor sequence. In the case of the novel miRNA candidates, however, there was a lower percent overlap in the predictions, particularly between miRanalyzer and miRDeep/miRDeep2, suggesting that perhaps, in comparison to miRDeep, miRanalyzer is better suited to detect low-expressed candidates (Figure 3). As abundance is linked to detection in the miRDeep algorithm, novel candidates represented by low abundant reads may be excluded [36].

Differences in length of hairpin
Distinct differences were noted when the predicted novel miRNAs from miRanalyzer and miRDeep were compared in the data sets. On the whole, reflective of algorithm differences, the average hairpin length predicted by miRanalyzer was 20 bases longer than that of miRDeep. In both programs, hairpin length is set as an arbitrary number of bases flanking the mature sequence, which may be acceptable as hairpin length has been proven to be quite variable in both plants and animals [20–23]. In contrast, the length of the mature sequences varied little when each of the three data sets was analyzed with miRDeep, miRanalyzer and DSAP. This is unsurprising as the determination of the mature miRNA is based on the detected read. Curiously enough, though, we did observe variability in the 3′ end

Figure 1: Percentage of mapped reads. A usable read was defined as one which mapped uniquely to a specific locus.
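
The definition in the Figure 1 caption can be made concrete with a small calculation. The following is a minimal Python sketch, not the procedure used by any of the programs: it assumes alignments are available as hypothetical (read_id, locus) pairs and counts a read as usable only if it has exactly one distinct locus. The function name and the toy data are invented for illustration.

```python
from collections import defaultdict


def percent_uniquely_mapped(total_reads, alignments):
    """Return the percentage of input reads that map to exactly one locus.

    total_reads: number of reads given to the mapper (hypothetical input).
    alignments:  iterable of (read_id, locus) pairs, one per reported hit.
    """
    loci_per_read = defaultdict(set)
    for read_id, locus in alignments:
        loci_per_read[read_id].add(locus)
    # A usable read has exactly one distinct locus; unmapped reads never appear here.
    unique = sum(1 for loci in loci_per_read.values() if len(loci) == 1)
    return 100.0 * unique / total_reads if total_reads else 0.0


# Toy data: r1 and r3 map once, r2 maps twice, two further reads are unmapped.
alignments = [
    ("r1", ("chr1", 100, "+")),
    ("r2", ("chr1", 500, "+")),
    ("r2", ("chr7", 42, "-")),
    ("r3", ("chrX", 9, "+")),
]
print(percent_uniquely_mapped(5, alignments))  # 40.0
```

Reads that map to several loci and reads that do not map at all both lower the percentage, which is one reason mappers with different ambiguity handling can report different usable fractions.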

Figure 2: Total numbers of miRNAs detected by miRanalyzer, DSAP, miRDeep and miRDeep2 already identified in miRBase.
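
One way to compare sets of known miRNAs between programs is a pairwise overlap percentage. The sketch below is an assumption about how such a comparison could be set up, since the article does not give the formula behind its size adjustment: here the shared count is divided by the size of the smaller set. The function name and the identifiers in the toy data are hypothetical.

```python
from itertools import combinations


def overlap_percent(a, b):
    """Shared identifiers as a percentage of the smaller set (assumed normalization)."""
    smaller = min(len(a), len(b))
    return 100.0 * len(a & b) / smaller if smaller else 0.0


# Toy data with placeholder identifiers, not results from the study.
known = {
    "DSAP": {"known-1", "known-2", "known-3"},
    "miRDeep2": {"known-1", "known-2", "known-3", "known-4"},
    "miRanalyzer": {"known-1", "known-2", "known-5"},
}

for first, second in combinations(sorted(known), 2):
    print(first, second, round(overlap_percent(known[first], known[second]), 1))
```

Dividing by the smaller set keeps a program that reports many more miRNAs from depressing the score; other normalizations, such as the size of the union, would give lower values.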

(3 nt) of the mature sequences in the neuroblastoma data set when it was analyzed by miRDeep v1. This variability has not been seen in the miRNAs generated by miRDeep v2 and may have been unique to the software edition.

ROC curves for simulated data set
A non-redundant data set containing 733 494 reads was evaluated with miRDeep, miRDeep2, miRanalyzer and DSAP on the basis of their ability to correctly identify 100 known miRNAs that were

Figure 3: Total numbers of miRNAs detected by miRanalyzer, miRDeep and miRDeep2 not present in miRBase.
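
The intersection of novel predictions across programs, which the article recommends before committing to validation, can be sketched as an interval-overlap test. This is a minimal Python illustration under stated assumptions: each prediction is reduced to a hypothetical (chromosome, strand, start, end) tuple for the mature sequence, and a candidate is kept only if every other program reports an overlapping locus on the same strand. The function names and coordinates are invented, and none of the programs is claimed to export data in this form.

```python
def overlaps(a, b):
    """True if two (chrom, strand, start, end) loci share at least one base."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


def consensus(predictions):
    """Keep candidates from the first program that every other program supports.

    predictions: dict mapping program name -> list of (chrom, strand, start, end).
    """
    programs = list(predictions)
    reference, others = programs[0], programs[1:]
    return [
        candidate
        for candidate in predictions[reference]
        if all(any(overlaps(candidate, hit) for hit in predictions[p]) for p in others)
    ]


# Toy coordinates, not taken from the study.
novel = {
    "miRDeep": [("chr1", "+", 100, 121), ("chr2", "-", 500, 522)],
    "miRDeep2": [("chr1", "+", 101, 122)],
    "miRanalyzer": [("chr1", "+", 99, 120), ("chr2", "-", 500, 522)],
}
print(consensus(novel))  # [('chr1', '+', 100, 121)]
```

Requiring only a one-base overlap is a permissive choice; a stricter rule, such as identical 5′ ends, would shrink the consensus set further.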

'spiked in' randomly at a prevalence of 0.1%. Whether a prediction could be termed true or false was based on: (i) being predicted as miRNA and (ii) being mapped to the correct location. The ROC curves generated for the simulated data set showed miRDeep/miRDeep2 to demonstrate slightly better levels of specificity than miRanalyzer and DSAP. Based on the simulation data, accuracy levels for each test were calculated at 80.4 and 75.4% for miRDeep and miRDeep2, respectively. The accuracy level for miRanalyzer was 68.3% and the accuracy level for DSAP was 67.3% (Figure 4).

Figure 4: ROC curve generated using simulated data.

Experimental validation of overlapping novel predictions
To determine how effective the programs were at identifying novel miRNAs, we chose predictions that overlapped in each of the four programs from our neuroblastoma data set and validated the presence of these novel miRNAs with Taqman RT-PCR. Of the 16 that were identified as overlapping, 12 novel miRNAs were validated successfully. Six of these 12 novel miRNAs validated by us, however, have since been reported by other researchers. In comparing the Cq values of this group to ours, we noticed that in our sample the Cq values were in fact much lower. The differences may be reflective of cell line differences as these six previously reported miRNA were first identified in fibroblasts.
We also attempted to validate the precursor sequence associated with each predicted novel miRNA to determine the accuracy with which each program could predict precursor sequences. We tested both the precursors generated by miRanalyzer and miRDeep of the remaining six novel miRNAs and only two generated by miRDeep were verified (prd-mir-7, prd-mir-14). The hairpins predicted by miRDeep and miRanalyzer in many cases were discontinuous representations of each other. The predicted hairpin size varied when compared between miRanalyzer and miRDeep and this variability undoubtedly impacted our ability to verify efficiently
the novel precursor predictions using Taqman assays. Also, curiously enough, in the six novel miRNAs identified in our study but annotated by others, the coordinates of the respective precursors predicted by miRDeep/miRanalyzer and the annotated coordinates differed by over 35 000 bases. This discrepancy may have been due to cross mapping events, or more likely, is evidence of inaccurate precursor prediction [27, 28].
The Cq values in Table 4 suggest that the novel miRNA identified by this study are expressed at low levels, and therefore more difficult to detect by more traditional methods. The low expression level of these novel candidates is unsurprising. Given the amount of work that has been devoted toward the detection and identification of novel miRNA candidates in the last few years, it is unlikely that any new highly expressed candidates will be found.

Table 4: Average CQ values

Sample       average_CQ   average_CQ_precursor   annotated as (miRBase)
prd_mat-1    28.93        no amplification
prd_mat-2    35.00        no amplification       Hsa-mir-3660
prd_mat-3    36.03        no amplification       Hsa-mir-4428
prd_mat-5    29.89        no amplification
prd_mat-6    19.10        no amplification
prd_mat-7    26.91        31.53
prd_mat-8    34.10        no amplification
prd_mat-11   32.25        no amplification       Hsa-mir-3131
prd_mat-13   35.45        no amplification       Hsa-mir-4421
prd_mat-14   32.52        34.11
prd_mat-16   31.65        no amplification       Hsa-mir-2110
prd_mat-17   36.10        no amplification       Hsa-mir-4222

Twelve of the 17 overlapping novel candidates were validated with Taqman RT-PCR.

DISCUSSION
At first, we intended to use the additional programs (DSAP, miRanalyzer) to validate predictions generated by miRDeep; however, in comparing the output, we realized that the software dramatically affects the number and quality of predictions generated. Several conceptual questions arose from our comparison, particularly with regard to the determination of the hairpin sequence and the mapping algorithm used by each program. It has been suggested that certain programs vary in terms of their mapping accuracy of short reads (<35 bases) [31]. The size of the target as well as its apparent degree of complexity are both likely to impact the differences in miRNA prediction through the intermediate effect of mapped reads [31]. Mapping programs, such as Bowtie, have been cited to randomly assign reads to incorrect locations if there is ambiguity [31]. Each of the three programs employed a different approach to mapping, which may account for the differences in stringency. The effect of mapping technique can clearly be seen when miRDeep v1 is compared with miRDeep2 (Figure 5). When broken down into separate tasks, the amount of time spent by miRDeep v1 to map the reads to the target genome was 20% longer than that of miRDeep2. Further, phenomena such as cross mapping can serve to confuse the mapping of the read to the precursor [29, 30].
Until now, it has been difficult to compare the performance of these programs because of the lack of available data sets. We have not had the opportunity to observe the effect that the mapping algorithm might have on miRNA prediction. Now that there are enough data sets available for software testing, the tools with which we analyze these data sets can be refined even further and perfected. The choice of Bowtie was undoubtedly due to practical considerations; use of Bowtie does, in fact, speed up the process of analysis. On average, miRDeep v1 took three times as long to complete its analysis (10.5 h) compared with that of miRDeep2 (2.87 h) on a T5500 Dell workstation running Ubuntu 12.04 (Table 5; Figure 5). Also, web-based applications, such as DSAP and miRanalyzer, are difficult to benchmark as one's data is usually placed in a compute queue. The ability to facilitate increased speed may not always be advantageous as incorrect mapping may lead to increased false-positive findings.
Given the time and cost involved in validating predicted miRNA, however, it is prudent to use a consensus approach to miRNA prediction with an intersection of the mapping results from a number of different programs rather than results from one single program. An iterative-mapping profile could be generated from multiple programs that would enable the user to identify the reads that map to multiple locations and also those regions of the target genome that might be pre-disposed toward such activity. Reads with high-quality base scores throughout that map to the same single location regardless of program could then be carried forward to predict miRNAs. An additional step that might be useful to consider is the

Figure 5: Average compute time spent by miRDeep and miRDeep2 on analyzing data sets.
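
The averages behind Figure 5 can be recomputed from the per-data-set hours reported in Table 5. The sketch below is a worked check rather than the authors' own calculation; the dictionary layout and variable names are illustrative, while the numbers are copied from Table 5 in the order PMBC1, PMBC2, NB, K562, HL60, Breast Cancer, Simulated.

```python
from statistics import mean

# Hours per data set, copied from Table 5.
hours = {
    "miRanalyzer": [12, 7, 7, 5, 8, 5, 13],
    "miRDeep v1": [13, 8, 9, 13, 10, 8, 12],
    "miRDeep2": [3, 2.5, 4, 3, 2.5, 2, 3],
    "DSAP": [7, 5, 5, 7, 7, 8, 9],
}

averages = {program: mean(values) for program, values in hours.items()}
ratio = averages["miRDeep v1"] / averages["miRDeep2"]

for program, value in averages.items():
    print(f"{program}: {value:.2f} h")
print(f"miRDeep v1 / miRDeep2: {ratio:.2f}")
```

The means for miRDeep v1 and miRDeep2 come out at roughly 10.4 h and 2.9 h, close to the 10.5 h and 2.87 h quoted in the text, and the ratio is somewhat above three.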

accuracy of the reference genome itself in relation to the reads which are being mapped. Reference genomes, by virtue of their composition, vary considerably both in terms of quality and base accuracy [28]. Both miRDeep and miRanalyzer rely on a reference genome and generally exclude reads that do not map cleanly to the reference genome (less than two base mismatches). Base inaccuracies in the reference genome might inadvertently cause reads to be excluded by the software or, worse yet, to be incorrectly mapped altogether. We suggest that algorithms such as that employed by the program iCORN could effectively be implemented in the mapping process to increase the amount of potential information garnered in the analysis process. iCORN iteratively maps reads to the target genome, adjusting the sequence of the target genome if the mismatch is caused by a base with a good quality score and if the adjustment/read mapping would increase overall coverage [32]. To the best of our knowledge, none of the programs discussed in this article address the overall accuracy of the reference sequence used, and indeed, we ourselves did not take this into account when doing our analysis. Here, we only suggest this additional step as a way to further improve the accuracy of the mapping process.

Table 5: Calculation time in hours taken to complete analysis

Data set        miRanalyzer   miRDeepv1   miRDeepv2   DSAP
PMBC1           12            13          3           7
PMBC2           7             8           2.5         5
NB              7             9           4           5
K562            5             13          3           7
HL60            8             10          2.5         7
Breast Cancer   5             8           2           8
Simulated       13            12          3           9

One area in which miRDeep and miRanalyzer both demonstrate apparent weakness is lack of specificity to detect the precursor sequence. When examining the novel miRNAs predicted by
miRDeep and miRanalyzer, we detected two instances where precursors were predicted poorly in relation to the mature sequence. In the first, novel predictions that overlapped between miRanalyzer and miRDeep demonstrated discontinuous precursors. Each predicted precursor shared the mapped read but the boundaries of the predicted precursor varied. In the second instance, six novel miRNA candidates which had already been detected by other authors were predicted to map to loci entirely different from those previously reported. It is apparent that additional information is needed with respect to the precursor sequence itself before acceptable prediction parameters can be employed in detection software. Actual hairpin length varies from 60 to 120 nt in annotated examples. The current miRNA-based deep-sequencing methodology focuses solely on the mature sequence and the precursor prediction is generally a theoretical extraction based on the information provided by the mapped read. We recommend that more methods, both experimental (deep-sequencing data generation) and computational (addressing precursor sequence motifs and folding), be devised to resolve the apparent discrepancy in detection of miRNA precursors [31].

CONCLUSION
Deep sequencing does pose considerable computational and analytical challenges that must be overcome before it can become a fully realized form of analysis in miRNA research. Apart from the technical issues raised by different platforms, researchers also must be aware of the impact that their choice of program might have on their analysis. We believe that, for the moment, miRDeep represents the best solution for researchers looking for novel candidates to pursue as its stringency level reduces the number of false positives generated. Therefore, careful consideration of deep sequencing results is vital both with respect to the mature as well as the hairpin sequences. A large research effort, until now, has been predominantly devoted to detection of miRNA mature sequences, but not enough effort has been devoted to detection of their intermediate precursor forms. Added experimental information regarding the intermediate precursor can serve to help refine the detection process. The amount of information generated through these initial studies is now of sufficient size that a correct assessment can be made of the techniques used to generate it. A limitation of our study is indeed the number of data sets studied and the number of programs compared, but, nevertheless, it does suggest that caution is necessary when using this type of sequencing for miRNA prediction.

Key Points
• A number of programs are now available that can be used to predict miRNAs from RNAseq data sets.
• These programs vary in terms of the resources/skill needed to implement successfully.
• A comparison of three programs suggests that although similar groups are predicted, the programs varied in terms of predicted candidates.
• Despite an apparent high stringency, miRDeep appears to be the best algorithm for those researchers wishing to pursue novel miRNA for further experimentation as its design allows the researcher to address concerns such as mapping efficiency.

FUNDING
Stanley Medical Research Institute (#08R-1959, 2008) and the Jeffers Foundation (#J-1015, 2011) grants to V.V.

References
1. Krol J, Krzyzosiak WJ. Structural aspects of microRNA biogenesis. IUBMB Life 2004;56:95–100.
2. Li Y, Lin L, Jin P. The microRNA pathway and fragile X mental retardation protein. Biochim Biophys Acta 2008;1779:702–5.
3. Mencia A, Modamio-Hoybjor S, Redshaw N, et al. Mutations in the seed region of human miR-96 are responsible for nonsyndromic progressive hearing loss. Nat Genet 2009;41:609–13.
4. Scalbert E, Bril A. Implication of microRNAs in the cardiovascular system. Curr Opin Pharmacol 2008;8:181–8.
5. Oulas A, Boutla A, Gkirtzou K, et al. Prediction of novel microRNA genes in cancer-associated genomic regions - a combined computational and experimental approach. Nucleic Acids Res 2009;37:3276–87.
6. Saetrom P, Snove O, Nedland M, et al. Conserved MicroRNA characteristics in mammals. Oligonucleotides 2006;16:115–44.
7. Kim VN. MicroRNA biogenesis: coordinated cropping and dicing. Nat Rev Mol Cell Biol 2005;6:376–85.
8. Kim YK, Kim VN. Processing of intronic microRNAs. EMBO J 2007;26:775–83.
9. Rodriguez A, Griffiths-Jones S, Ashurst JL, et al. Identification of mammalian microRNA host genes and transcription units. Genome Res 2004;14:1902–10.
10. Erdmann VA, Szymanski M, Hochberg A, Groot N, Barciszewski J. Non-coding, mRNA-like RNAs database Y2K. Nucleic Acids Res 2000;28:197–200.
11. Sinha S, Vasulu TS, De RK. Performance and evaluation of MicroRNA gene identification tools. J Proteom Bioinform 2009;2:336–43.
12. Borchert GM, Lanier W, Davidson BL. RNA polymerase III transcribes human microRNAs. Nat Struct Mol Biol 2006;13:1097–101.
13. Jiang P, Wu H, Wang W, et al. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res 2007;35:W339–44.
14. Eaves HL, Gao Y. MOM: maximum oligonucleotide mapping. Bioinformatics 2009;25:969–70.
15. Berezikov E, Guryev V, van de Belt J, et al. Phylogenetic shadowing and computational identification of human microRNA genes. Cell 2005;120:21–4.
16. Mendes ND, Freitas AT, Sagot MF. Current tools for the identification of miRNA genes and their targets. Nucleic Acids Res 2009;37:2419–33.
17. Friedlander MR, Chen W, Adamidi C, et al. Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 2008;26:407–15.
18. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
19. Bonnet E, Wuyts J, Rouze P, Van de Peer Y. Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics 2004;20(17):2911–7.
20. Friedlander MR, Mackowiak SD, Li N, et al. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res 2012;40(1):37–52.
21. Mackowiak SD. Identification of novel and known miRNAs in deep-sequencing data with miRDeep2. In: Baxevanis AD, et al, (ed). Current Protocols in Bioinformatics. Chapter 12, Unit 12.10. John E Wiley and Sons, 2011.
22. Hackenberg M, Sturm M, Langenberger D, et al. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Res 2009;37:W68–76.
23. Hackenberg M, Rodriguez-Ezpeleta N, Aransay AM. miRanalyzer: an update on the detection and analysis of microRNAs in high-throughput sequencing experiments. Nucleic Acids Res 2011;39(Web Server issue):W132–8.
24. Huang P, Liu Y, Lee C, et al. DSAP: deep-sequencing small RNA analysis pipeline. Nucleic Acids Res 2010;38:W385–91.
25. Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet 2000;16(6):276–7.
26. Vaz C, Ahmad HM, Sharma P, et al. Analysis of microRNA transcriptome by deep sequencing of small RNA libraries of peripheral blood. BMC Genom 2010;11:288–306.
27. Griffiths-Jones S, Saini HK, van Dongen S, et al. miRBase: tools for microRNA genomics. Nucleic Acids Res 2008;36:D154–8.
28. Thakur V, Wanchana S, Xu M, et al. Characterization of statistical features for plant microRNA prediction. BMC Genom 2011;12:108–20.
29. de Hoon MJ, Taft RJ, Hashimoto T, et al. Cross-mapping and the identification of editing sites in mature microRNAs in high-throughput sequencing libraries. Genome Res 2010;20:257–64.
30. Guo L, Liang T, Gu W, et al. Cross-mapping events in miRNAs reveal potential miRNA-mimics and evolutionary implications. PloS One 2011;6:e20517–24.
31. Palmieri N, Schlotterer C. Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling. PloS One 2009;4:e6323–33.
32. Otto TD, Sanders M, Berriman M, Newbold C. Iterative correction of reference nucleotides (iCORN) using second generation sequencing technology. Bioinformatics 2010;26:1704–7.
33. Zhu E, Zhao F, Xu G, Hou H, Zhou L, Li X, et al. mirTools: microRNA profiling and discovery based on high-throughput sequencing. Nucleic Acids Res 2010;38(Web Server issue):W392–7.
34. Farazi TA, Horlings HM, Ten Hoeve JJ, et al. MicroRNA sequence and expression analysis in breast tumors by deep sequencing. Cancer Res 2011;71(13):4443–53.
35. Creighton CJ, Reid JG, Gunaratne PH. Expression profiling of microRNAs by deep sequencing. Brief Bioinform 2009;10(5):490–7.
36. Howard BE, Heber S. Towards reliable isoform quantification using RNA-SEQ data. BMC Bioinform 2010;11(Suppl 3):S6.
