1093/bib/bbs010
Advance Access published on 24 March 2012
Abstract
Deep sequencing has become a popular tool for novel miRNA detection, but its data must be viewed carefully as the state of the field is still undeveloped. Using three different programs, miRDeep (v1, v2), miRanalyzer and DSAP, we have analyzed seven data sets (six biological and one simulated) to provide a critical evaluation of the programs’ performance. We selected these programs based on their popularity and overall approach toward the detection of novel and known miRNAs using deep-sequencing data. The program comparisons suggest that, despite differing stringency levels, they all identify a similar set of known and novel predictions. Comparisons between the first and second versions of miRDeep suggest that the stringency level of each of these programs may, in fact, be a result of the algorithm used to map the reads to the target. Different stringency levels are likely to affect the number of possible novel candidates for functional verification, causing undue strain on resources and time. With that in mind, we propose that an intersection across multiple programs be taken, especially if considering novel candidates that will be targeted for additional analysis. Using this approach, we identified and performed initial validation of 12 novel predictions in our in-house data with real-time PCR, six of which have been previously unreported.
Keywords: deep sequencing; software; miRNA detection; comparison
Corresponding authors: Vernell Williamson and Vladimir Vladimirov, Department of Psychiatry, Virginia Institute for
Psychiatric and Behavioral Genetics, Medical College of Virginia of Virginia Commonwealth University, Richmond, VA, USA.
Tel: +1 804 628 7607; Fax: +1 804 828 1471; E-mail: vswilliamson@vcu.edu; vivladimirov@vcu.edu
Vernell Williamson is a PhD student in Integrative Life Sciences at Virginia Commonwealth University. Her research interests
include detection of novel miRNA and integration of large data sets.
Albert Kim received his PhD in June, 2011 from Virginia Commonwealth University. He is currently finishing his medical training
with the same university.
Bin Xie is affiliated with the division of Genomics, Epigenomics and Bioinformatics, Lieber Institute for Brain Development,
Baltimore, Maryland, USA.
Omari McMichael is a laboratory specialist at the Virginia Institute of Psychiatric and Behavioral Genetics.
Yuan Gao is affiliated with Division of Genomics, Epigenomics, and Bioinformatics, Lieber Institute for Brain Development,
Baltimore, Maryland, USA.
Vladimir Vladimirov is an assistant professor affiliated with Virginia Institute for Psychiatric and Behavioral Genetics and School of
Pharmacy at Virginia Commonwealth University. His research interests include studying miRNA, gene expression and epigenetics of
psychiatric disorders.
© The Author 2012. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com
Detecting miRNAs in deep-sequencing data 37
Table 1: Other programs that may be used to predict miRNAs from deep-sequencing data
generation and specialized skills required to adequately analyze and interpret data. Researchers interested in using deep-sequencing techniques for miRNA discovery are often confronted with a collection of bioinformatic algorithms and approaches with very little information on software comparability and performance. Here, we review three programs, which use three different approaches for miRNA detection in deep-sequencing data. We use experimental and simulated data to highlight some of their features and characteristics. The programs profiled in this article were chosen on the basis of their popularity, as evidenced by number of citations, and uniqueness of approach. In addition, we compare two versions of the miRDeep software, which is the most popular program for miRNA discovery in use today; in fact, two other detection programs (miRTools and miReNA) have incorporated miRDeep as a component of their process [33]. A list of additional available software can be found in Table 1. The cell lines used in this study were randomly chosen and represent cells that might otherwise be regularly employed in any lab studying miRNA expression and disease. All data sets were generated by the same platform (Illumina), following the same sequencing format. The only difference between the data sets profiled in this article is their cell type and lab of origin. These data sets were used as an illustration of how the programs might perform given differing experimental conditions and cell types. The output of each of these programs was compared to determine consistency across algorithms, whereas the simulated data set was used to assess each program’s overall specificity and sensitivity.
METHODS
Different types of RNAseq software
The basic steps behind the analysis of any deep-sequencing data with regard to miRNA prediction can be summarized in three stages: (i) initial mapping of the read; (ii) expansion of the mapped locus to include flanking sequences; and (iii) evaluation of the expanded sequence on the basis of negative free energy and structure. Each stage in the process ultimately affects the final result, with perhaps the first being the most crucial of all. The programs available today vary in terms of the amount of user control over parameters and input and how they address each of these three stages [17, 20–24] (Table 2).
miRDeep (v1, v2) is a program that predicts the presence of miRNA from deep-sequencing data using Bayesian probabilities framed on the classic steps of miRNA biogenesis [17–19]. The pipeline first compares the reads to a target genome, and then evaluates the read’s suitability on a thermodynamic scale. The algorithm assumes that if a read is related to miRNA, then it must be a portion of a star, a loop or a mature sequence. The read must demonstrate characteristics similar to already annotated examples, e.g. definite evidence of a 2 nt 3′ overhang. Also, miRDeep assumes that because mature sequences tend to be more abundant in the cell than any other miRNA-related sequence, reads which conform structurally to ‘mature sequences’ will likewise be the most abundant in the data file. If a read meets structural criteria for being considered a mature sequence and is frequently represented in the data file, it receives a higher score than reads that are less frequently found. miRDeep employs a flexible format, accommodating data generated by a 454 Life Sciences/Roche or an Illumina/Solexa sequencer [17]. Version 1 of miRDeep allows the user to control the mapping algorithm and the program choice for evaluation of free energy. Version 2 of miRDeep incorporates Bowtie and Randfold for these tasks [18, 19]. A key addition to version 2 of miRDeep has been the consideration of species conservation, e.g. a second set of miRNA
38 Williamson et al.
from a closely related species is required to be included in the prediction process [20, 21].
miRanalyzer is based on a random forest classifier and uses support vector machine (SVM) mechanics derived from experimental data to make its predictions [22, 23]. Version one of this software functioned as a web-based tool; version two now is [...] access to a large amount of computer resources. The first version of the software targeted seven model [...] based on plant models [22, 23]. Like miRDeep2, miRanalyzer uses the program Bowtie to map input reads to the target genome. Apart from specifying the number of allowable mismatches, and the [...]
Table 2 (rotated table flattened in extraction; surviving cell fragments in extracted order, row assignment not recoverable)
Program: miRDeep/miRDeep2; miRanalyzer; DSAP
URL fragments: research/research_; http://web.bioinformatics.cicbiogune.es/microRNA/; berlin.de/en/; .tw/
Fragments: focus on traditional; Posterior probability; to miRBase
Functions: Novel, Known miRNA prediction. Status of; Novel, known miRNA; in the algorithm.; prediction.
Fragments: level; increase speed; Flexible, Human; over version.
Fragments: miRAnalyzer. Adapter; must be done by user; must be removed by; Read pre-processing; Accepts read/counts; prior to analysis.; tional resources.
Fragments: Web-based; Web based
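The three stages named in the Methods text (mapping, expansion of the mapped locus, structural evaluation) can be illustrated with a toy sketch. This is not how miRDeep, miRanalyzer or DSAP are implemented: exact string matching stands in for a real aligner such as Bowtie, and a simple mirrored base-pairing fraction stands in for a free-energy calculation. All function names, the `flank` value and the `min_pairing` cutoff are hypothetical.

```python
from typing import List, Optional, Tuple

# Watson-Crick pairs plus the G-U wobble pair seen in RNA hairpins.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def map_read(read: str, genome: str) -> List[int]:
    """Stage (i): return every 0-based start where the read matches exactly."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits


def expand_locus(genome: str, start: int, read_len: int, flank: int) -> Tuple[int, int]:
    """Stage (ii): widen the mapped locus by `flank` bases, clipped to the genome."""
    return max(0, start - flank), min(len(genome), start + read_len + flank)


def pairing_fraction(seq: str) -> float:
    """Stage (iii), toy stand-in: fraction of positions in the 5' half that can
    pair with the mirrored position in the 3' half. NOT a free-energy value."""
    half = len(seq) // 2
    if half == 0:
        return 0.0
    paired = sum((seq[i], seq[-1 - i]) in PAIRS for i in range(half))
    return paired / half


def candidate_precursor(read: str, genome: str, flank: int = 10,
                        min_pairing: float = 0.7) -> Optional[str]:
    """Run the three stages; return the expanded window or None if rejected."""
    hits = map_read(read, genome)
    if len(hits) != 1:  # keep only reads that map to exactly one locus
        return None
    lo, hi = expand_locus(genome, hits[0], len(read), flank)
    window = genome[lo:hi]
    return window if pairing_fraction(window) >= min_pairing else None
```

Because every later stage only sees loci that stage (i) reports, a change in the mapping rule (for example, allowing mismatches or multi-mapping reads) changes the candidate set before any structural filter is applied, which is consistent with the text's remark that the first stage is perhaps the most crucial.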
Table 3: Parameters used by Flux Simulator to create the simulated NGS RNAseq data set
Parameter         Value
READ_LENGTH       35
TSS_MEAN          25
READ_NUMBER       5 000 000
NB_MOLECULES      5 000 000
GC_SD             0.1
GC_MEAN           0.5
SIZE_SAMPLING     AC
FRAG_SUBSTRATE    RNA
FRAG_METHOD       UR
FRAG_EZ_MOTIF     NlaIII
PAIRED_END        FALSE
percentage (Figure 1). In contrast, surprisingly, miRDeep v2 appeared on average to utilize only 20% of its reads. This difference in the numbers of reads used was undoubtedly a result of the mapping algorithms applied by the respective programs and the lack of a target genome used by DSAP. The difference between miRanalyzer and miRDeep v2 may have been due to differences in parameters used to drive Bowtie. Under certain circumstances, a higher percentage of mapped reads may be preferable to the user as it indicates that a larger portion of the available information is utilized by the program.
Figure 1: Percentage of mapped reads. A usable read was defined as one which mapped uniquely to a specific locus.
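The definition in the Figure 1 caption (a usable read is one that maps uniquely to a specific locus) can be expressed as a short calculation. The sketch below is illustrative only: the function name, the `(read_id, locus)` input layout and the locus strings are assumptions, not the output format of any of the programs compared.

```python
from collections import Counter
from typing import Iterable, Tuple


def percent_usable_reads(alignments: Iterable[Tuple[str, str]], total_reads: int) -> float:
    """Percentage of submitted reads that have exactly one reported alignment.

    alignments: (read_id, locus) pairs, one per reported alignment.
    total_reads: number of reads submitted, including unmapped ones.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    hits_per_read = Counter(read_id for read_id, _ in alignments)
    unique = sum(1 for n in hits_per_read.values() if n == 1)
    return 100.0 * unique / total_reads


# Hypothetical example: r2 maps to two loci, two further reads do not map at all.
example = [("r1", "chr1:100"), ("r2", "chr1:200"), ("r2", "chr5:50"), ("r3", "chr2:10")]
usable = percent_usable_reads(example, total_reads=5)
```

Under this definition, multi-mapping and unmapped reads both lower the percentage, so two programs given the same input can report different values purely because of how their mappers treat ambiguous reads.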
Figure 2: Total numbers of miRNAs detected by miRanalyzer, DSAP, miRDeep and miRDeep2 already identified in miRBase.
(3 nt) of the mature sequences in the neuroblastoma data set when it was analyzed by miRDeep v1. This variability has not been seen in the miRNAs generated by miRDeep v2 and may have been unique to the software edition.
ROC curves for simulated data set
A non-redundant data set containing 733 494 reads was evaluated with miRDeep, miRDeep2, miRanalyzer and DSAP on the basis of their ability to correctly identify 100 known miRNAs that was
Figure 3: Total numbers of miRNAs detected by miRanalyzer, miRDeep and miRDeep2 not present in MiRBase.
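For the simulated data set, where the true miRNAs are known in advance, the points of an ROC curve can be computed directly from each program's scored candidates. The following is a minimal sketch under stated assumptions: the function name and the dictionary layout are hypothetical, every scored candidate outside the known set is treated as a negative, and the scores in the example are invented for illustration.

```python
from typing import Dict, List, Set, Tuple


def roc_points(scores: Dict[str, float], positives: Set[str]) -> List[Tuple[float, float]]:
    """Return (false-positive rate, true-positive rate) for each score cutoff.

    scores: candidate id -> program score (higher means more confident).
    positives: ids of candidates that are true miRNAs in the simulation.
    Every scored candidate not in `positives` is treated as a negative.
    """
    n_pos = sum(1 for c in scores if c in positives)
    n_neg = len(scores) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative candidate")
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores.values()), reverse=True):
        called = {c for c, s in scores.items() if s >= cutoff}
        tp = len(called & positives)
        fp = len(called) - tp
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for four candidates, two of which are true miRNAs.
demo = roc_points({"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.1}, positives={"a", "c"})
```

Sensitivity at a given cutoff is the true-positive rate, and specificity is one minus the false-positive rate, so each returned point corresponds to one sensitivity/specificity pair.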
Table 4: Average CQ values
Sample        average_CQ    average_CQ_precursor    annotated as (miRBase)
prd_mat-1     28.93         no amplification
prd_mat-2     35.00         no amplification        Hsa-mir-3660
prd_mat-3     36.03         no amplification        Hsa-mir-4428
prd_mat-5     29.89         no amplification
prd_mat-6     19.10         no amplification
prd_mat-7     26.91         31.53
prd_mat-8     34.10         no amplification
prd_mat-11    32.25         no amplification        Hsa-mir-3131
prd_mat-13    35.45         no amplification        Hsa-mir-4421
prd_mat-14    32.52         34.11
prd_mat-16    31.65         no amplification        Hsa-mir-2110
prd_mat-17    36.10         no amplification        Hsa-mir-4222
Twelve of the 17 overlapping novel candidates were validated with Taqman RT-PCR.
the novel precursor predictions using Taqman assays. Also, curiously enough, in the six novel miRNAs identified in our study but annotated by others, the coordinates of the respective precursors predicted by miRDeep/miRanalyzer and the annotated coordinates differed by over 35 000 bases. This discrepancy may have been due to cross mapping events, or more likely, is evidence of inaccurate precursor prediction [27, 28].
The Cq values in Table 4 suggest that the novel miRNA identified by this study are expressed at low levels, and therefore more difficult to detect by more traditional methods. The low expression level of these novel candidates is unsurprising. Given the amount of work that has been devoted toward the detection and identification of novel miRNA candidates in the last few years, it is unlikely that any new highly expressed candidates will be found.
DISCUSSION
At first, we intended to use the additional programs (DSAP, miRanalyzer) to validate predictions generated by miRDeep; however, in comparing the output, we realized that the software dramatically affects the number and quality of predictions generated. Several conceptual questions arose from our comparison, particularly with regard to the determination of the hairpin sequence and the mapping algorithm used by each program. It has been suggested that certain programs vary in terms of their mapping accuracy of short reads (<35 bases) [31]. The size of the target as well as its apparent degree of complexity are both likely to impact the differences in miRNA prediction through the intermediate effect of mapped reads [31]. Mapping programs, such as Bowtie, have been reported to randomly assign reads to incorrect locations if there is ambiguity [31]. Each of the three programs employed a different approach to mapping, which may account for the differences in stringency. The effect of mapping technique can clearly be seen when miRDeep v1 is compared with miRDeep2 (Figure 5). When broken down into separate tasks, the amount of time spent by miRDeep v1 to map the reads to the target genome was 20% longer than that of miRDeep2. Further, phenomena such as cross mapping can serve to confuse the mapping of the read to the precursor [29, 30].
Until now, it has been difficult to compare the performance of these programs because of the lack of available data sets. We have not had the opportunity to observe the effect that mapping algorithm might have on miRNA prediction. Now that there are enough data sets available for software testing, the tools with which we analyze these data sets can be refined even further and perfected. The choice of Bowtie was undoubtedly due to practical considerations; use of Bowtie does, in fact, speed up the process of analysis. On average, miRDeep v1 took more than three times as long to complete its analysis (10.5 h) compared with miRDeep2 (2.87 h) on a T5500 Dell workstation running Ubuntu 12.04 (Table 5; Figure 5). Also, web-based applications, such as DSAP and miRanalyzer, are difficult to benchmark as one’s data is usually placed in a compute queue. The ability to facilitate increased speed may not always be advantageous as incorrect mapping may lead to increased false-positive findings.
Given the time and cost involved in validating predicted miRNA, however, it is prudent to use a consensus approach to miRNA prediction with an intersection of the mapping results from a number of different programs rather than results from one single program. An iterative-mapping profile could be generated from multiple programs that would enable the user to identify the reads that map to multiple locations and also those regions of the target genome that might be predisposed toward such activity. Reads with high-quality base scores throughout that map to the same single location regardless of program could then be carried forward to predict miRNAs. An additional step that might be useful to consider is the
Figure 5: Average compute time spent by miRDeep and miRDeep2 on analyzing data sets.
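The consensus approach recommended in the Discussion (taking the intersection of predictions across programs before committing to validation) amounts to a set operation. The sketch below assumes, for illustration only, that each program's novel predictions have been reduced to a set of comparable identifiers such as mature sequences; the function name, the `min_programs` option and the placeholder identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Set


def consensus_candidates(predictions: Dict[str, Set[str]],
                         min_programs: Optional[int] = None) -> Set[str]:
    """Return candidates reported by at least `min_programs` programs.

    predictions: program name -> set of candidate identifiers.
    min_programs: None (default) requires every program to agree,
    i.e. a strict intersection.
    """
    if not predictions:
        return set()
    required = len(predictions) if min_programs is None else min_programs
    counts = Counter(c for calls in predictions.values() for c in set(calls))
    return {c for c, n in counts.items() if n >= required}


# Placeholder identifiers; real input would come from each program's output.
calls = {
    "miRDeep2": {"cand_1", "cand_2"},
    "miRanalyzer": {"cand_1", "cand_3"},
    "DSAP": {"cand_1", "cand_2"},
}
strict = consensus_candidates(calls)
relaxed = consensus_candidates(calls, min_programs=2)
```

A strict intersection gives the shortest list for follow-up experiments; relaxing `min_programs` trades some of that stringency for candidates that only a subset of programs report. Matching on exact identifiers is a simplification, since the text notes that predicted mature sequences and precursor boundaries can differ between programs.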
miRDeep and miRanalyzer, we detected two instances where precursors were predicted poorly in relation to the mature sequence. In the first, novel predictions that overlapped between miRanalyzer and miRDeep demonstrated discontinuous precursors. Each predicted precursor shared the mapped read but the boundaries of the predicted precursor varied. In the second instance, six novel miRNA candidates which had already been detected by other authors were predicted to map to loci entirely different from those previously reported. It is apparent that additional information is needed with respect to the precursor sequence itself before acceptable prediction parameters can be employed in detection software. Actual hairpin length varies from 60 to 120 nt in annotated examples. The current miRNA-based deep-sequencing methodology focuses solely on the mature sequence and the precursor prediction is generally a theoretical extraction based on the information provided by the mapped read. We recommend that more methods, both experimental (deep-sequencing data generation) and computational (addressing precursor sequence motifs and folding), be devised to resolve the apparent discrepancy in detection of miRNA precursors [31].
CONCLUSION
Deep sequencing does pose considerable computational and analytical challenges that must be overcome before it can become a fully realized form of analysis in miRNA research. Apart from the technical issues raised by different platforms, researchers also must be aware of the impact that their choice of program might have on their analysis. We believe that, for the moment, miRDeep represents the best solution for researchers looking for novel candidates to pursue as its stringency level reduces the number of false positives generated. Therefore, careful consideration of deep sequencing results is vital with respect to both the mature and the hairpin sequences. A large research effort, until now, has been predominantly devoted to detection of miRNA mature sequences, but not enough effort has been devoted to detection of their intermediate precursor forms. Added experimental information regarding the intermediate precursor can serve to help refine the detection process. The amount of information generated through these initial studies is now of sufficient size that a correct assessment can be made of the techniques used to generate it.
A limitation of our study is indeed the number of data sets studied and the number of programs compared, but, nevertheless, it does suggest that caution is necessary when using this type of sequencing for miRNA prediction.
Key Points
A number of programs are now available that can be used to predict miRNAs from RNAseq data sets.
These programs vary in terms of the resources/skill needed to implement successfully.
A comparison of three programs suggests that although similar groups are predicted, the programs varied in terms of predicted candidates.
Despite an apparent high stringency, miRDeep appears to be the best algorithm for those researchers wishing to pursue novel miRNA for further experimentation as its design allows the researcher to address concerns such as mapping efficiency.
FUNDING
Stanley Medical Research Institute (#08R-1959, 2008) and the Jeffers Foundation (#J-1015, 2011) grants to V.V.
References
1. Krol J, Krzyzosiak WJ. Structural aspects of microRNA biogenesis. IUBMB Life 2004;56:95–100.
2. Li Y, Lin L, Jin P. The microRNA pathway and fragile X mental retardation protein. Biochim Biophys Acta 2008;1779:702–5.
3. Mencia A, Modamio-Hoybjor S, Redshaw N, et al. Mutations in the seed region of human miR-96 are responsible for nonsyndromic progressive hearing loss. Nat Genet 2009;41:609–13.
4. Scalbert E, Bril A. Implication of microRNAs in the cardiovascular system. Curr Opin Pharmacol 2008;8:181–8.
5. Oulas A, Boutla A, Gkirtzou K, et al. Prediction of novel microRNA genes in cancer-associated genomic regions - a combined computational and experimental approach. Nucleic Acids Res 2009;37:3276–87.
6. Saetrom P, Snove O, Nedland M, et al. Conserved MicroRNA characteristics in mammals. Oligonucleotides 2006;16:115–44.
7. Kim VN. MicroRNA biogenesis: coordinated cropping and dicing. Nat Rev Mol Cell Biol 2005;6:376–85.
8. Kim YK, Kim VN. Processing of intronic microRNAs. EMBO J 2007;26:775–83.
9. Rodriguez A, Griffiths-Jones S, Ashurst JL, et al. Identification of mammalian microRNA host genes and transcription units. Genome Res 2004;14:1902–10.
10. Erdmann VA, Szymanski M, Hochberg A, Groot N, Barciszewski J. Non-coding, mRNA-like RNAs database Y2K. Nucleic Acids Res 2000;28:197–200.
11. Sinha S, Vasulu TS, De RK. Performance and evaluation of MicroRNA gene identification tools. J Proteom Bioinform 2009;2:336–43.
12. Borchert GM, Lanier W, Davidson BL. RNA polymerase III transcribes human microRNAs. Nat Struct Mol Biol 2006;13:1097–101.
13. Jiang P, Wu H, Wang W, et al. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res 2007;35:W339–44.
14. Eaves HL, Gao Y. MOM: maximum oligonucleotide mapping. Bioinformatics 2009;25:969–70.
15. Berezikov E, Guryev V, van de Belt J, et al. Phylogenetic shadowing and computational identification of human microRNA genes. Cell 2005;120:21–4.
16. Mendes ND, Freitas AT, Sagot MF. Current tools for the identification of miRNA genes and their targets. Nucleic Acids Res 2009;37:2419–33.
17. Friedlander MR, Chen W, Adamidi C, et al. Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 2008;26:407–15.
18. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
19. Bonnet E, Wuyts J, Rouze P, Van de Peer Y. Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics 2004;20(17):2911–7.
20. Friedlander MR, Mackowiak SD, Li N, et al. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res 2012;40(1):37–52.
21. Mackowiak SD. Identification of novel and known miRNAs in deep-sequencing data with miRDeep2. In: Baxevanis AD, et al, (ed). Current Protocols in Bioinformatics. Chapter 12, Unit 12.10. John E Wiley and Sons, 2011.
22. Hackenberg M, Sturm M, Langenberger D, et al. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Res 2009;37:W68–76.
23. Hackenberg M, Rodriguez-Ezpeleta N, Aransay AM. miRanalyzer: an update on the detection and analysis of microRNAs in high-throughput sequencing experiments. Nucleic Acids Res 2011;39(Web Server issue):W132–8.
24. Huang P, Liu Y, Lee C, et al. DSAP: deep-sequencing small RNA analysis pipeline. Nucleic Acids Res 2010;38:W385–91.
25. Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet 2000;16(6):276–7.
26. Vaz C, Ahmad HM, Sharma P, et al. Analysis of microRNA transcriptome by deep sequencing of small RNA libraries of peripheral blood. BMC Genom 2010;11:288–306.
27. Griffiths-Jones S, Saini HK, van Dongen S, et al. miRBase: tools for microRNA genomics. Nucleic Acids Res 2008;36:D154–8.
28. Thakur V, Wanchana S, Xu M, et al. Characterization of statistical features for plant microRNA prediction. BMC Genom 2011;12:108–20.
29. de Hoon MJ, Taft RJ, Hashimoto T, et al. Cross-mapping and the identification of editing sites in mature microRNAs in high-throughput sequencing libraries. Genome Res 2010;20:257–64.
30. Guo L, Liang T, Gu W, et al. Cross-mapping events in miRNAs reveal potential miRNA-mimics and evolutionary implications. PloS One 2011;6:e20517–24.
31. Palmieri N, Schlotterer C. Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling. PloS One 2009;4:e6323–33.
32. Otto TD, Sanders M, Berriman M, Newbold C. Iterative correction of reference nucleotides (iCORN) using second generation sequencing technology. Bioinformatics 2010;26:1704–7.
33. Zhu E, Zhao F, Xu G, Hou H, Zhou L, Li X, et al. mirTools: microRNA profiling and discovery based on high-throughput sequencing. Nucleic Acids Res 2010;38(Web Server issue):W392–7.
34. Farazi TA, Horlings HM, Ten Hoeve JJ, et al. MicroRNA sequence and expression analysis in breast tumors by deep sequencing. Cancer Res 2011;71(13):4443–53.
35. Creighton CJ, Reid JG, Gunaratne PH. Expression profiling of microRNAs by deep sequencing. Brief Bioinform 2009;10(5):490–7.
36. Howard BE, Heber S. Towards reliable isoform quantification using RNA-SEQ data. BMC Bioinform 2010;11(Suppl 3):S6.