miRDP2
Zheng Kuang
Xiaozeng Yang
Beijing Academy of Agriculture and
Forestry Sciences
The Peking University
Contact: miRDP2@sRNAworld.com
Contents
1. OVERVIEW
2. INSTALLATION
2.2 DOWNLOAD
7. REFERENCES
1. OVERVIEW
1.1 BACKGROUND
MicroRNAs (miRNAs) are ~21-nucleotide endogenous small RNAs (sRNAs) with potent roles in
regulating gene expression (Bartel, 2009). In the past two decades, extensive research efforts
have been devoted to identifying miRNAs and studying their functions, especially after next
generation sequencing (NGS) methods became available. miRNAs have unique features, such as a
stem-loop structure and preferential accumulation of sequence reads corresponding to mature and
star miRNAs, and computational tools that capture these characteristics have been highly
successful in identifying miRNAs in diverse species. The public miRNA repository miRBase
currently hosts over 38,000 miRNA entries (version 22), whereas only ~500 were stored in 2008
(version 2.0; Kozomara et al., 2014).
We previously developed miRDeep-P for miRNA prediction in plant species (Yang and Li,
2011). However, miRDeep-P has shown two major drawbacks when facing complicated input
datasets, which could limit its usefulness in plant miRNA prediction. One is the
long running time when working on complex genomes or libraries with high sequencing depth.
The other is the relatively large number of false positives mixed in with true miRNAs, which may
severely affect subsequent analysis.
To cope with these shortcomings, we have incorporated new plant miRNA annotation criteria
(Axtell and Meyers, 2018) and overhauled the strategies and algorithm of miRDeep-P, which
led to a significantly improved version, designated miRDeep-P2 (miRDP2). Compared to other
miRNA prediction tools, including MIReNA (Mathelier and Carbone, 2010), miRPlant (An et al.,
2014), miR-PREFeR (Lei and Sun, 2014), and miRA (Evers et al., 2015), miRDP2 performs better
in running time, sensitivity, and accuracy (details in manuscript and
supplementary materials).
1.2 SUMMARY OF MIRDP2 FUNCTION
Based on ultra-deep sampling of small RNA libraries by next generation sequencing, miRDP2 is
able to identify miRNA genes in plant species, even for those without detailed annotation, with
extremely high speed and reliable performance.
The basic algorithmic framework of miRDP2 is inherited from miRDeep-P (Yang and Li, 2011),
with several critical modifications and new auxiliary scripts added to the original
tool.
2. INSTALLATION
2.1 DEPENDENCIES
To run miRDP2, several dependencies are required.
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
2.2 DOWNLOAD
To install the miRDP2 package, simply download the two tar ball files from
https://sourceforge.net/projects/mirdp2/files/latest_version/ and extract all the contents into
one folder.
2.3 TEST
To test whether miRDP2 has been installed correctly, the user can run it on the test data with
the following commands. The test data and the expected output
can be found at https://sourceforge.net/projects/mirdp2/files/TestData/.
The test data contain one formatted GSM sequencing file and one Arabidopsis thaliana
genome file. To test miRDP2, please follow this guide (here we use bowtie for demonstration):
(add the option '-T' or '--bowtie2' if you prefer to use bowtie2 for read alignment)
The bowtie-build command may take a while, and the miRDP2 pipeline should finish within
several minutes. A folder named ‘GSM2094927-15-0-10’ should be generated automatically in
<user_selected_folder>, containing all intermediate files and results.
GSM2094927-15-0-10_filter_P_prediction is the final output of predicted miRNAs. This
tab-delimited file contains columns that indicate chromosome id, strand direction,
representative read id, precursor id, mature miRNA location, precursor location, mature
sequence, and precursor sequence. An additional bed file is derived from this file to
facilitate further analysis. The progress_log file provides information about finished steps.
The script_log and script_err files record information and warnings from the bash script. A
detailed explanation of the parameters is given in part 3.3.
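As an illustration of how the prediction file described above might be consumed downstream, here is a minimal Python sketch. The column names are hypothetical labels chosen to follow the column order listed in the text; they are not identifiers defined by miRDP2, and the exact layout of the location fields is not assumed.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Hypothetical column labels, in the order the manual lists the columns.
COLUMNS = [
    "chromosome_id",
    "strand",
    "representative_read_id",
    "precursor_id",
    "mature_location",
    "precursor_location",
    "mature_sequence",
    "precursor_sequence",
]


def read_predictions(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per row of a tab-delimited prediction file."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(COLUMNS):
            continue  # skip blank or incomplete lines
        yield dict(zip(COLUMNS, row))


if __name__ == "__main__":
    # Made-up row for illustration only; real files come from the pipeline.
    demo = io.StringIO("chrA\t+\tread0_x1\tpre1\tloc1\tloc2\tACGU\tACGUACGU\n")
    for record in read_predictions(demo):
        print(record["precursor_id"], record["mature_sequence"])
```

With a real run, the file handle would be opened on the *_filter_P_prediction file in the output folder instead of the in-memory demo string.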
>read0_x29909
TTTGGATTGAAGGGAGCTCTA
>read1_x36974
TTCCACAGCTTTCTTGAACTG
>read2_x32635
TTCCACAGCTTTCTTGAACTT
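The reads above look like a collapsed FASTA file in which each ID ends in `_x` followed by a number. The sketch below assumes that this number is the copy number of the read, as in miRDeep-style collapsed reads; the manual text shown here does not state this explicitly, so treat it as an assumption. The function name is illustrative and not part of miRDP2.

```python
import io
import re
from typing import Iterator, TextIO, Tuple

# Assumed ID layout: "<name>_x<copy number>", e.g. "read0_x29909".
ID_PATTERN = re.compile(r"^(?P<name>.+)_x(?P<count>\d+)$")


def parse_collapsed_fasta(handle: TextIO) -> Iterator[Tuple[str, int, str]]:
    """Yield (read_id, copy_number, sequence) for each two-line record."""
    read_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            match = ID_PATTERN.match(read_id)
            if match is None:
                raise ValueError(f"unexpected read ID: {read_id!r}")
            yield read_id, int(match.group("count")), line
            read_id = None


if __name__ == "__main__":
    example = io.StringIO(
        ">read0_x29909\nTTTGGATTGAAGGGAGCTCTA\n"
        ">read1_x36974\nTTCCACAGCTTTCTTGAACTG\n"
    )
    for read_id, count, seq in parse_collapsed_fasta(example):
        print(read_id, count, len(seq))
```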
An index of non-miRNA ncRNA sequences is also needed to filter out noisy reads derived from
ncRNA fragments. The file is a collection of the main ncRNA sequences from Rfam, including
rRNA, tRNA, snRNA, and snoRNA. To build this index, please refer to part 2.3; the index must be
placed and named correctly, i.e. <miRDP2_version>/script/index/rfam_index.
3.3 RUN MIRDP2
To use miRDP2 to detect new miRNAs from deep sequencing data, run the
bash script in the package to start the analysis pipeline (an example can be found in part 2.3):
<path_to_miRDP2_folder>/miRDP2-vx.x_pipeline.bash -g <genome_file> -x <path_to_index/index_prefix> -f -i <seq_file> -o <output_folder>
Four parameters can be adjusted: the number of different locations a read may map to (-L), the
number of mismatches allowed for bowtie (-M), the length of excised precursors (-N), and the
RPM threshold for reads (-R). A detailed explanation is in part 5.1.
The -T/--bowtie2 option can be used to switch to bowtie2 for read alignment. The --large-index option
should be
a. preprocess_reads.pl & preprocess_reads-SAM.pl filter the input reads, removing reads that are
too long or too short (<19 nt or >24 nt), reads that map to Rfam ncRNA sequences, and
reads with fewer than 5 Reads Per Million reads (RPM). The scripts then retrieve reads
that map to known mature miRNA sequences. The input files are the original reads in fasta
format and the bowtie output of reads mapped to miRNA and ncRNA sequences.
b. convert_bowtie_to_blast.pl & convert_SAM_to_blast.pl convert the bowtie format/SAM
format into blast-parsed format. Blast-parsed format is a custom tab-separated format
derived from the standard NCBI blast output format.
e. mod-miRDP.pl is modified from the core miRDeep-P algorithm by changing the scoring
system to use plant-specific parameters. It needs two input files: a dot-bracket precursor
structure file and a read distribution signature file.
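For readers unfamiliar with the dot-bracket notation mentioned in item e, the following Python sketch shows how base pairs can be recovered from such a string with a stack. It is a generic illustration of the notation, not code taken from mod-miRDP.pl, and the function name is hypothetical.

```python
from typing import List, Tuple


def base_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 0-based (i, j) pairs for matching brackets.

    '(' opens a pair, ')' closes the most recently opened one,
    and '.' marks an unpaired position.
    """
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    # A small hairpin: two stacked pairs closing a two-base loop.
    print(base_pairs("((..))"))
```

A stem-loop precursor appears as a run of opening brackets, a loop of dots, and a matching run of closing brackets; a bifurcated stem shows up as more than one such nested group.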
5.1 PARAMETERS
There are several parameters that can be custom-modified:
a. The first is the maximum number of locations a read may map to (-L/--locate option).
Reads that map to too many sites are possibly associated with repeat sequences and are
unlikely to be related to miRNAs. The default setting is 15.
For species that have miRNA families with many members, this parameter
may be increased manually to adapt to the genome landscape.
b. The second is the length of the putative miRNA precursors that the program excises
(-N/--length option). The default setting is 300 nt.
c. The third is the number of mismatches allowed for bowtie/bowtie2 (-M/--mismatch option). The
default setting is 0.
d. The fourth is the RPM threshold for reads (-R/--rpm option). To reduce running time
and false positives, reads are filtered by RPM. Only reads that exceed the RPM threshold
are likely to represent mature miRNA sequences rather than background noise, and these are
kept for further analysis. The default setting is 10 (RPM).
e. The fifth is the number of threads allowed for RNAfold (-p/--thread option). The default
setting is 1.
Please be aware that changing these parameters may affect performance and
running time. In general, increasing parameters a and c or decreasing parameter d
produces a less stringent result and a longer running time, and vice versa.
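The options above can be collected into a single command line. The Python sketch below only assembles the argument list; it does not run the pipeline. The option letters and default values are those listed in this manual, while the class and function names are illustrative, the file names in the demo are placeholders, and the script name keeps the manual's own "vx.x" placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Params:
    locate: int = 15       # -L: maximum number of locations a read may map to
    length: int = 300      # -N: length of excised precursors (nt)
    mismatch: int = 0      # -M: mismatches allowed for bowtie/bowtie2
    rpm: int = 10          # -R: RPM threshold for reads
    thread: int = 1        # -p: threads allowed for RNAfold
    bowtie2: bool = False  # -T: align with bowtie2 instead of bowtie


def build_command(script: str, genome: str, index_prefix: str, seq_file: str,
                  output_folder: str, params: Optional[Params] = None) -> List[str]:
    """Return the pipeline invocation as an argument list (nothing is executed)."""
    p = params if params is not None else Params()
    cmd = [
        script,
        "-g", genome,
        "-x", index_prefix,
        "-f",
        "-i", seq_file,
        "-o", output_folder,
        "-L", str(p.locate),
        "-N", str(p.length),
        "-M", str(p.mismatch),
        "-R", str(p.rpm),
        "-p", str(p.thread),
    ]
    if p.bowtie2:
        cmd.append("-T")
    return cmd


if __name__ == "__main__":
    # Placeholder paths; a looser setting for a genome with large miRNA families.
    print(" ".join(build_command("miRDP2-vx.x_pipeline.bash", "genome.fa",
                                 "index/genome", "reads.fa", "out",
                                 Params(locate=30))))
```

The resulting list could be passed to `subprocess.run` once the real script path and input files are filled in.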
5.3 LICENSE AND AVAILABILITY
miRDP2 is freely available under the GNU General Public License (Version 3) at:
http://sourceforge.net/projects/mirdp2/
The miRDP2 scripts, demos, and user manual can be obtained from the website.
First, we filtered out improper reads from the original small RNA libraries and employed a new
strategy to excise the precursors of miRNA candidates. Excising miRNA precursors
is one of the most time-consuming steps, and the new strategy dramatically reduces the time
this step takes. In addition, the new strategy improves
prediction accuracy by removing false positives. In detail, we first filtered out reads of
inappropriate length (<19 nt or >24 nt), since no plant miRNAs are shorter than 19 nt or longer
than 24 nt, as suggested by Axtell and Meyers (2018). In general, these reads account for
around one third of the total reads in a typical small RNA library. Second, we only used reads
that are either similar to known miRNAs (allowing 1 mismatch) or have a high copy number
(>=5 RPM) to excise miRNA precursor candidates. These reads make up less than 5% of the unique
reads in a typical small RNA library. Taken together, these improvements mean that far fewer,
more focused reads are used for excising precursors, which dramatically reduces
computational time (by tens to a hundred times). At the same time, noise caused by improper
reads is filtered out, which improves prediction accuracy.
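The read selection described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pipeline's implementation: similarity to known miRNAs is approximated here by a Hamming distance of at most 1 between equal-length sequences (the pipeline itself relies on bowtie alignments), RPM is computed against the total copy number of all reads passed in, and all function names are hypothetical.

```python
from typing import Dict, Iterable, Set

MIN_LEN, MAX_LEN = 19, 24  # length bounds stated in the text


def within_mismatches(a: str, b: str, max_mismatch: int) -> bool:
    """True if a and b have equal length and differ at no more than max_mismatch positions."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_mismatch


def select_reads(counts: Dict[str, int], known_mirnas: Iterable[str],
                 rpm_threshold: float = 5.0) -> Set[str]:
    """Keep reads of valid length that are abundant or close to a known miRNA."""
    total = sum(counts.values())
    if total == 0:
        return set()
    known = list(known_mirnas)
    kept: Set[str] = set()
    for seq, count in counts.items():
        if not MIN_LEN <= len(seq) <= MAX_LEN:
            continue  # drop reads shorter than 19 nt or longer than 24 nt
        rpm = count / total * 1_000_000
        if rpm >= rpm_threshold or any(within_mismatches(seq, k, 1) for k in known):
            kept.add(seq)
    return kept


if __name__ == "__main__":
    # Toy counts, for illustration only.
    counts = {"A" * 21: 999_989, "C" * 21: 4, "G" * 18: 6,
              "TTTGGATTGAAGGGAGCTCTT": 1}
    print(sorted(select_reads(counts, ["TTTGGATTGAAGGGAGCTCTA"])))
```

In the toy example, the low-abundance read survives only because it differs from the known sequence at a single position, which mirrors the "either similar to known miRNAs or high copy number" rule.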
Second, we have introduced the most up-to-date miRNA annotation criteria (Axtell and
Meyers, 2018) in miRDP2 and developed a new criterion for selecting candidates of 23/24
nt miRNAs. All details are in Supplementary Material 2. This update removed many
false positives and increased the prediction accuracy.
Lastly, we have modified the existing scoring system in the miRDeep-P core algorithm to better
fit plant miRNA characteristics (longer precursors and more complicated secondary
structures, as stated in Yang and Li, 2011). We have allowed longer precursors with bifurcation
in the stem-loop region, which are usually filtered out by other prediction tools, including
miRDeep-P.
Supplementary Material 4 shows two examples (Ath-MIR157c and Ath-MIR858). This change
has greatly increased the sensitivity of miRDeep-P2.
7. REFERENCES
An, J., Lai, J., Sajjanhar, A., Lehman, M.L. and Nelson, C.C. (2014) miRPlant: an integrated tool
for identification of plant miRNA from RNA sequencing data, BMC Bioinformatics, 15, 275-278.
Axtell, M.J. and Meyers, B.C. (2018) Revisiting Criteria for Plant MicroRNA Annotation in the Era
of Big Data, Plant Cell, 30, 272-284.
Bartel, D.P. (2009) MicroRNAs: Target Recognition and Regulatory Functions, Cell, 136, 215-233.
Evers, M., Huttner, M., Dueck, A., Meister, G. and Engelmann, J.C. (2015) miRA: adaptable
novel miRNA identification in plants using small RNA sequencing data, BMC Bioinformatics, 16, 370.
Fahlgren, N., et al. (2007) High-Throughput Sequencing of Arabidopsis microRNAs: Evidence for
Frequent Birth and Death of MIRNA Genes, PLoS One, 2, e219.
Friedlander, M.R., et al. (2008) Discovering microRNAs from deep sequencing data using
miRDeep, Nat Biotechnol, 26, 407-415.
Lei, J. and Sun, Y. (2014) miR-PREFeR: an accurate, fast and easy-to-use plant miRNA
prediction tool using small RNA-Seq data, Bioinformatics, 30, 2837-2839.
Mathelier, A. and Carbone, A. (2010) MIReNA: finding microRNAs with high accuracy and no
learning at genome scale and from deep sequencing data, Bioinformatics, 26, 2226-2234.
Meyers, B.C., et al. (2008) Criteria for annotation of plant MicroRNAs, Plant Cell, 20, 3186-3190.
Wark, A.W., Lee, H.J. and Corn, R.M. (2008) Multiplexed detection methods for profiling
microRNA expression in biological samples, Angew Chem Int Ed Engl, 47, 644-652.
Yang, X. and Li, L. (2011) miRDeep-P: a computational tool for analyzing the microRNA
transcriptome in plants, Bioinformatics, 27, 2614-2615.
Yang, X., Zhang, H. and Li, L. (2011) Global analysis of gene-level microRNA expression in
Arabidopsis using deep sequencing data, Genomics, 98, 40-46.
Zhu, Q.H., et al. (2008) A diverse set of microRNAs and microRNA-like small RNAs in developing
rice grains, Genome Res, 18, 1456-1465.