
A user’s manual for

miRDP2

Last updated: 21 April 2020


Current as of miRDP2 version 1.1.4

Zheng Kuang
Xiaozeng Yang
Beijing Academy of Agriculture and
Forestry Sciences
Peking University
Contact: miRDP2@sRNAworld.com

Contents
1. OVERVIEW
1.1 BACKGROUND
1.2 SUMMARY OF MIRDP2 FUNCTION
1.3 IMPLEMENTATION AND ALGORITHM
2. INSTALLATION
2.1 DEPENDENCIES
2.2 DOWNLOAD
2.3 TEST
3. DETECTING NEW MIRNAS
3.1 FORMATTING READS
3.2 BUILD INDEX
3.3 RUN MIRDP2
3.4 MIRDP2 OUTPUT
4. THE CONTENTS OF MIRDP2 SOFTWARE PACKAGE
5. ISSUES USING MIRDP2
5.1 PARAMETERS
5.2 REDUNDANCY AND MIRNA*
5.3 LICENSE AND AVAILABILITY
6. APPENDIX - MAJOR UPDATES OF MIRDP2
7. REFERENCES

1. OVERVIEW

1.1 BACKGROUND
MicroRNAs (miRNAs) are ~21-nucleotide endogenous small RNAs (sRNAs) with potent roles in
regulating gene expression (Bartel, 2009). Over the past two decades, extensive research effort
has been devoted to identifying miRNAs and studying their functions, especially since
next-generation sequencing (NGS) methods became available. miRNAs have distinctive features,
such as a stem-loop precursor structure and preferential accumulation of sequence reads
corresponding to the mature and star miRNAs, and computational tools that capture these
characteristics have been highly successful in identifying miRNAs in diverse species. The public
miRNA repository miRBase currently hosts over 38,000 miRNA entries (version 22), whereas only
~500 were stored in 2008 (version 2.0; Kozomara et al., 2014).

We previously developed miRDeep-P for miRNA prediction in plant species (Yang and Li, 2011).
However, miRDeep-P has two major drawbacks when facing complicated input datasets, which
limit its usefulness for plant miRNA prediction. One is the long running time on complex
genomes or on libraries with high sequencing depth. The other is the relatively large number of
false positives mixed in with true miRNAs, which may severely affect subsequent analysis.

To cope with these shortcomings, we have incorporated new plant miRNA annotation criteria
(Axtell and Meyers, 2018) and overhauled the strategies and algorithm of miRDeep-P, which
led to a significantly improved version, designated miRDeep-P2 (miRDP2). Compared to other
miRNA prediction tools, including MIReNA (Mathelier and Carbone, 2010), miRPlant (An et al.,
2014), miR-PREFeR (Lei and Sun, 2014), and miRA (Evers et al., 2015), miRDP2 shows clear
advantages in running time, sensitivity, and accuracy (details in the manuscript and
supplementary materials).

1.2 SUMMARY OF MIRDP2 FUNCTION
Based on ultra-deep sampling of small RNA libraries by next generation sequencing, miRDP2 is
able to identify miRNA genes in plant species, even for those without detailed annotation, with
extremely high speed and reliable performance.

1.3 IMPLEMENTATION AND ALGORITHM


miRDP2 is written in Perl (Perl 5.8 or later) and uses fundamental packages from the Perl
library. All the scripts have been tested in two environments, CentOS release 6.5 on a cluster
server and Cygwin 2.6.0 on a Windows PC, and should work on similar systems that support Perl.

The basic algorithm framework of miRDP2 was inherited from miRDeep-P (Yang and Li, 2011),
while several critical modifications and novel assistant scripts have been added to the original
tool.

2. INSTALLATION

2.1 DEPENDENCIES
To run miRDP2, the following dependencies are required.

First, Bowtie or Bowtie2, which can be downloaded from:

http://bowtie-bio.sourceforge.net/index.shtml

http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Second, the Vienna RNA package (which provides RNAfold), which can be downloaded from:

http://www.tbi.univie.ac.at/~ivo/RNA/

2.2 DOWNLOAD
To install the miRDP2 package, download the two tarball files from
https://sourceforge.net/projects/mirdp2/files/latest_version/ and extract all of their contents
into one folder.

2.3 TEST
To check that miRDP2 has been installed correctly, run it on the test data with the commands
below. The test data and the expected output can be found at
https://sourceforge.net/projects/mirdp2/files/TestData/.

The test data contains one formatted GSM sequencing file and one Arabidopsis thaliana
genome file. To test miRDP2, please follow this guide (here we use bowtie for demonstration):

mv miRDP2-v*.tar.gz TestData.tar.gz ncRNA_rfam.tar.gz <user_selected_folder>
cd <user_selected_folder>
tar -xvzf miRDP2-v*.tar.gz
tar -xvzf TestData.tar.gz
tar -xvzf ncRNA_rfam.tar.gz
bowtie-build -f ./TestData/TAIR10_genome.fa ./TestData/TAIR10_genome
bowtie-build -f ./ncRNA_rfam.fa ./1.1.*/script/index/rfam_index
(Use bowtie2-build instead if you prefer to use bowtie2 in the later analysis.)
bash ./1.1.4/miRDP2-v1.1.4_pipeline.bash -g ./TestData/TAIR10_genome.fa \
    -x ./TestData/TAIR10_genome -f -i ./TestData/GSM2094927.fa -o .

(Add the option '-T' or '--bowtie2' if you prefer to use bowtie2 for read alignment.)

The bowtie-build commands may take a while; the miRDP2 pipeline itself should finish within
several minutes. A folder named 'GSM2094927-15-0-10' should be generated automatically in
<user_selected_folder>, containing all intermediate files and results.
GSM2094927-15-0-10_filter_P_prediction is the final output of predicted miRNAs. It is a
tab-delimited file whose columns give the chromosome id, strand direction, representative read
id, precursor id, mature miRNA location, precursor location, mature sequence, and precursor
sequence. An additional bed file is derived from this file to facilitate further analysis. The
progress_log file records the steps that have finished. The script_log and script_err files
collect messages and warnings from the bash script. The parameters are explained in parts 3.3
and 5.1.

3. DETECTING NEW MIRNAS

3.1 FORMATTING READS


Before running the pipeline, the input reads must be preprocessed into the proper format. First,
adapters must be removed from the 5' and 3' ends of the deep sequencing reads (if present).
Second, the reads must be converted into FASTA format. Third, redundancy must be removed so
that reads with identical sequence are represented by a single FASTA entry. To record the
original abundance, each sequence identifier must end with '_x' followed by an integer giving
the number of times that exact sequence was retrieved in the deep sequencing dataset. Finally,
all FASTA ids must be unique. One way to ensure this is to include a running number in the id.
For reference, see the file GSM2094927.fa in the test data
(https://sourceforge.net/projects/mirdp2/files/TestData/). The following are several examples:

>read0_x29909
TTTGGATTGAAGGGAGCTCTA
>read1_x36974
TTCCACAGCTTTCTTGAACTG
>read2_x32635
TTCCACAGCTTTCTTGAACTT
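
The collapsing step described above can be sketched in a few lines of Python. This is only an illustration and is not part of the miRDP2 package: the function names collapse_reads and to_fasta are hypothetical, the 'read' prefix simply mirrors the examples above, and upper-casing the sequences is an assumption. Adapter trimming is assumed to have been done already.

```python
from collections import Counter


def collapse_reads(sequences):
    """Collapse identical reads into (identifier, sequence) pairs.

    Identifiers follow the 'read<running number>_x<count>' pattern shown in
    the examples; the running number keeps every identifier unique.
    """
    # Assumption: sequences are upper-cased and empty lines are skipped.
    counts = Counter(seq.strip().upper() for seq in sequences if seq.strip())
    records = []
    # most_common() sorts by descending count; ties keep first-seen order.
    for number, (seq, count) in enumerate(counts.most_common()):
        records.append((f"read{number}_x{count}", seq))
    return records


def to_fasta(records):
    """Render (identifier, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


if __name__ == "__main__":
    reads = [
        "TTTGGATTGAAGGGAGCTCTA",
        "TTCCACAGCTTTCTTGAACTG",
        "TTTGGATTGAAGGGAGCTCTA",
    ]
    print(to_fasta(collapse_reads(reads)), end="")
```

Sorting by abundance is a convenience only; the format requirements above concern the '_x<count>' suffix and the uniqueness of the ids, not the order of the entries.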

3.2 BUILD INDEX


To save time, you may download prebuilt bowtie index files from the bowtie or bowtie2 website
(http://bowtie-bio.sourceforge.net/index.shtml;
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) if the genome of the species you are
working with has already been indexed. Otherwise, you need to index the reference sequences
yourself. Keep the index files until you have finished your project so that you do not need to
re-index your genome.

A second index, built from non-miRNA ncRNA sequences, is needed to filter out noise from ncRNA
fragments. The source file is a collection of the main ncRNA sequences from Rfam, including
rRNA, tRNA, snRNA, and snoRNA. To build this index, please refer to part 2.3, as the index must
be placed and named correctly, i.e. <miRDP2_version>/script/index/rfam_index.

3.3 RUN MIRDP2
To use miRDP2 to detect new miRNAs from deep sequencing data, run the bash script in the
package to start the analysis pipeline (an example can be found in part 2.3):

<path_to_miRDP2_folder>/miRDP2-vx.x_pipeline.bash -g <genome_file> \
    -x <path_to_index/index_prefix> -f -i <seq_file> -o <output_folder>

Please note the version of the pipeline bash script.

Four parameters control the analysis: the number of different locations a read may map to
(-L), the length of the excised precursors (-N), the number of mismatches allowed for bowtie
(-M), and the RPM threshold for reads (-R). A detailed explanation is in part 5.1.

The -T/--bowtie2 option can be used to switch to bowtie2 for read alignment. The --large-index
option should be

3.4 MIRDP2 OUTPUT


The output folder is generated automatically under <output_folder> and is named after
<seq_file_name> (in the test run of part 2.3 it is 'GSM2094927-15-0-10'). The file
<seq_file_name>_filter_P_prediction contains the final predicted miRNAs. The tab-delimited
columns in this file are, in order, chromosome id, strand direction, representative read id,
precursor id, mature miRNA location, precursor location, mature sequence, and precursor
sequence. A bed file is also provided for subsequent analysis.
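
For downstream scripting, the eight tab-delimited columns can be read into named fields. The following Python sketch is an illustration only: the class name Prediction, its field names, and the function parse_predictions are hypothetical labels for the columns listed above, and the two location columns are kept as plain strings because their exact format is not specified in this manual.

```python
import csv
from typing import NamedTuple


class Prediction(NamedTuple):
    """One row of a *_filter_P_prediction file (field names are illustrative)."""
    chromosome: str
    strand: str
    read_id: str
    precursor_id: str
    mature_location: str
    precursor_location: str
    mature_sequence: str
    precursor_sequence: str


def parse_predictions(lines):
    """Yield one Prediction per non-empty tab-delimited line."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(Prediction._fields):
            raise ValueError(f"expected 8 columns, got {len(row)}")
        yield Prediction(*row)
```

A file handle can be passed directly, for example `with open(path) as handle: rows = list(parse_predictions(handle))`, where `path` points to the prediction file in the output folder.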

4. THE CONTENTS OF MIRDP2 SOFTWARE PACKAGE


The miRDP2 package consists of six documented Perl scripts (the scripts in items a and b each
come in a bowtie and a SAM variant), which the prepared bash script runs in sequence. Three of
the six, convert_bowtie_to_blast.pl, filter_alignments.pl, and excise_candidate.pl, are
inherited from miRDeep-P (Yang and Li, 2011). The other scripts are modified from the original
version. The functions of the six scripts are described below:

a. preprocess_reads.pl & preprocess_reads-SAM.pl filter the input reads, removing reads that
are too short or too long (<19 nt or >24 nt), reads that map to Rfam ncRNA sequences, and
reads with fewer than 5 Reads Per Million (RPM). The script then retrieves reads that map to
known mature miRNA sequences. The input files are the original reads in FASTA format and the
bowtie output of the reads mapped to miRNA and ncRNA sequences.

The formula for calculating RPM is as follows:

$$\text{RPM of a miRNA} = \frac{\text{Number of reads mapped to a miRNA (mature part)} \times 10^{6}}{\text{Total number of mapped reads from given library}}$$

b. convert_bowtie_to_blast.pl & convert_SAM_to_blast.pl convert the bowtie format or SAM
format into blast-parsed format. Blast-parsed format is a custom tab-separated format
derived from the standard NCBI blast output format.

c. filter_alignments.pl filters the alignments of deep sequencing reads to a genome. It filters
out partial alignments as well as multi-aligned reads (user-specified frequency cutoff). The
basic input is a file in blast-parsed format.

d. excise_candidate.pl cuts out potential precursor sequences from a reference sequence
using aligned reads as guidelines. The basic input is a file in blast-parsed format and a FASTA
file. The output is all potential precursor sequences in FASTA format.

e. mod-miRDP.pl is modified from the core miRDeep-P algorithm by changing the scoring system
to use plant-specific parameters. It needs two input files: a signature file describing the
read distribution and a structure file containing the precursor structures in dot-bracket
notation.

f. mod-rm_redundant_meet_plant.pl needs three input files: chromosome_length, precursors,
and the original_prediction generated by mod-miRDP.pl. It generates two output files: a
non-redundant prediction file and a prediction file filtered by the plant criteria. Both
tab-delimited output files contain columns that indicate chromosome id, strand direction,
representative read id, precursor id, mature miRNA location, precursor location, mature
sequence, and precursor sequence.
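
The length and RPM filters described for preprocess_reads.pl in item a can be illustrated with a short Python sketch. It is not the Perl implementation shipped with miRDP2: the function names rpm and passes_prefilter are hypothetical, the default cutoff of 5 RPM follows item a (the pipeline's -R option has its own default, see part 5.1), and the Rfam and known-miRNA mapping steps are deliberately left out.

```python
def rpm(read_count, total_mapped_reads):
    """Reads per million: read_count * 10^6 / total_mapped_reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return read_count * 1_000_000 / total_mapped_reads


def passes_prefilter(sequence, read_count, total_mapped_reads,
                     min_len=19, max_len=24, min_rpm=5.0):
    """Return True if a read survives the length and abundance filters.

    Reads shorter than min_len or longer than max_len are rejected, as are
    reads whose RPM is below min_rpm. Filtering against Rfam ncRNA sequences
    is not modelled here.
    """
    if not min_len <= len(sequence) <= max_len:
        return False
    return rpm(read_count, total_mapped_reads) >= min_rpm


if __name__ == "__main__":
    # A 21-nt read seen 50 times in a library of 2,000,000 mapped reads.
    print(rpm(50, 2_000_000))
    print(passes_prefilter("TTTGGATTGAAGGGAGCTCTA", 50, 2_000_000))
```

Because RPM normalises by library size, the same raw count can pass in a shallow library and fail in a deep one, which is why the threshold is expressed in RPM rather than in raw read counts.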

5. ISSUES USING MIRDP2

5.1 PARAMETERS
There are several parameters that can be custom-modified:

a. The first is the maximum number of locations a read may map to (-L/--locate option).
Reads that map to too many sites are probably associated with repeat sequences and are
unlikely to be related to miRNAs. The default setting is 15.
For species that have miRNA families with many members, this parameter may be increased
manually to adapt to the genome landscape.

b. The second is the length of the putative miRNA precursors that the program excises
(-N/--length option). The default setting is 300 nt.

c. The third one is allowed mismatches for bowtie/bowtie2 (-M/--mismatch option). The
default setting is 0.

d. The fourth is the RPM threshold for reads (-R/--rpm option). To reduce running time and
false positives, reads are filtered by RPM. Only reads exceeding the threshold are considered
likely to represent mature miRNA sequences rather than background noise, and only these are
kept for further analysis. The default setting is 10 (RPM).

e. The fifth is the number of threads allowed for RNAfold (-p/--thread option). The default
setting is 1.

Please be aware that changing these parameters may affect both performance and running time.
In general, increasing parameters a and c or decreasing parameter d gives more permissive
results and a longer running time; the opposite changes give stricter results and a shorter
running time.
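
When the pipeline is launched from another script, it can help to spell out every parameter so that a run is reproducible. The Python sketch below only assembles an argument list from the options named in this manual (-g, -x, -f, -i, -o, -L, -N, -M, -R, -p, -T) with the defaults listed above; the function name build_pipeline_command is hypothetical, and passing each value as a separate argument after its flag is an assumption about how the bash script parses its options.

```python
import shlex


def build_pipeline_command(pipeline_script, genome, index_prefix, reads,
                           output_dir, locate=15, length=300, mismatch=0,
                           rpm=10, threads=1, use_bowtie2=False):
    """Return the pipeline invocation as an argument list (nothing is run)."""
    cmd = [
        "bash", pipeline_script,
        "-g", genome,
        "-x", index_prefix,
        "-f",
        "-i", reads,
        "-o", output_dir,
        "-L", str(locate),    # parameter a: mapping locations per read
        "-N", str(length),    # parameter b: excised precursor length
        "-M", str(mismatch),  # parameter c: allowed mismatches
        "-R", str(rpm),       # parameter d: RPM threshold
        "-p", str(threads),   # parameter e: RNAfold threads
    ]
    if use_bowtie2:
        cmd.append("-T")
    return cmd


if __name__ == "__main__":
    command = build_pipeline_command(
        "./1.1.4/miRDP2-v1.1.4_pipeline.bash",
        "./TestData/TAIR10_genome.fa",
        "./TestData/TAIR10_genome",
        "./TestData/GSM2094927.fa",
        ".",
    )
    print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the printed command line has been checked against the pipeline version in use.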

5.2 REDUNDANCY AND MIRNA*


In some cases, the miRNAs reported by miRDP2 may differ from the known miRNAs. We found
that this is mainly due to one of two reasons: heterogeneity of the mature miRNAs, or the
relative abundance of miRNA and miRNA*. We also found that these differences do not affect
the selection of the optimal precursor length or the profiling of known miRNA genes.

5.3 LICENSE AND AVAILABILITY
miRDP2 is freely available under the GNU General Public License (version 3) at:

http://sourceforge.net/projects/mirdp2/

The miRDP2 scripts, demos and user manual can be obtained from the website.

6. APPENDIX - MAJOR UPDATES OF MIRDP2


Our modifications include filtering the input reads, incorporating the latest miRNA annotation
criteria, and removing the restriction on bifurcation in the secondary structure of miRNA
precursors.

Firstly, we filtered out improper reads from the original small RNA libraries and employed a
new strategy to excise the precursors of miRNA candidates. Excising miRNA precursors is one of
the most time-consuming steps, and the new strategy dramatically reduces the time it takes. In
addition, the new strategy improves prediction accuracy by removing false positives. In detail,
we first filter out reads of inappropriate length (<19 nt or >24 nt), since no plant miRNAs are
shorter than 19 nt or longer than 24 nt, as suggested by Axtell and Meyers (2018). In general,
these reads account for around one third of the total reads in a typical small RNA library.
Second, we use only reads that are either similar to known miRNAs (allowing 1 mismatch) or of
high copy number (>=5 RPM) to excise miRNA precursor candidates. These reads make up less
than 5% of the unique reads in a typical small RNA library. Taken together, these improvements
mean that far fewer, more focused reads are used to excise precursors, which dramatically
reduces computational time (by tens to 100 times). At the same time, noise caused by improper
reads is filtered out, which improves prediction accuracy.
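
The read-selection rule in the paragraph above (keep a read if it is within one mismatch of a known miRNA, or if it reaches 5 RPM) can be written down as a small Python sketch. This is a simplification for illustration, not the pipeline's code: miRDP2 obtains the matches from bowtie alignments, whereas the hypothetical helpers mismatches and is_seed_read below only compare sequences of equal length position by position.

```python
def mismatches(a, b):
    """Hamming distance of two equal-length sequences, or None if lengths differ."""
    if len(a) != len(b):
        return None
    return sum(1 for x, y in zip(a, b) if x != y)


def is_seed_read(sequence, read_rpm, known_matures,
                 max_mismatch=1, min_rpm=5.0):
    """Decide whether a read may be used to excise precursor candidates.

    A read qualifies if its RPM reaches min_rpm, or if it differs from any
    known mature miRNA of the same length by at most max_mismatch positions.
    """
    if read_rpm >= min_rpm:
        return True
    for known in known_matures:
        distance = mismatches(sequence, known)
        if distance is not None and distance <= max_mismatch:
            return True
    return False
```

Low-abundance reads therefore survive only when they resemble an annotated miRNA, which is what keeps the set of reads used for excision small.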

Secondly, we have introduced the most up-to-date miRNA annotation criteria (Axtell and
Meyers, 2018) into miRDP2 and developed a new criterion for selecting 23/24-nt miRNA
candidates. All details are in Supplementary Material 2. These changes removed many false
positives and increased the prediction accuracy.

Lastly, we have modified the existing scoring system in the miRDeep-P core algorithm to better
fit plant miRNA characteristics (longer precursors and more complicated secondary structures,
as stated in Yang and Li, 2011). We now allow longer precursors with bifurcation in the
stem-loop region, which are usually filtered out by other prediction tools, including
miRDeep-P. Supplementary Material 4 shows two examples (Ath-MIR157c and Ath-MIR858). This
change has greatly increased the sensitivity of miRDeep-P2.

7. REFERENCES
An, J., Lai, J., Sajjanhar, A., Lehman, M.L. and Nelson, C.C. (2014) miRPlant: an integrated tool
for identification of plant miRNA from RNA sequencing data, BMC Bioinformatics, 15, 275-278.

Axtell, M.J. and Meyers, B.C. (2018) Revisiting Criteria for Plant MicroRNA Annotation in the Era
of Big Data, Plant Cell, 30, 272-284.

Bartel, D.P. (2009) MicroRNAs: Target Recognition and Regulatory Functions, Cell, 136, 215-233.

Evers, M., Huttner, M., Dueck, A., Meister, G. and Engelmann, J.C. (2015) miRA: adaptable
novel miRNA identification in plants using small RNA sequencing data, BMC Bioinformatics,
16, 370.

Fahlgren, N., et al. (2007) High-Throughput Sequencing of Arabidopsis microRNAs: Evidence for
Frequent Birth and Death of MIRNA Genes, PLoS One, 2, e219.

Friedlander, M.R., et al. (2008) Discovering microRNAs from deep sequencing data using
miRDeep, Nat Biotechnol, 26, 407-415.

Lei, J. and Sun, Y. (2014) miR-PREFeR: an accurate, fast and easy-to-use plant miRNA
prediction tool using small RNA-Seq data, Bioinformatics, 30, 2837-2839.

Mathelier, A. and Carbone, A. (2010) MIReNA: finding microRNAs with high accuracy and no
learning at genome scale and from deep sequencing data, Bioinformatics, 26, 2226-2234.

Meyers, B.C., et al. (2008) Criteria for annotation of plant MicroRNAs, Plant Cell, 20, 3186-3190.

Wark, A.W., Lee, H.J. and Corn, R.M. (2008) Multiplexed detection methods for profiling
microRNA expression in biological samples, Angew Chem Int Ed Engl, 47, 644-652.

Yang, X. and Li, L. (2011) miRDeep-P: a computational tool for analyzing the microRNA
transcriptome in plants, Bioinformatics, 27, 2614-2615.

Yang, X., Zhang, H. and Li, L. (2011) Global analysis of gene-level microRNA expression in
Arabidopsis using deep sequencing data, Genomics, 98, 40-46.

Zhu, Q.H., et al. (2008) A diverse set of microRNAs and microRNA-like small RNAs in developing
rice grains, Genome Res, 18, 1456-1465.
