
RNA-seq using Galaxy

Jeroen F. J. Laros
Leiden Genome Technology Center
Department of Human Genetics
Center for Human and Clinical Genetics

Introduction
Sequencers: HiSeq

Characteristics:
• High throughput.
• Paired end.
• High accuracy.
• Read length 2 × 150bp.
• Relatively long run time.
• Relatively expensive.

Figure 1: HiSeq 2000.
RNA-seq data analysis 1/20 Monday, 15 October 2012

Introduction
Sequencers: Ion Torrent

Characteristics:
• Moderate throughput.
• Single end (for now).
• High accuracy.
• Read length ±200bp.
• Short run time.
• Cheap runs.

Figure 2: Ion Torrent.

Introduction
General layout of an RNA-seq pipeline.

1. Pre-alignment.
  • QC.
  • Data cleaning.
2. Alignment.
  • Use a specialised (RNA) aligner.
3. Expression (gene, transcripts) analysis.
  • Known transcripts.
4. Transcript assembly.
  • New transcripts, alternative splicing, etc.

Pre-alignment
FastX / FastQC.

We use the Trimmomatic / FastX toolkit for data cleaning.
• Remove linker sequences.
• Clip low quality bases at the end of the read.
• Judge the part of the read that is left.

The FastQC toolkit is used for quality control (both before and after the data cleaning step).
• Quality scores distribution.
• GC content.
• GC distribution.
• ...
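The clipping and judging steps above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by Trimmomatic or the FastX toolkit: the function names, the quality threshold of 20, the minimum length of 25 and the Phred+33 encoding are all assumptions made for the example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Clip trailing bases whose Phred quality is below min_q.

    Assumes Phred+33 encoded quality strings (an assumption of this sketch).
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is too low.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, min_len=25):
    """Judge the part of the read that is left: keep it only if long enough."""
    return len(seq) >= min_len


if __name__ == "__main__":
    seq, qual = trim_3prime("ACGTACGTAC", "IIIIIIII#!")
    print(seq, qual, keep_read(seq, min_len=5))
```

Real cleaning tools typically offer more refined strategies; the point here is only that clipping and the decision to keep a read are two separate steps.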

Pre-alignment
FastQC report.

Figure 3: Per base sequence content.
Figure 4: Per sequence quality.
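To make the two kinds of report concrete, the sketch below computes a per base sequence content table and a per read GC fraction for a handful of reads. It is an illustration of what such a report summarises, not FastQC's implementation; the function names are hypothetical and the input is assumed to be a non-empty list of non-empty reads.

```python
from collections import Counter


def per_base_content(reads):
    """Return, for each read position, the fraction of each base observed."""
    length = max(len(read) for read in reads)
    content = []
    for pos in range(length):
        # Only reads that are long enough contribute to this position.
        counts = Counter(read[pos] for read in reads if pos < len(read))
        total = sum(counts.values())
        content.append({base: n / total for base, n in counts.items()})
    return content


def gc_fraction(read):
    """Fraction of G and C bases in a single read."""
    return sum(base in "GC" for base in read) / len(read)


if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "TCGG"]
    for pos, fractions in enumerate(per_base_content(reads)):
        print(pos, fractions)
    print([gc_fraction(read) for read in reads])
```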

Alignment
RNA aligners.

Difference with DNA:
• Splicing.

This affects:
• Insert sizes.
• Mapping of reads that cover an exon-exon boundary.

Available tools:
• Tophat.
• Gmap / Gsnap.
• MapSplice.
• HMMSplicer.
• PASSion.
• ...
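Why a read that covers an exon-exon boundary needs special treatment can be shown with a toy example. The sketch below uses exact string matching on a made-up genome; it is not how Tophat, Gsnap or any other listed aligner works, and the function name, the sequences and the minimum anchor length are assumptions of the example.

```python
def find_split(read, genome, min_anchor=4):
    """Place a read on the genome, allowing one gap (an intron).

    Returns a list of (genome_position, length) segments, or None.
    Only the first occurrence of each piece is considered.
    """
    pos = genome.find(read)
    if pos != -1:
        # The read maps contiguously, as a DNA read would.
        return [(pos, len(read))]
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        # The right part must map downstream of the left part.
        right_pos = genome.find(right, left_pos + cut)
        if right_pos != -1:
            return [(left_pos, cut), (right_pos, len(right))]
    return None


if __name__ == "__main__":
    exon1, intron, exon2 = "ACGTTGCA", "GTAAGTTTTTCAG", "CCGGATTA"
    genome = exon1 + intron + exon2
    read = exon1[-5:] + exon2[:5]  # spans the exon-exon boundary
    print(find_split(read, genome))
```

The read does not occur contiguously in the genome because the intron sits between its two halves; only the split placement recovers it.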

Alignment
Choose your aligner carefully.

If you work with pre-mRNA, the options are limited.
• Some tools prefer splitting reads over mapping them in an intron.
• Some tools find exons first, then use this to break up reads.

Some tools rely heavily on annotation.
• A list of known splice sites.
• Motifs (canonical splice sites).
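The two kinds of annotation mentioned above can be combined in a simple check on a candidate intron. The sketch assumes forward-strand coordinates, treats GT...AG as the canonical motif, and uses hypothetical function names and a made-up list of known splice sites; real aligners weigh this evidence in more elaborate ways.

```python
def is_canonical_intron(genome, start, end):
    """True if the intron genome[start:end] starts with GT and ends with AG.

    Assumes a forward-strand transcript (an assumption of this sketch).
    """
    intron = genome[start:end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def junction_supported(genome, start, end, known_sites):
    """Accept a junction if it is annotated or has a canonical motif."""
    return (start, end) in known_sites or is_canonical_intron(genome, start, end)


if __name__ == "__main__":
    genome = "ACGTTGCA" + "GTAAGTTTTTCAG" + "CCGGATTA"
    known_sites = {(2, 6)}  # hypothetical annotated junction
    print(junction_supported(genome, 8, 21, known_sites))
    print(junction_supported(genome, 2, 6, known_sites))
    print(junction_supported(genome, 0, 5, known_sites))
```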

Alignment
Gmap.

Gmap: A Genomic Mapping and Alignment Program for mRNA and EST Sequences.
Gsnap: Genomic Short-read Nucleotide Alignment Program.

Some features:
• Split read alignment.
• Split a read into many pieces.
• Split both ends.
• No limit on intron size.
• Fast.
• Memory efficient.

http://research-pub.gene.com/gmap/

Expression analysis and transcript assembly
Cufflinks.

Input:
• Aligned reads.
  • Tophat.
  • Gmap / Gsnap.

What it can do:
• Assemble transcripts.
• Estimate transcript abundance.

Differential expression and regulation (cuffdiff).

http://cufflinks.cbcb.umd.edu/

Expression analysis and transcript assembly
Cufflinks.

Modes of operation:
• Use predefined transcripts.
• Assemble transcripts assisted by known transcripts.
• Assemble transcripts with no prior knowledge.

When to use:
• Only interested in expression.
• Alternative splicing.

http://cufflinks.cbcb.umd.edu/
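Cufflinks reports transcript abundance in FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows only that normalisation on raw counts; it leaves out the statistical model Cufflinks uses to assign fragments to transcripts, and the function name and numbers are made up for illustration.

```python
def fpkm(fragments, transcript_length, total_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragments: fragments assigned to this transcript.
    transcript_length: transcript length in bases.
    total_fragments: all mapped fragments in the sample.
    """
    return fragments * 1e9 / (transcript_length * total_fragments)


if __name__ == "__main__":
    # Two transcripts with the same count but different lengths.
    print(fpkm(100, 1000, 1_000_000))
    print(fpkm(100, 2000, 1_000_000))
```

Normalising by length and by sequencing depth is what makes abundances comparable between transcripts and between samples.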

Variant calling
Principle of variant calling

Figure 5: Result of an alignment.

Variant calling
Principle of variant calling

In principle, we call a variant when we are confident we have seen one.

But when are we confident?
• More than x times?
• In more than y percent of the reads covering the variant?

Variant callers can use:
• Fixed settings.
• Statistical models.
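A caller with fixed settings can be written down directly from the two questions above. The sketch below is a toy example, not the method of any particular variant caller: the function name, the representation of the pileup as a string of bases, and the default values for x and y are assumptions.

```python
from collections import Counter


def call_variant(ref_base, pileup, x=2, y=20.0):
    """Call the most frequent non-reference base at one position.

    pileup: string with the bases of all reads covering the position.
    A call is made only if the base is seen more than x times and in
    more than y percent of the covering reads.
    """
    depth = len(pileup)
    counts = Counter(base for base in pileup if base != ref_base)
    if depth == 0 or not counts:
        return None
    alt, seen = counts.most_common(1)[0]
    if seen > x and 100.0 * seen / depth > y:
        return alt
    return None


if __name__ == "__main__":
    print(call_variant("A", "AAAAAGGGGG"))        # well supported
    print(call_variant("A", "AAAAAAAAAG"))        # seen only once
    print(call_variant("A", "A" * 97 + "GGG"))    # seen often enough, low percentage
```

A statistical model would replace the two hard thresholds with a probability that takes the error rate and the expected allele fractions into account.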

Variant calling
Some considerations

Things a variant caller might take into account:
• Strand balance.
• Distribution within the reads.
• Base quality.
• Mapping quality.
• Ploidy of the organism in question.

Some complications when analysing RNA:
• Allele specific expression.
  • Heterozygosity may not be detected.
• Tissue specific expression.
  • Some variants will be missed completely.
• RNA editing.
  • Some variants will not be present on DNA.
• Strand specific sample prep.
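Strand balance, the first consideration above, is easy to quantify. The sketch below is one possible measure with a hypothetical function name, not the statistic of any specific caller. Note that with a strand specific sample prep all reads of a transcript come from one strand, so a filter on this value would reject true variants.

```python
def strand_balance(forward_alt, reverse_alt):
    """Fraction of variant-supporting reads on the minority strand.

    0.5 means perfectly balanced support, 0.0 means the variant is
    seen on one strand only.
    """
    total = forward_alt + reverse_alt
    if total == 0:
        return 0.0
    return min(forward_alt, reverse_alt) / total


if __name__ == "__main__":
    print(strand_balance(5, 5))    # balanced
    print(strand_balance(10, 0))   # one strand only
```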

Pipelines
Combining tools in a pipeline.

    # Align the reads in $i to the reference.
    bwa aln -t 8 $reference $i > $i.sai
    # Convert the alignment to SAM format.
    bwa samse $reference $i.sai $i > $i.sam
    # Convert SAM to BAM.
    samtools view -bt $reference -o $i.bam $i.sam

Listing 1: Shell script.

    # Align reads; MKREF (defined elsewhere) gives the reference for a target.
    %.sai: %.fq
            $(BWA) aln -t $(THREADS) $(call MKREF, $@) $< > $@

    # Convert the alignment to SAM format.
    %.sam: %.sai %.fq
            $(BWA) samse $(call MKREF, $@) $^ > $@

    # Convert SAM to BAM.
    %.bam: %.sam
            $(SAMTOOLS) view -bt $(call MKREF, $@) -o $@ $<

Listing 2: Makefile.
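The same three steps can also be chained from Python. The sketch below only rebuilds the commands of Listing 1 as argument lists, with the same options and file naming; the function names are hypothetical, and running the pipeline assumes that bwa and samtools are installed and that the reference has been indexed.

```python
import subprocess


def alignment_commands(reference, fastq, threads=8):
    """Return (argument list, stdout file or None) for each step of Listing 1."""
    sai, sam, bam = fastq + ".sai", fastq + ".sam", fastq + ".bam"
    return [
        (["bwa", "aln", "-t", str(threads), reference, fastq], sai),
        (["bwa", "samse", reference, sai, fastq], sam),
        (["samtools", "view", "-bt", reference, "-o", bam, sam], None),
    ]


def run(commands):
    """Run the steps in order, stopping at the first failure."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            # Redirect standard output to a file, like '>' in the shell.
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    for argv, stdout_path in alignment_commands("ref.fa", "sample.fq"):
        print(" ".join(argv), ">" if stdout_path else "", stdout_path or "")
```

Unlike the Makefile, this version does not skip steps whose output is already up to date; it only makes the order of the steps and their intermediate files explicit.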

Galaxy
Overview.

Data intensive biology for everyone.
• Web based.
• No installation required.
• Open source.

• Wrapper for command line utilities.
• User friendly.
• Point and click.
• Workflows.
  • Save all the steps you did in your analysis.
  • Rerun the entire analysis on a new dataset.
  • Share your workflow with other people.
  • ...

http://galaxy.psu.edu/
http://galaxy.nbic.nl/
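The idea behind a workflow, saving the steps of an analysis so they can be rerun on a new dataset, can be sketched without Galaxy. The class below is a toy model and has nothing to do with Galaxy's own API; the class name, the method names and the string operations standing in for tools are all made up.

```python
class Workflow:
    """A saved list of named steps that can be replayed on any dataset."""

    def __init__(self):
        self.steps = []

    def add(self, name, func):
        """Record a step; returns self so that steps can be chained."""
        self.steps.append((name, func))
        return self

    def run(self, dataset):
        """Apply all steps in order and return the resulting history."""
        history = [("input", dataset)]
        for name, func in self.steps:
            dataset = func(dataset)
            history.append((name, dataset))
        return history


if __name__ == "__main__":
    workflow = Workflow().add("upper", str.upper).add("reverse", lambda s: s[::-1])
    print(workflow.run("acgt"))
    print(workflow.run("ggca"))  # same steps, new dataset
```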

Galaxy
The Galaxy GUI.

Figure 6: Galaxy panels.

Galaxy
Galaxy icons.

• Eye: view.
• Pencil: edit (rename).
• Cross: delete.
• Click on the title for a more detailed view.

Figure 7: Collapsed history item.

Galaxy
Galaxy icons.

• Diskette: save.
• Blue looping arrow: rerun.

Figure 8: History item.

Galaxy
Outline of the practical

1. Do a typical RNA-seq analysis.
  • Expression.
  • Novel transcripts.
2. Variant calling.
3. Workflows.
  • Rerun the analysis with no effort.

Questions?

Acknowledgements:
Hailiang Mei
Michiel van Galen
Martijn Vermaat
Johan den Dunnen