
Analysis of RNA-Seq Data with R/Bioconductor

Thomas Girke

December 14, 2013


Overview

RNA-Seq Analysis
Aligning Short Reads
Counting Reads per Feature
DEG Analysis
GO Analysis
View Results in IGV & ggbio
Differential Exon Usage

References


RNA-Seq Technology
[Figure: RNA-Seq workflow shown for two samples (Sample 1 and Sample 2)]
1. mRNA isolation
2. Illumina sequencing
3. Align sequences against genome
4. Generate sequence counts for all genes in genome
Example counts: Gene A: 30 vs. 10 reads = 3-fold change; Gene B: 10 vs. 5 reads = 2-fold change


Analysis Workflow of RNA-Seq Gene Expression Data
1. Alignment of RNA reads to reference
Reference can be genome or transcriptome.

2. Count reads overlapping with annotation features of interest
Most common: counts for exonic gene regions, but many viable
alternatives exist here: counts per exons, genes, introns, etc.

3. Normalization
Main adjustment for sequencing depth and compositional bias.

4. Identification of Differentially Expressed Genes (DEGs)
Identification of genes with significant expression differences.
Identification of expressed genes possible for strongly expressed ones.

5. Specialty applications
Splice variant discovery (semi-quantitative), gene discovery, antisense
expression, etc.

6. Cluster Analysis
Identification of genes with similar expression profiles across many
samples.

7. Enrichment Analysis of Functional Annotations
Gene ontology analysis of obtained gene sets from steps 5-6.

Important Aspects in RNA-Seq Analysis
Alignment reference
  Genome
  Transcript models
  Both
How to quantify expression?
  Read count per range
  Coverage statistics per range
What features?
  Genes, transcript models, exons
Alternative splicing
  Often restricted to splice junction analysis
Objective: discovery vs. quantification

Important Considerations for NGS Alignments
In NGS we usually want to find the origin of reads (NG sequences) in a reference genome or transcriptome.
Thus, we are mostly interested in finding the best scoring or multiple best scoring locations for each read, but not lower scoring alternative solutions as in paralog/ortholog search applications.
Ambiguous mappings should be removed, because there is no evidence for their origin.
However, for certain applications one needs to include them, e.g. when mapping RNA-Seq reads against transcript sequences instead of genome.
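The filtering idea above can be sketched in base R. This is only an illustration on a hypothetical table of candidate alignments: the column names `read`, `pos` and `score` are assumptions and do not correspond to the output format of any particular aligner. The sketch keeps the best scoring location per read and sets aside reads whose best score is tied across several locations.

```r
# Hypothetical candidate alignments: one row per (read, location) pair.
hits <- data.frame(
  read  = c("r1", "r1", "r2", "r2", "r3"),
  pos   = c(100, 5000, 200, 7000, 300),
  score = c(50, 42, 48, 48, 50),
  stringsAsFactors = FALSE
)

# Best score observed for each read, aligned to the rows of 'hits'.
best_score <- ave(hits$score, hits$read, FUN = max)
best_hits  <- hits[hits$score == best_score, ]

# Number of locations sharing the best score for each read.
n_best <- table(best_hits$read)

# Unique mappers: exactly one best scoring location.
unique_reads <- names(n_best)[n_best == 1]
unique_hits  <- best_hits[best_hits$read %in% unique_reads, ]

# Ambiguous mappers: best score tied across several locations.
ambiguous_reads <- names(n_best)[n_best > 1]

print(unique_hits)
print(ambiguous_reads)
```

In this toy table, read r1 has a lower scoring alternative that is simply ignored, while read r2 has two equally good locations and is therefore treated as ambiguous.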

Short Read Aligner for RNA-Seq
No special requirements for alignments with low number of variants
  ChIP-Seq
  RNA-Seq (if mapping against transcriptome or intron-less genome)
  Bis-Seq (with injected reference)
  ...
Variant tolerant aligners to account for mismatches and indels
  VAR-Seq
  Bis-Seq (without injected reference)
  ...
Splice tolerant aligner to account for introns
  RNA-Seq (if mapping against genome with introns)

Sequence Alignment/Map (SAM/BAM) Format
SAM is a tab-delimited alignment format consisting of a header section (lines starting with @) and an alignment section with 11 mandatory columns plus optional tag fields. BAM is the compressed, indexed and binary version of this format. The below sample alignment contains the following features: (1) bases in lower cases are clipped from the alignment, (2) read r001/1 and r001/2 constitute a read pair, (3) r003 is a chimeric read, (4) r004 represents a split alignment.

Coor     12345678901234  5678901234567890123456789012345
ref      AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
+r001/1        TTAGATAAAGGATA*CTG
+r002         aaaAGATAA*GGATA
+r003       gcctaAGCTAA
+r004                     ATAGCT..............TCAGC
-r003                            ttagctTAGGC
-r001/2                                        CAGCGGCAT

⇓ SAM Format

r001  163 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
r001   83 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1

For details see the SAM Format Specification.
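To make the CIGAR column of the example more concrete, the following base-R sketch computes how many reference positions an alignment covers. The helper name `cigar_ref_width` is hypothetical. It relies only on the SAM convention that the operations M, D, N, = and X consume reference bases, while I, S, H and P do not.

```r
# Reference span of an alignment: sum of the lengths of all CIGAR
# operations that consume reference bases (M, D, N, =, X).
cigar_ref_width <- function(cigar) {
  ops  <- regmatches(cigar, gregexpr("[0-9]+[MIDNSHP=X]", cigar))[[1]]
  len  <- as.integer(sub("[MIDNSHP=X]$", "", ops))
  type <- sub("^[0-9]+", "", ops)
  sum(len[type %in% c("M", "D", "N", "=", "X")])
}

# CIGAR strings and leftmost positions of three reads from the example.
cigars <- c(r001 = "8M2I4M1D3M", r002 = "3S6M1P1I4M", r004 = "6M14N5M")
pos    <- c(r001 = 7, r002 = 9, r004 = 16)

ref_width <- sapply(cigars, cigar_ref_width)
ref_end   <- pos + ref_width - 1

print(data.frame(cigar = cigars, start = pos, width = ref_width, end = ref_end))
```

The split alignment r004 (6M14N5M) spans 25 reference positions although only 11 bases of the read are aligned, because the 14N operation skips over an intron-like gap.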

Normalization Required
[Figure: log ratio distributions (a and b) and MA plot (c) for two tissue samples (from Robinson and Oshlack, 2010). Panel (a): density of log2(Kidney1/NK1) − log2(Kidney2/NK2). Panel (b): density of log2(Liver/NL) − log2(Kidney/NK). Panel (c): M = log2(Liver/NL) − log2(Kidney/NK) plotted against A, with housekeeping genes and genes unique to a sample highlighted.]

Be Careful with RPKM/FPKM Values
RPKM Concept (FPKM is paired-end version of it)
RPKM (FPKM): reads (fragments) per kb per million mapped reads
The more we sequence, the more reads we expect from each gene. This is the most relevant correction of this method.
Longer transcripts are expected to generate more reads. The latter is only relevant for comparisons among different genes which we rarely perform!
RPKM/FPKM are not suitable for statistical testing. Why? Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 million reads from each library. In the latter case we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios. Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!
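The example above can be checked numerically. The following sketch uses an assumed gene length of 2 kb for gene X (an illustrative value, not from the slides). The exact binomial test is used only as a simple stand-in to show how the evidence differs between the two scenarios; it is not the testing method recommended in this presentation.

```r
# RPKM = reads / (gene length in kb) / (mapped reads in millions)
rpkm <- function(reads, gene_length_bp, mapped_reads) {
  reads / (gene_length_bp / 1e3) / (mapped_reads / 1e6)
}

gene_length <- 2000  # assumed length of gene X in bp (illustrative)

# Scenario 1: one million reads per library, 10 vs. 5 reads for gene X.
shallow <- rpkm(c(A = 10, B = 5), gene_length, 1e6)

# Scenario 2: 100 million reads per library, 100x as many reads for gene X.
deep <- rpkm(c(A = 1000, B = 500), gene_length, 1e8)

print(shallow)
print(deep)

# With equal library sizes, the raw counts can be compared with an exact
# binomial test: are the reads split 50:50 between the two treatments?
p_shallow <- binom.test(10, 10 + 5, p = 0.5)$p.value
p_deep    <- binom.test(1000, 1000 + 500, p = 0.5)$p.value
print(c(shallow = p_shallow, deep = p_deep))
```

Both scenarios give identical RPKM values, yet only the raw counts of the deeper experiment provide strong evidence for a difference. This information is lost once counts are converted to RPKM.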

TMM Method Corrects for RNA Composition Bias
Trimmed Mean of M Values (TMM) by Robinson and Oshlack (2010)
Many RNA-Seq normalization methods perform poorly on samples with extreme composition bias. For instance, in one sample a large number of reads comes from rRNAs while in another they have been removed more efficiently.
Most scaling based methods, including RPKM and CPM, will underestimate the expression of weaker expressed genes in the presence of extremely abundant mRNAs (less sequencing real estate available for them).
The TMM method tries to correct this bias.
Method implemented in edgeR library (Robinson et al., 2010).
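The composition bias and its correction can be illustrated with a deliberately simplified sketch in base R. It uses an unweighted trimmed mean of the per-gene log ratios (M values) on toy counts. The TMM implementation in edgeR additionally trims on absolute expression and applies precision weights, so this is an approximation of the idea, not a reimplementation.

```r
# Toy counts: 100 genes, identical expression for genes 1-99 in both samples.
# In sample B, gene 100 is extremely abundant and takes up half of the library.
counts_A <- rep(100, 100)
counts_B <- c(rep(100, 99), 10000)

lib_A <- sum(counts_A)
lib_B <- sum(counts_B)

# Plain library size scaling (counts per million):
# genes 1-99 appear about two-fold lower in sample B.
cpm_A <- counts_A / lib_A * 1e6
cpm_B <- counts_B / lib_B * 1e6

# M values: per-gene log2 ratios of library-size-scaled counts (B vs. A).
M <- log2((counts_B / lib_B) / (counts_A / lib_A))

# Simplified TMM: unweighted mean of M after trimming 30% from each tail.
norm_factor_B <- 2^mean(M, trim = 0.3)

# Effective library size and composition-corrected CPM for sample B.
eff_lib_B   <- lib_B * norm_factor_B
cpm_B_fixed <- counts_B / eff_lib_B * 1e6

print(c(cpm_A = cpm_A[1], cpm_B = cpm_B[1], cpm_B_fixed = cpm_B_fixed[1]))
```

Because the single highly abundant gene falls into the trimmed tail, the normalization factor is driven by the 99 unchanged genes, and their corrected CPM values agree between the two samples.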

Analysis of Differentially Expressed Genes (DEGs)
Data is discrete, positively skewed ⇒ no (log-)normal model
Small numbers of replicates ⇒ no rank based or permutation methods
Sequencing depth (coverage) varies among samples ⇒ normalization
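The properties listed above can be made tangible with a small simulation. The sketch below assumes a mean count of 100 and a dispersion of 0.1, both arbitrary illustrative values, and uses the common parameterization in which the negative binomial variance is mu + phi * mu^2, compared with a variance of mu under the Poisson model.

```r
set.seed(1)  # fixed seed so the simulation is reproducible

mu  <- 100   # assumed mean count of a gene (illustrative)
phi <- 0.1   # assumed dispersion (illustrative)

# Theoretical variances under the two count models.
var_poisson <- mu
var_negbin  <- mu + phi * mu^2

# Simulated counts across many hypothetical replicates.
n    <- 1e5
pois <- rpois(n, lambda = mu)
nb   <- rnbinom(n, mu = mu, size = 1 / phi)

# Observed variances: the negative binomial counts are overdispersed.
print(c(poisson = var(pois), negbin = var(nb)))

# Counts are integers, and the mean exceeds the median (positive skew).
print(c(mean = mean(nb), median = median(nb)))
```

The variance of the simulated negative binomial counts grows with the square of the mean, which is one reason why models assuming a constant variance are a poor fit for count data of this kind.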

DEG Analysis Methods
Requirements
One would like to perform a t-test or something similar for each gene.
The t-test assumes a normal distribution and no mean-variance dependence. Both are not appropriate assumptions for RNA-Seq data.
Variance estimation and rank-order statistics are difficult on small sample numbers.
The multiple testing issue is very similar to the one in microarray data analysis. Thus, most tools provide False Discovery Rates (FDRs), which are derived from p-values corrected for multiple testing using the Benjamini-Hochberg method.
Statistical Testing
Poisson distribution (initially used but not very common anymore)
Most statistical methods for RNA-Seq DEG analysis use negative binomial distribution along with modified statistical tests based on that.
For variance estimation most methods borrow information across genes
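As a small worked example of the Benjamini-Hochberg correction mentioned above, the following sketch adjusts six made-up p-values by hand and compares the result with base R's `p.adjust`. The p-values are purely illustrative.

```r
p <- c(0.0001, 0.004, 0.019, 0.03, 0.2, 0.6)  # illustrative p-values
m <- length(p)

# Manual Benjamini-Hochberg adjustment:
# 1. sort the p-values, 2. scale each by m / rank,
# 3. enforce monotonicity from the largest p-value downwards, 4. cap at 1.
o          <- order(p)
adj_sorted <- p[o] * m / seq_len(m)
adj_sorted <- rev(cummin(rev(adj_sorted)))
adj_sorted <- pmin(adj_sorted, 1)

fdr_manual    <- numeric(m)
fdr_manual[o] <- adj_sorted  # restore the original order

# Same adjustment with the built-in function.
fdr <- p.adjust(p, method = "BH")

print(rbind(p = p, fdr_manual = fdr_manual, fdr = fdr))
print(sum(fdr <= 0.05))  # number of tests passing a 5% FDR cutoff
```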

Software for RNA-Seq DEG Analysis
edgeR (Robinson et al., 2010)
DESeq/DESeq2 (Anders and Huber, 2010)
DEXSeq (Anders et al., 2012)
limmaVoom
Cuffdiff/Cuffdiff2 (Trapnell et al., 2013)
PoissonSeq
baySeq
...

Packages for RNA-Seq Analysis in R
GenomicRanges: high-level infrastructure for range data
Rsamtools: BAM support
rtracklayer: import/export of range and annotation data, interface to online genome browsers, etc.
DESeq: RNA-Seq DEG analysis
DESeq2: RNA-Seq DEG analysis
edgeR: RNA-Seq DEG analysis
DEXSeq: RNA-Seq exon analysis
QuasR: RNA-Seq workflows


Data Sets and Experimental Variables
To make the following sample code work, please follow these instructions:
Download and unpack the sample data for this practical.
Direct your R session into the resulting Rrnaseq directory. It contains four slimmed down FASTQ files (SRA023501) from A. thaliana, as well as the corresponding reference genome sequence (FASTA) and annotation (GFF) file.
Start the analysis by opening in your R session the Rrnaseq.R script which contains the code shown in this slide show in pure text format.
The FASTQ files are organized in the provided targets.txt file. This is the only file in this analysis workflow that needs to be generated manually, e.g. in a spreadsheet program. To import targets.txt, we run the following commands from R:
> download.file("http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/Wor
> targets <- read.delim("./data/targets.txt")
> targets
         FileName SampleName Factor Factor_long
1 SRR064154.fastq   AP3_fl4a    AP3     AP3_fl4
2 SRR064155.fastq   AP3_fl4b    AP3     AP3_fl4
3 SRR064166.fastq    Tl_fl4a    TRL      Tl_fl4
4 SRR064167.fastq    Tl_fl4b    TRL      Tl_fl4
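For readers without the sample data at hand, the structure of the targets file can be reproduced with a short base-R sketch. It writes the four rows shown above to a temporary tab-delimited file (the temporary path is an assumption made for this example; the workflow itself uses ./data/targets.txt) and reads it back with `read.delim`.

```r
# Experimental design as shown in the targets table of this slide.
targets_demo <- data.frame(
  FileName    = c("SRR064154.fastq", "SRR064155.fastq",
                  "SRR064166.fastq", "SRR064167.fastq"),
  SampleName  = c("AP3_fl4a", "AP3_fl4b", "Tl_fl4a", "Tl_fl4b"),
  Factor      = c("AP3", "AP3", "TRL", "TRL"),
  Factor_long = c("AP3_fl4", "AP3_fl4", "Tl_fl4", "Tl_fl4"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file to a temporary location (illustrative path).
targets_file <- tempfile(fileext = ".txt")
write.table(targets_demo, targets_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Import it the same way the workflow imports targets.txt.
targets_in <- read.delim(targets_file, stringsAsFactors = FALSE)

# Number of replicates per experimental factor.
reps <- table(targets_in$Factor)
print(targets_in)
print(reps)
```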


Align Reads Option 1: QuasR
QuasR is an extremely versatile NGS mapping and postprocessing pipeline for RNA-Seq and many other application areas, such as BS-Seq, allele-specific RNA-Seq, etc. It uses Rbowtie for ungapped alignments and SpliceMap for spliced alignments.
(1) Environment settings
> library(QuasR)
> targets <- read.delim("data/targets.txt")
> write.table(targets[,1:2], "data/QuasR_samples.txt", row.names=FALSE, quote=FALSE, sep="\t")
> sampleFile <- "./data/QuasR_samples.txt"
> genomeFile <- "./data/tair10chr.fasta"
> results <- "./results" # defines location where to write results
> cl <- makeCluster(1) # defines number of CPU cores to use
(2) Single command to index reference, align all samples and generate BAM files.
> proj <- qAlign(sampleFile, genome=genomeFile, maxHits=1, splicedAlignment=FALSE, alignmentsDir=results,
+                clObj=cl, cacheDir=results)
> # Note: splicedAlignment should be set to TRUE when the reads are >=50nt long
> (alignstats <- alignmentStats(proj)) # Alignment summary report
                seqlength  mapped unmapped
AP3_fl4a:genome     7e+05 1607234    26022
AP3_fl4b:genome     7e+05 1647774    21272
Tl_fl4a:genome      7e+05  206041     4366
Tl_fl4b:genome      7e+05  283742     5279

Align Reads Option 2: Rsubread
Rsubread is an R/Bioc package that implements an extremely fast aligner for RNA-Seq data. It is currently only available for OS X and Linux, but not for Windows.
(1) Index reference genome
> library(Rsubread); library(Rsamtools)
> dir.create("results") # Note: all output data will be written to directory 'results'
> buildindex(basename="./results/tair10chr.fasta", reference="./data/tair10chr.fasta") # Build indexed reference
(2) Align all FASTQ files with Rsubread in loop. Includes generation of indexed BAM files.
> targets <- read.delim("./data/targets.txt") # Import experiment design information
> input <- paste("./data/", targets$FileName, sep="")
> output <- paste("./results/", targets$FileName, ".sam", sep="")
> reference <- "./results/tair10chr.fasta"
> for(i in seq(along=targets$FileName)) {
+     align(index=reference, readfile1=input[i], output_file=output[i], nthreads=8, indels=1, TH1=2)
+     asBam(file=output[i], destination=gsub(".sam", "", output[i]), overwrite=TRUE, indexDestination=TRUE)
+     unlink(output[i])
+ }

Align Reads Option 3: Bowtie2/Tophat2
Note: this step requires the command-line tools tophat2/bowtie2.
(1) Index reference genome
> library(modules) # Skip this and next line if you are not using IIGB's biocluster
> moduleload("bowtie2/2.1.0"); moduleload("tophat/2.0.8b") # loads bowtie2/tophat2 from module system
> system("bowtie2-build ./data/tair10chr.fasta ./data/tair10chr.fasta")
(2) Align all FASTQ files with Bowtie2/Tophat2 in loop. Includes generation of indexed BAM files.
> library(Rsamtools)
> dir.create("results") # Note: all output data will be written to directory 'results'
> input <- paste("./data/", targets$FileName, sep="")
> output <- paste("./results/", targets$FileName, sep="")
> reference <- "./data/tair10chr.fasta"
> for(i in seq(along=input)) {
+     unlink(paste(output[i], ".tophat", sep=""), recursive=TRUE, force=TRUE)
+     tophat_command <- paste("tophat -p 4 -g 1 --segment-length 15 -i 30 -I 3000 -o ", output[i], ".tophat
+     # -G: supply GFF with transcript model info (preferred!)
+     # -g: ignore all alignments with >g matches
+     # -p: number of threads to use for alignment step
+     # -i/-I: min/max intron lengths
+     # --segment-length: length of split reads (25 is default)
+     system(tophat_command)
+     sortBam(file=paste(output[i], ".tophat/accepted_hits.bam", sep=""), destination=paste(output[i], ".top
+     indexBam(paste(output[i], ".tophat/accepted_hits.bam", sep=""))
+ }

Alignment Summary
The following enumerates the number of reads in each FASTQ file and how many of them aligned to the reference. Note: the percentage of aligned reads is 100% in this particular example because only alignable reads were selected when generating the sample FASTQ files for this exercise. For QuasR this step can be omitted because the qAlign function generates this information automatically.
> library(ShortRead); library(Rsamtools)
> Nreads <- countLines(dirPath="./data", pattern=".fastq$")/4
> bfl <- BamFileList(paste0("./results/", targets$FileName, ".bam"), yieldSize=50000
> Nalign <- countBam(bfl)
> (read_statsDF <- data.frame(FileName=names(Nreads), Nreads=Nreads, Nalign=Nalign$r
+                             Perc_Aligned=Nalign$records/Nreads*100))
                       FileName  Nreads  Nalign Perc_Aligned
SRR064154.fastq SRR064154.fastq 1633256 1633256          100
SRR064155.fastq SRR064155.fastq 1669046 1669046          100
SRR064166.fastq SRR064166.fastq  210407  210407          100
SRR064167.fastq SRR064167.fastq  289021  289021          100
> write.table(read_statsDF, "results/read_statsDF.xls", row.names=FALSE, quote=FALSE

Quality Reports
The following shows how to create read quality reports with QuasR's qQCReport function or with the custom seeFastq function.
> qQCReport(proj, pdfFilename="results/qc_report.pdf")
> source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/fastqQuali
> myfiles <- paste0("data/", targets$FileName); names(myfiles) <- targets$SampleName
> fqlist <- seeFastq(fastq=myfiles, batchsize=50000, klength=8)
> pdf("results/fastqReport.pdf", height=18, width=4*length(myfiles)); seeFastqPlot(f


Import Annotation Data from GFF
Annotation data from GFF
> library(rtracklayer); library(GenomicRanges); library(Rsamtools)
> gff <- import.gff("./data/TAIR10_GFF3_trunc.gff", asRangedData=FALSE)
> seqlengths(gff) <- end(ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome
> subgene_index <- which(elementMetadata(gff)[,"type"] == "exon")
> gffsub <- gff[subgene_index,] # Returns only exon ranges
> gffsub[1:4, c(2,5)]
GRanges with 4 ranges and 2 metadata columns:
      seqnames       ranges strand |     type              group
         <Rle>    <IRanges>  <Rle> | <factor>           <factor>
  [1]     Chr1 [3631, 3913]      + |     exon Parent=AT1G01010.1
  [2]     Chr1 [3996, 4276]      + |     exon Parent=AT1G01010.1
  [3]     Chr1 [4486, 4605]      + |     exon Parent=AT1G01010.1
  [4]     Chr1 [4706, 5095]      + |     exon Parent=AT1G01010.1
  ---
  seqlengths:
     Chr1   Chr2   Chr3   Chr4   Chr5   ChrC   ChrM
   100000 100000 100000 100000 100000 100000 100000
> ids <- gsub("Parent=|\\..*", "", elementMetadata(gffsub)$group)
> gffsub <- split(gffsub, ids) # Coerce to GRangesList

More Robust: Store Annotations in TranscriptDb
Storing annotation ranges in TranscriptDb databases makes many operations more robust and convenient.
> library(GenomicFeatures)
> txdb <- makeTranscriptDbFromGFF(file="data/TAIR10_GFF3_trunc.gff",
+                                 format="gff3",
+                                 dataSource="TAIR",
+                                 species="Arabidopsis thaliana")
> saveDb(txdb, file="./data/TAIR10.sqlite")
> txdb <- loadDb("./data/TAIR10.sqlite")
> eByg <- exonsBy(txdb, by="gene")

Read Counting with countOverlaps
Number of reads overlapping gene ranges
> samples <- as.character(targets$FileName)
> samplespath <- paste("./results/", samples, ".bam", sep="")
> names(samplespath) <- samples
> countDF <- data.frame(row.names=names(eByg))
> for(i in samplespath) {
+     aligns <- readGAlignmentsFromBam(i) # Substitute next two lines with this
+     counts <- countOverlaps(eByg, aligns, ignore.strand=TRUE)
+     countDF <- cbind(countDF, counts)
+ }
> colnames(countDF) <- samples
> countDF[1:4,]
          SRR064154.fastq SRR064155.fastq SRR064166.fastq SRR064167.fastq
AT1G01010              52              26              60              75
AT1G01020             145              77              82              64
AT1G01030               5               1              13              14
AT1G01040             482             347             302             358
> write.table(countDF, "./results/countDF", quote=FALSE, sep="\t", col.names = NA)
> countDF <- read.table("./results/countDF")
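What counting reads per gene means can be sketched without any Bioconductor infrastructure. The base-R example below uses hypothetical exon and read coordinates and a hypothetical helper `count_reads`. It is a conceptual illustration only and far too slow for real data, where range-based functions such as countOverlaps should be used. Strand is ignored, and a read spanning two exons of the same gene is counted once for that gene.

```r
# Hypothetical gene models: exon ranges (1-based, inclusive) grouped by gene.
exons <- data.frame(
  gene  = c("geneA", "geneA", "geneB"),
  start = c(100, 300, 1000),
  end   = c(200, 400, 1200),
  stringsAsFactors = FALSE
)

# Hypothetical aligned reads given by start and end on the reference.
reads <- data.frame(
  start = c(150, 190, 250, 390, 1100, 1500),
  end   = c(185, 310, 280, 425, 1135, 1535)
)

# A read overlaps a gene if it shares at least one position with any exon.
overlaps_any_exon <- function(read_start, read_end, ex) {
  any(read_start <= ex$end & read_end >= ex$start)
}

# For each gene, count the reads overlapping at least one of its exons.
count_reads <- function(exons, reads) {
  sapply(split(exons, exons$gene), function(ex) {
    sum(mapply(overlaps_any_exon, reads$start, reads$end,
               MoreArgs = list(ex = ex)))
  })
}

counts <- count_reads(exons, reads)
print(counts)
```

In this toy data, the read from 250 to 280 falls entirely into the intron of geneA and the read from 1500 to 1535 lies outside all gene models, so neither contributes to any count.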

Read Counting with summarizeOverlaps
The summarizeOverlaps function from the GenomicRanges package is easier to use, it provides more options and it is much more memory efficient. See the package documentation for details.
> library(GenomicRanges)
> bfl <- BamFileList(samplespath, yieldSize=50000, index=character())
> countDF2 <- summarizeOverlaps(eByg, bfl, mode="Union", ignore.strand=TRUE)
> countDF2 <- assays(countDF2)$counts
> colnames(countDF2) <- samples
> countDF2[1:4,]
          SRR064154.fastq SRR064155.fastq SRR064166.fastq SRR064167.fastq
AT1G01010              52              26              60              75
AT1G01020             145              77              82              64
AT1G01030               5               1              13              14
AT1G01040             482             346             285             339

table(countDF3.] AT1G01010 AT1G01020 AT1G01030 AT1G01040 width AP3_fl4a AP3_fl4b Tl_fl4a Tl_fl4b 1688 46 24 59 70 1774 115 71 73 50 1905 5 0 13 14 6254 464 323 286 349 > write. > countDF3 <.Read Counting with qCount from QuasR QuasR does everything in one command. reportLevel="gene". "results/countDFgene.qCount(proj. txdb. sep="\ Analysis of RNA-Seq Data with R/Bioconductor RNA-Seq Analysis Counting Reads per Feature Slide 30/53 . col. orientation="any") > countDF3[1:4.xls".names=NA. quote=FALSE.

Simple RPKM Normalization
RPKM: reads per kilobase of exon model per million mapped reads
> returnRPKM <- function(counts, gffsub) {
+     geneLengthsInKB <- sum(width(reduce(gffsub)))/1000 # Length of exon union
+     millionsMapped <- sum(counts)/1e+06 # Factor for converting to million of
+     rpm <- counts/millionsMapped # RPM: reads per million mapped reads
+     rpkm <- rpm/geneLengthsInKB # RPKM: reads per kilobase of exon model per m
+     return(rpkm)
+ }
> countDFrpkm <- apply(countDF, 2, function(x) returnRPKM(counts=x, gffsub=eByg))
> countDFrpkm[1:4,]
          SRR064154.fastq SRR064155.fastq SRR064166.fastq SRR064167.fastq
AT1G01010       231.83437      139.974951       1197.1649       1158.4825
AT1G01020       615.12206      394.445066       1556.8093        940.6477
AT1G01030        19.75249        4.770396        229.8389        191.6169
AT1G01040       580.01080      504.221101       1626.3883       1492.5394
RPKM: for QuasR results
> rpkmDFgene <- t(t(countDF3[,-1]/countDF3[,1] * 1000)/colSums(countDF3[,-1]) *1e6)

Reproducibility Check by Sample-Wise Clustering
QC check of the sample reproducibility by computing a correlation matrix and plotting it as a tree. Note: the plotMDS function from edgeR is a more robust method for this task.
> library(ape)
> d <- cor(countDFrpkm, method="spearman")
> hc <- hclust(dist(1-d))
> plot.phylo(as.phylo(hc), type="p", edge.col=4, edge.width=3, show.node.label=TRUE, no.margin=TRUE)
[Figure: sample tree with the leaves SRR064154.fastq, SRR064155.fastq, SRR064166.fastq and SRR064167.fastq]
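The clustering step can be tried on a small toy matrix using only base R. The expression values and sample names below are made up so that the two replicates of each condition are rank-correlated, and `cutree` is used here instead of a tree plot to check that replicates end up in the same cluster.

```r
# Toy expression matrix: 6 genes x 4 samples (values are illustrative).
expr <- cbind(
  AP3_rep1 = c(10, 20, 30, 40, 50, 60),
  AP3_rep2 = c(11, 19, 31, 42, 49, 61),
  TRL_rep1 = c(60, 50, 40, 30, 20, 10),
  TRL_rep2 = c(59, 52, 41, 28, 21, 9)
)

# Sample-wise Spearman correlation matrix.
d <- cor(expr, method = "spearman")

# Hierarchical clustering on distances derived from 1 - correlation,
# following the same construction as on the slide.
hc <- hclust(dist(1 - d))

# Cut the tree into two groups; replicates should fall into the same group.
groups <- cutree(hc, k = 2)
print(round(d, 2))
print(groups)

# plot(hc)  # uncomment to draw the dendrogram with base graphics
```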

Exercise 1: QuasR with Antisense Read Counting
Task 1 Align reads from all 4 samples.
Task 2 Count reads in sense and antisense. Discuss differences. Why is this analysis meaningless for the provided non-strand-specific RNA-Seq samples?
Task 3 Identify all genes where the antisense counts are ≥3-fold higher than the sense counts in at least 2 out of the 4 samples.
Task 4 Plot the result of the most pronounced antisense expression case with ggbio.


Identify DEGs with Simple Fold Change Method
Compute mean values for replicates
> source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/colAg.R")
> countDFrpkm_mean <- colAg(myMA=countDFrpkm, group=c(1,1,2,2), myfct=mean)
> countDFrpkm_mean[1:4,]
          SRR064154.fastq_SRR064155.fastq SRR064166.fastq_SRR064167.fastq
AT1G01010                             ...                             ...
AT1G01020                             ...                             ...
AT1G01030                             ...                             ...
AT1G01040                             ...                             ...
Log2 fold changes
> countDFrpkm_mean <- cbind(countDFrpkm_mean, log2ratio=log2(countDFrpkm_mean[,2]/countDFrpkm_mean[,1]))
> countDFrpkm_mean <- countDFrpkm_mean[is.finite(countDFrpkm_mean[,3]), ]
> degs2fold <- countDFrpkm_mean[countDFrpkm_mean[,3] >= 1 | countDFrpkm_mean[,3] <= -1,]
> degs2fold[1:4,]
          SRR064154.fastq_SRR064155.fastq SRR064166.fastq_SRR064167.fastq log2ratio
AT1G01010                             ...                             ...       ...
AT1G01020                             ...                             ...       ...
AT1G01030                             ...                             ...       ...
AT1G01040                             ...                             ...       ...
> write.table(degs2fold, "./results/degs2fold.xls", quote=FALSE, sep="\t", col.names = NA)
> degs2fold <- read.table("./results/degs2fold.xls")
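
The fold change filter can be sketched without the external colAg helper. The example below is a base R stand-in on a made-up RPKM matrix; gene names, sample names and values are hypothetical, and rowMeans per group replaces colAg.

```r
# Hypothetical RPKM matrix: 4 genes x 4 samples, two replicates per condition
rpkm <- matrix(c(10, 12, 40, 48,
                  5,  5,  5,  5,
                 80, 80, 20, 20,
                  0,  0,  3,  5),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), c("A1", "A2", "B1", "B2")))
group <- c(1, 1, 2, 2)
# Replicate means per group (stand-in for colAg with myfct=mean)
means <- sapply(unique(group), function(g) rowMeans(rpkm[, group == g, drop = FALSE]))
log2ratio <- log2(means[, 2] / means[, 1])
# Drop infinite ratios (zero mean in one group), then keep changes of at least 2-fold
keep <- is.finite(log2ratio) & abs(log2ratio) >= 1
print(names(log2ratio)[keep])
```

gene4 has a zero mean in the first group, so its ratio is infinite and it is dropped by the is.finite step, just as on the slide. This is one reason why the simple fold change method loses genes that are switched on or off completely.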

Identify DEGs with DESeq Library
Raw count data are expected here!
> library(DESeq)
> countDF <- read.table("./results/countDF")
> conds <- targets$Factor
> cds <- newCountDataSet(countDF, conds) # Creates object of class CountDataSet derived from eSet class
> counts(cds)[1:4, ] # CountDataSet has similar accessor methods as eSet class.
          SRR064154.fastq SRR064155.fastq SRR064166.fastq SRR064167.fastq
AT1G01010              52              26              60              75
AT1G01020             145              77              82              64
AT1G01030               5               1              13              14
AT1G01040             482             347             302             358
> cds <- estimateSizeFactors(cds) # Estimates library size factors from count data. Alternatively, one can provi...
> cds <- estimateDispersions(cds) # Estimates the variance within replicates
> res <- nbinomTest(cds, "AP3", "TRL") # Calls DEGs with nbinomTest
> res <- na.omit(res)
> res2fold <- res[res$log2FoldChange >= 1 | res$log2FoldChange <= -1,]
> res2foldpadj <- res2fold[res2fold$padj <= 0.05, ]
> res2foldpadj[1:4,1:8]
          id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
5  AT1G01050      ...       ...       ...        ...            ...  ...  ...
6  AT1G01060      ...       ...       ...        ...            ...  ...  ...
7  AT1G01070      ...       ...       ...        ...            ...  ...  ...
14 AT2G01008      ...       ...       ...        ...            ...  ...  ...
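
The library size factors estimated by estimateSizeFactors follow the median-of-ratios idea described by Anders and Huber (2010). The base R sketch below reimplements that idea on a made-up count matrix to show why it is robust to a few strongly changed genes; it is not the DESeq code itself, and all names and values are hypothetical.

```r
# Hypothetical raw counts: 4 genes x 3 samples
# S2 is sequenced twice as deeply as S1; S3 equals S1 except for one strongly induced gene
counts <- cbind(S1 = c(10, 20, 30, 40),
                S2 = c(20, 40, 60, 80),
                S3 = c(10, 20, 30, 400))
# Per-gene geometric mean across samples (on the log scale) as pseudo-reference
log_geo_means <- rowMeans(log(counts))
# Size factor = median over genes of count / geometric mean
size_factors <- apply(counts, 2, function(cnts) {
  ok <- is.finite(log_geo_means) & cnts > 0   # skip genes with zero counts
  exp(median(log(cnts[ok]) - log_geo_means[ok]))
})
print(round(size_factors, 3))
normalized <- sweep(counts, 2, size_factors, "/")   # counts on a common scale
```

The size factor of S2 comes out as exactly twice that of S1, while S3 receives the same factor as S1 even though its total read count is much larger. Scaling by total counts would have been distorted by the single induced gene.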

Identify DEGs with edgeR's Exact Method
DEG analysis with classical edgeR approach. Note: raw read count data are expected by all methods!
> library(edgeR)
> countDF <- read.table("./results/countDF")
> y <- DGEList(counts=countDF, group=conds) # Constructs DGEList object
> y <- estimateCommonDisp(y) # Estimates common dispersion
> y <- estimateTagwiseDisp(y) # Estimates tagwise dispersion
> et <- exactTest(y, pair=c("AP3", "TRL")) # Computes exact test for the negative binomial distribution.
> topTags(et, n=4)
Comparison of groups: TRL-AP3
          logFC logCPM PValue FDR
AT3G01120   ...    ...    ... ...
AT1G01100   ...    ...    ... ...
AT1G01050   ...    ...    ... ...
ATMG00030   ...    ...    ... ...
> edge <- as.data.frame(topTags(et, n=50000))
> edge2fold <- edge[edge$logFC >= 1 | edge$logFC <= -1,]
> edge2foldpadj <- edge2fold[edge2fold$FDR <= 0.01, ]

Identify DEGs with edgeR's GLM Approach
DEG analysis with edgeR using generalized linear models (glms)
> library(edgeR)
> countDF <- read.table("./results/countDF")
> y <- DGEList(counts=countDF, group=conds) # Constructs DGEList object
> ## Filtering and normalization
> keep <- rowSums(cpm(y)>1) >= 2; y <- y[keep, ]
> y <- calcNormFactors(y)
> design <- model.matrix(~0+group, data=y$samples); colnames(design) <- levels(y$samples$group) # Design matrix
> ## Estimate dispersion
> y <- estimateGLMCommonDisp(y, design, verbose=TRUE) # Estimates common dispersions
Disp = 0.01892 , BCV = 0.1375
> y <- estimateGLMTrendedDisp(y, design) # Estimates trended dispersions
> y <- estimateGLMTagwiseDisp(y, design) # Estimates tagwise dispersions
> ## Fit the negative binomial GLM for each tag
> fit <- glmFit(y, design) # Returns an object of class DGEGLM
> contrasts <- makeContrasts(contrasts="AP3-TRL", levels=design) # Contrast matrix is optional
> lrt <- glmLRT(fit, contrast=contrasts[,1]) # Takes DGEGLM object and carries out the likelihood ratio test.
> edgeglm <- as.data.frame(topTags(lrt, n=length(rownames(y))))
> ## Filter on fold change and FDR
> edgeglm2fold <- edgeglm[edgeglm$logFC >= 1 | edgeglm$logFC <= -1,]
> edgeglm2foldpadj <- edgeglm2fold[edgeglm2fold$FDR <= 0.01, ]
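
Two details of the edgeR output can be checked with base R alone. The biological coefficient of variation (BCV) reported by the verbose dispersion estimate is the square root of the dispersion, and the FDR column returned by topTags is by default a Benjamini-Hochberg adjustment of the p-values. The sketch below reproduces both on a made-up result table; the gene names, logFC values and p-values are hypothetical.

```r
# Dispersion value reported on the slide; BCV is its square root
disp <- 0.01892
bcv  <- sqrt(disp)
print(signif(bcv, 4))

# Hypothetical test results for four genes
tab <- data.frame(logFC  = c(3.2, -0.4, -1.5, 1.1),
                  PValue = c(1e-6, 0.2, 0.004, 0.03),
                  row.names = paste0("gene", 1:4))
tab$FDR <- p.adjust(tab$PValue, method = "BH")   # Benjamini-Hochberg adjustment
# Same filter as on the slide: at least 2-fold change and FDR of at most 1%
hits <- tab[abs(tab$logFC) >= 1 & tab$FDR <= 0.01, ]
print(hits)
```

gene4 passes the fold change cutoff but is removed by the FDR cutoff, which shows why the two filters are applied together rather than the fold change alone.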

Comparison Among DEG Results
> source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
> setlist <- list(edgeRexact=rownames(edge2foldpadj), edgeRglm=rownames(edgeglm2foldpadj), DESeq=as.character(re...
> OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets")
> counts <- sapply(OLlist$Venn_List, length)
> vennPlot(counts=counts, mymain="DEG Comparison")
[Figure: 4-way Venn diagram "DEG Comparison" of the edgeRexact, edgeRglm, DESeq and RPKM sets. Unique objects: All = 84, S1 = 64, S2 = 46, S3 = 31, S4 = 74]
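
The counts shown in a Venn diagram are the sizes of the exclusive intersections of the input sets. As a rough base R sketch of that bookkeeping (not the overLapper implementation), the following example labels each identifier by the combination of sets that contain it. The set names and gene identifiers are hypothetical.

```r
# Hypothetical DEG sets from three methods
setlist <- list(methodA = c("g1", "g2", "g3", "g4"),
                methodB = c("g2", "g3", "g5"),
                methodC = c("g3", "g4", "g5", "g6"))
all_ids <- sort(unique(unlist(setlist)))
# Logical membership matrix: one row per gene, one column per set
membership <- sapply(setlist, function(s) all_ids %in% s)
rownames(membership) <- all_ids
# Label each gene by the sets containing it, e.g. "methodA_methodB"
combo <- apply(membership, 1, function(m) paste(colnames(membership)[m], collapse = "_"))
venn_counts <- table(combo)
print(venn_counts)
```

Each gene is counted exactly once, so the counts sum to the number of unique identifiers, which corresponds to the "All" value reported under the diagram.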

Heatmap of Top Ranking DEGs
The following shows the scaled expression values (here RPKMs) in form of a heatmap. Note: gene-wise clustering is not possible with a single sample pair.
> library(lattice); library(gplots)
> y <- countDFrpkm[rownames(edgeglm2foldpadj)[1:20],]
> colnames(y) <- targets$Factor
> y <- t(scale(t(as.matrix(y))))
> y <- y[order(y[,1]),]
> levelplot(t(y), height=0.2, col.regions=colorpanel(40, "darkblue", "yellow", "white"), main="Expression Values (DEG Filter: FDR 1%, FC > 2)")
[Figure: heatmap "Expression Values (DEG Filter: FDR 1%, FC > 2)" of the top 20 DEGs; columns AP3, AP3, TRL, TRL; rows labeled by Gene ID; color scale from -1.5 to 1.5]
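
The expression t(scale(t(y))) converts every gene (row) to z-scores, because scale works column-wise and the matrix is therefore transposed before and after. A minimal base R sketch with two made-up genes shows the effect:

```r
# Two hypothetical genes with the same profile at very different expression levels
x <- rbind(geneA = c(1, 2, 3, 4),
           geneB = c(100, 200, 300, 400))
z <- t(scale(t(x)))      # center and scale each row
print(round(z, 3))
print(rowMeans(z))       # approximately 0 for every gene
print(apply(z, 1, sd))   # 1 for every gene
```

After scaling both genes have identical values, so the heatmap colors reflect the pattern across samples rather than the absolute expression level.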

Enrichment of GO Terms in DEG Sets
The following performs GO term enrichment analysis of one of the identified DEG sets using the GOstats package. Another package to consider here, among many others, is goseq, which accounts for gene length bias in RNA-Seq data.
> library(GOstats); library(GO.db); library(ath1121501.db)
> geneUniverse <- rownames(countDF)
> geneSample <- res2foldpadj[,1]
> params <- new("GOHyperGParams", geneIds = geneSample,
+     universeGeneIds = geneUniverse, annotation="ath1121501", ontology = "MF", pvalueCutoff = 0.5,
+     conditional = FALSE, testDirection = "over")
> hgOver <- hyperGTest(params)
> summary(hgOver)[1:4,]
      GOMFID      Pvalue OddsRatio ExpCount Count Size                                                  Term
1 GO:0008324 0.002673178        18 2.126582     6    7                      cation transmembrane transporter a...
2 GO:0015075 0.002673178        18 2.126582     6    7                         ion transmembrane transporter a...
3 GO:0015077 0.002673178        18 2.126582     6    7 monovalent inorganic cation transmembrane transporter a...
4 GO:0015078 0.002673178        18 2.126582     6    7                hydrogen ion transmembrane transporter a...
> htmlReport(hgOver, file = "results/MyhyperGresult.html")
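
The columns of the summary table can be reproduced with the hypergeometric distribution in base R. The sketch below assumes that 79 universe genes and 24 sample genes carry usable annotations; these two numbers are not shown on the slide and are back-calculated here from ExpCount = Size * sample/universe, so they should be treated as an assumption.

```r
universe_size <- 79   # assumed number of annotated genes in the universe
sample_size   <- 24   # assumed number of annotated genes in the DEG sample
term_size     <- 7    # 'Size': universe genes annotated with the GO term
term_count    <- 6    # 'Count': sample genes annotated with the GO term

# Expected number of term genes in the sample under random sampling
exp_count <- term_size * sample_size / universe_size

# Over-representation p-value: P(X >= term_count)
pval <- phyper(term_count - 1, term_size, universe_size - term_size,
               sample_size, lower.tail = FALSE)

# Odds ratio of the 2x2 table (in term vs. not in term, in sample vs. not in sample)
odds_ratio <- (term_count / (term_size - term_count)) /
  ((sample_size - term_count) / (universe_size - sample_size - (term_size - term_count)))

print(c(ExpCount = exp_count, Pvalue = pval, OddsRatio = odds_ratio))
```

Under these assumptions the three values agree with the ExpCount, Pvalue and OddsRatio columns of the table above, which suggests that the test behind hyperGTest with conditional = FALSE is a plain one-sided hypergeometric test.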

Inspect Results in IGV
View results in IGV
Download and open IGV
Select in menu in top left corner A. thaliana (TAIR10)
Upload the following indexed/sorted Bam files with File -> Load from URL...
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/results/SRR064154.fastq.ba
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/results/SRR064155.fastq.ba
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/results/SRR064166.fastq.ba
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rrnaseq/results/SRR064167.fastq.ba
To view area of interest, enter its coordinates Chr1:49,457-51,457 in position menu on top.

Generate Similar View with ggbio Programmatically
> library(ggbio)
> AP3 <- readGAlignmentsFromBam("./results/SRR064154.fastq.bam", use.names=TRUE, param=ScanBamParam(which=GRange...
> TRL <- readGAlignmentsFromBam("./results/SRR064166.fastq.bam", use.names=TRUE, param=ScanBamParam(which=GRange...
> p1 <- autoplot(AP3, geom = "rect", aes(color = strand, fill = strand))
> p2 <- autoplot(TRL, geom = "rect", aes(color = strand, fill = strand))
> p3 <- autoplot(txdb, which=GRanges("Chr1", IRanges(49457, 51457)), names.expr = "gene_id")
> tracks(AP3=p1, TRL=p2, Transcripts=p3, heights = c(0.3, 0.3, 0.4)) + ylab("")
[Figure: three stacked tracks. AP3 and TRL show the aligned reads colored by strand (+/-); Transcripts shows the AT1G01100 gene models. The x-axis spans about 50 kb to 51 kb on Chr1.]

Exercise 2: Venn Diagram for Up/Down DEGs
Task 1 Store the identifiers of the upregulated genes from each of the four DEG methods in separate components of a list. Note: the definition of up and down is arbitrary and one needs to check how it is defined by the different DEG methods!
Task 2 Do the same for the downregulated genes.
Task 3 Compare the overlaps among the different up/down sets in a single 4-way venn diagram.

Analysis of Differential Exon Usage with DEXSeq
Number of reads overlapping gene ranges
> source("data/Fct/gffexonDEXSeq.R")
> gffexonDEXSeq <- exons2DEXSeq(gff=gff)
> ids <- as.character(elementMetadata(gffexonDEXSeq)[, "ids"])
> countDFdex <- data.frame(row.names=ids)
> for(i in samplespath) {
+     aligns <- readBamGappedAlignments(i) # Substitute next two lines with this one.
+     counts <- countOverlaps(gffexonDEXSeq, aligns)
+     countDFdex <- cbind(countDFdex, counts)
+ }
> colnames(countDFdex) <- samples
> countDFdex[1:4,1:2]
                                                           SRR064154.fastq SRR064155.fastq
Parent=AT1G01010:E001__Chr1_3631_3913_+_Parent=AT1G01010.1               2               4
Parent=AT1G01010:E002__Chr1_3996_4276_+_Parent=AT1G01010.1               2               1
Parent=AT1G01010:E003__Chr1_4486_4605_+_Parent=AT1G01010.1               3               3
Parent=AT1G01010:E004__Chr1_4706_5095_+_Parent=AT1G01010.1               6               1
> write.table(countDFdex, "./results/countDFdex", quote=FALSE, sep="\t", col.names = NA)
> countDFdex <- read.table("./results/countDFdex")

Analysis of Differential Exon Usage with DEXSeq
Identify genes with differential exon usage
> library(DEXSeq)
> samples <- as.character(targets$Factor); names(samples) <- targets$FileName
> countDFdex[is.na(countDFdex)] <- 0
> ## Construct ExonCountSet from scratch
> exset <- newExonCountSet2(countDF=countDFdex) # fData(exset)[1:4,]
> ## Performs normalization
> exset <- estimateSizeFactors(exset)
> ## Evaluate variance of the data by estimating dispersion using Cox-Reid (CR) likelihood estimation
> exset <- estimateDispersions(exset)
.... Done
> ## Fits dispersion-mean relation to the individual CR dispersion values
> exset <- fitDispersionFunction(exset)
> ## Performs Chi-squared test on each exon and Benjamini-Hochberg p-value adjustment for multiple testing
> exset <- testForDEU(exset)
> ## Estimates fold changes of exons
> exset <- estimatelog2FoldChanges(exset)
> ## Obtain results in data frame
> deuDF <- DEUresultTable(exset)
> ## Count number of genes with differential exon usage
> table(tapply(deuDF$padjust < 0.01, geneIDs(exset), any))
FALSE  TRUE
   20     1
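
The last command collapses exon-level results to the gene level: a gene is counted as TRUE if any of its exons has an adjusted p-value below 0.01. The base R sketch below repeats this step on made-up values (gene labels and p-values are hypothetical) and also shows how untested exons, which carry NA values, affect the summary.

```r
# Hypothetical adjusted p-values for six exons belonging to two genes
padjust <- c(0.5, 0.003, 0.2, 0.8, 0.6, NA)
gene    <- c("g1", "g1", "g1", "g2", "g2", "g2")

# As on the slide: a gene with an NA exon and no significant exon yields NA
any_raw <- tapply(padjust < 0.01, gene, any)
print(any_raw)

# Variant that ignores untested exons
any_sig <- tapply(padjust < 0.01, gene, any, na.rm = TRUE)
print(table(any_sig))
```

Because table drops NA entries by default, genes of the first kind silently disappear from the counts. Whether to pass na.rm = TRUE is a design choice that depends on whether untested exons should count as "no evidence".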

DEXSeq Plots
Sample plot showing fitted expression of exons
> plotDEXSeq(exset, "Parent=AT1G01100", displayTranscripts=TRUE, expression=TRUE, legend=TRUE)
> ## Generate many plots and write them to results directory
> mygeneIDs <- unique(as.character(na.omit(deuDF[deuDF$geneID %in% unique(deuDF$geneID),])[,"geneID"]))
> DEXSeqHTML(exset, geneIDs=mygeneIDs, path="results", file="DEU.html")
[Figure: fitted expression (log scale, 1 to 1000) of exons E001 to E006 of Parent=AT1G01100 (- strand) for AP3 and TRL; x-axis positions 50090 to 51197]

Session Information
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
...
loaded via a namespace (and not attached):
...

References I
Anders, S., Huber, W., 2010. Differential expression analysis for sequence count data. Genome Biol 11 (10).
URL http://www.hubmed.org/display.cgi?uids=20979621
Anders, S., Reyes, A., Huber, W., Oct 2012. Detecting differential usage of exons from RNA-seq data. Genome Res 22 (10), 2008–2017.
URL http://www.hubmed.org/display.cgi?uids=22722343
Robinson, M. D., McCarthy, D. J., Smyth, G. K., Jan 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26 (1), 139–140.
URL http://www.hubmed.org/display.cgi?uids=19910308
Robinson, M. D., Oshlack, A., Mar 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11 (3).
URL http://www.hubmed.org/display.cgi?uids=20196867
Trapnell, C., Hendrickson, D. G., Sauvageau, M., Goff, L., Rinn, J. L., Pachter, L., Jan 2013. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31 (1), 46–53.
URL http://www.hubmed.org/display.cgi?uids=23222703