RNA-seq Bioinformatics:

Format and QC of short reads

Lucy Poveda, PhD


lucy.poveda@fgcz.ethz.ch
BIO675
Zurich, October 2020
Basic workflow of RNA-seq
1. Total RNA
2. Enrich for mRNA
3. Fragmented mRNA
4. Convert mRNA to cDNA
5. + Sequencing adapters → Library
6. Denature library (dsDNA -> ssDNA)
7. Next Generation Sequencer
8. Reads/fastq, e.g.
AGCTAGCGGCTGAAACTTGCAGCATAC
TGCATAGGCTGAAGGGCTGCAGCAGCA
GGCATAGGATGCAGCATAAGCAGTTA
TTAGCATAGATGCAGCTTGGCAAGTA
GTAGCATAGATGCAGCTGGGCAACGAA
Next Generation sequencers

            | Short-length and high number of reads | Long-length and lower number of reads
            | (2nd generation sequencers)           | (Single molecule sequencers)
Optical     | Illumina                              | PacBio
Non-optical | Ion Torrent                           | Oxford Nanopore
Workflow (step → output)
1. Experimental Design → RNA
2. Sequencing → fastq
3. Data Quality Control → fastq
4. Read Mapping (input: Reference Genome, fasta) → SAM/BAM
5. Read Quantification (input: Reference Transcriptome, GFF/GTF) → counts
6. Differential Expression Analysis
This lecture covers the Sequencing and Data Quality Control steps of the workflow:
• Short read file format
• Quality control of short reads
Properties of sequencing data

Technology | Read length                                 | Accuracy                                 | Major error type
Sanger     | 400 to 900 bp                               | 99.9%                                    | Mismatch
Illumina   | 50 to 300 bp                                | 98%                                      | Mismatch
ONT        | Limited only by the DNA molecules presented | 65%-88%                                  | Indel
PacBio     | 10 kb to > 40 kb                            | 99.9999% circular consensus; 87% subread | Indel
Single-end versus paired-end sequencing
• Single-end (SE) reads: each library fragment is sequenced from one end only
– Cheaper and faster

Cluster 1 > WT1_R1: TCAGTT…

• Paired-end (PE) reads: each library fragment is sequenced from both ends
– Improves the assignment of a fragment to its position in the reference genome, especially in areas close to repeat regions
Cluster 1 > WT1_R1: TCAGTT…
+ Cluster 1 > WT1_R2: CTATCG…
Sequence data – fastq files

• Data delivery (fastq files)
– http://fgcz-gstore.uzh.ch/projects/pXXXX/NextSeq500_20160413_NS38_o2430/

• File names
– Single end
  • 20160413.A-APC_mut_1_R1.fastq.gz
  • 20160413.A-APC_mut_2_R1.fastq.gz
– Paired end
  • 20160413.A-APC_mut_1_R1.fastq.gz
  • 20160413.A-APC_mut_1_R2.fastq.gz
  • 20160413.A-APC_mut_2_R1.fastq.gz
  • 20160413.A-APC_mut_2_R2.fastq.gz
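The naming convention above can be used to pair R1 and R2 files automatically. The following is a minimal sketch, assuming every file ends in `_R1.fastq.gz` or `_R2.fastq.gz` as in the examples; the helper names `group_fastq_files` and `is_paired_end` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <sample>_R1.fastq.gz and <sample>_R2.fastq.gz
FASTQ_NAME = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq\.gz$")


def group_fastq_files(file_names):
    """Map each sample name to its files, keyed by mate number (1 or 2)."""
    samples = defaultdict(dict)
    for name in file_names:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed scheme
        samples[match["sample"]][int(match["mate"])] = name
    return dict(samples)


def is_paired_end(mates):
    """A sample is paired-end when both an R1 and an R2 file are present."""
    return 1 in mates and 2 in mates


files = [
    "20160413.A-APC_mut_1_R1.fastq.gz",
    "20160413.A-APC_mut_1_R2.fastq.gz",
    "20160413.A-APC_mut_2_R1.fastq.gz",
    "20160413.A-APC_mut_2_R2.fastq.gz",
]
groups = group_fastq_files(files)
for sample, mates in groups.items():
    print(sample, "paired-end" if is_paired_end(mates) else "single-end")
```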
Fastq file format

1. Header line for Read (starts with “@” and the sequence ID)
2. Sequence
3. Header line for Qualities (starts with “+”)
4. Quality score
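
The four-line structure above can be read with a few lines of Python. This is a minimal sketch, assuming each record occupies exactly four lines (no wrapped sequences); the read ID and bases in the example are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines: @header, sequence, +header, qualities."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical single-record file held in memory
example = io.StringIO("@WT1_R1_read1\nTCAGTT\n+\nIIIII5\n")
records = list(read_fastq(example))
print(records[0])
```

For a compressed file, the same function could be given a handle from `gzip.open(path, "rt")`.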

Phred scores
• Measure base-calling accuracy: the accuracy of assigning bases (nucleobases) to signal peaks
• P: error probability of a given base call
• Q: $Q = -10 \log_{10} P$
• Assigned to each base
• Range from 0 to 41 for Illumina sequencing

Ewing B, Green P. 1998. Genome Res. 8(3):186-194.
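
As a worked example of the relation between P and Q, the sketch below converts in both directions. The function names are hypothetical; only the formula comes from the slide.

```python
import math


def error_probability_to_phred(p):
    """Q = -10 * log10(P), for an error probability 0 < P <= 1."""
    return -10 * math.log10(p)


def phred_to_error_probability(q):
    """Inverse relation: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

Each step of 10 in Q corresponds to a tenfold lower error probability.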


How are Phred scores generated?

Inputs to the Phred score:
• Signal intensity
• Signal resolution (i.e. signal/noise)
• Base position and composition in the read

• Parameters are measured while sequencing real samples
• Exact sequences are not known
• Scores are assigned by searching the look-up table

Phred algorithm (Ewing and Green, 1998)


Phred scores can be ASCII encoded

• Add an offset to the score and convert the sum to an ASCII character

• Current format
– Illumina 1.9 (i.e. Sanger format)
– Phred scoring: 0-41
– Offset: 33
– 41 + 33 = 74 (J)
– Used by all current sequencers
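
The encoding can be sketched as follows, assuming the offset of 33 given above. The function names are hypothetical.

```python
PHRED_OFFSET = 33  # Sanger / Illumina 1.9 encoding, as stated on the slide


def encode_quality(scores):
    """Turn a list of Phred scores into a quality string."""
    return "".join(chr(q + PHRED_OFFSET) for q in scores)


def decode_quality(quality_string):
    """Turn a quality string back into a list of Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality_string]


print(encode_quality([41]))   # 41 + 33 = 74, the ASCII code of "J"
print(decode_quality("I5!"))  # scores for the characters "I", "5" and "!"
```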
Millions of reads

- FastQC
- FastqScreen
FastQC

• Can be embedded in workflows as a data analysis module

• Can also be run as an independent application with a GUI on a laptop
– http://www.youtube.com/watch?v=bz93ReOv87Y
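One way to embed FastQC in a workflow is to call its command-line interface from a script. This is a minimal sketch, assuming the `fastqc` executable is on the PATH and using its `--outdir` and `--threads` options; the wrapper functions are hypothetical and not part of FastQC.

```python
import os
import shutil
import subprocess


def build_fastqc_command(fastq_files, out_dir, threads=1):
    """Assemble the FastQC command line for a list of fastq files."""
    return ["fastqc", "--outdir", out_dir, "--threads", str(threads), *fastq_files]


def run_fastqc(fastq_files, out_dir, threads=1):
    """Run FastQC; reports are written to out_dir."""
    if shutil.which("fastqc") is None:
        raise RuntimeError("fastqc executable not found on PATH")
    os.makedirs(out_dir, exist_ok=True)  # make sure the output directory exists
    subprocess.run(build_fastqc_command(fastq_files, out_dir, threads), check=True)


print(build_fastqc_command(["20160413.A-APC_mut_1_R1.fastq.gz"], "fastqc_out", threads=2))
```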
Boxplots, histograms, heatmaps

• FastQC uses boxplots and histograms to analyse Phred scores and GC content

• Heatmaps can also be used. They are more visual, and colors can represent different information
Different scenarios

Library fragment: Adaptor1 | Insert DNA/cDNA | Adaptor2

Insert length | Read length | Read content
300 bp        | 100 bp      | 100 bp insert DNA/cDNA
100 bp        | 100 bp      | 100 bp insert DNA/cDNA
75 bp         | 100 bp      | 75 bp insert DNA/cDNA + 25 bp Adaptor2
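The three scenarios reduce to simple arithmetic: a read runs into the adaptor only when the insert is shorter than the read. A minimal sketch, assuming the adaptor is at least as long as the overhang; `read_composition` is a hypothetical helper.

```python
def read_composition(insert_length, read_length):
    """Return (insert bases, adaptor bases) contained in a single read."""
    insert_bases = min(insert_length, read_length)
    adaptor_bases = read_length - insert_bases  # read-through into Adaptor2
    return insert_bases, adaptor_bases


for insert_length in (300, 100, 75):
    insert_bases, adaptor_bases = read_composition(insert_length, read_length=100)
    print(f"insert {insert_length} bp: {insert_bases} bp insert + {adaptor_bases} bp adaptor")
```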
Bias and errors

• Library construction could introduce bias
– Fragmentation, ligation, amplification
– GC bias
– Over-amplification
– Contamination

• Sequencing errors
– Chemical, optical, computational
Per base sequence quality - FastQC

• Range of quality values across all bases at each position
• Background colours: green, >Q28 (good); orange, >Q20 (reasonable); red, <Q20 (poor)

• Good example (median > Q25)
– High and relatively consistent quality along the reads
– Quality degrading with increasing read length is normal: apply quality trimming

• Poor example (median < Q20)
– Poor quality at the beginning: check per tile sequence quality
– Large variance: check per sequence quality scores
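
The per-position summary can be sketched directly from the quality strings. This is a simplified illustration rather than FastQC's implementation: it assumes an offset of 33, uses only the median, and treats exactly Q20 as "reasonable" because the thresholds above leave that value unassigned.

```python
from statistics import median


def per_base_median_quality(quality_strings, offset=33):
    """Median Phred score at each read position, over all reads covering it."""
    columns = {}
    for quality in quality_strings:
        for position, char in enumerate(quality):
            columns.setdefault(position, []).append(ord(char) - offset)
    return [median(columns[position]) for position in sorted(columns)]


def classify(score):
    """Colour zones of the plot: >Q28 good, >Q20 reasonable, <Q20 poor."""
    if score > 28:
        return "good"
    if score >= 20:  # assumption: Q20 itself counts as reasonable
        return "reasonable"
    return "poor"


# Three made-up reads whose quality drops towards the 3' end
medians = per_base_median_quality(["IIII5", "III5!", "II5!!"])
for position, score in enumerate(medians, start=1):
    print(position, score, classify(score))
```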
Per sequence quality scores - FastQC

• Shows whether a subset of sequences has universally low quality values

• Good example (mean > Q27)
– Single sharp peak

• Poor example (mean < Q20)
– Bi-modal distribution: check per tile sequence quality
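
The distribution behind this plot can be approximated by binning the mean quality of each read. A minimal sketch, assuming an offset of 33 and non-empty reads; the function names are hypothetical.

```python
from collections import Counter


def mean_quality(quality_string, offset=33):
    """Mean Phred score of one read (the read must not be empty)."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


def mean_quality_histogram(quality_strings):
    """Number of reads per integer mean quality (rounded down)."""
    return Counter(int(mean_quality(q)) for q in quality_strings)


# Made-up reads: a second peak near Q0 would make the distribution bi-modal
histogram = mean_quality_histogram(["IIII", "II55", "!!!!"])
for score in sorted(histogram):
    print(f"Q{score}: {histogram[score]} read(s)")
```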
Per base sequence content - FastQC

• The proportion of A, T, G, and C at each position

• Expected
– A=T, G=C
– Levels reflect the GC content of the sample
– Smooth over the read length

• Flagged
– A and T (or G and C) differ by more than 20%
– Biased composition at the read beginning: expected with biased priming protocols, e.g. RNA-seq
– Biased composition throughout: expected with biased composition libraries, e.g. bisulfite sequencing

Treatment of DNA with bisulfite converts cytosine to uracil, but leaves methylated cytosine unaffected. Therefore, DNA that has been treated with bisulfite retains only methylated cytosines.
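
The check can be sketched as a per-position base count followed by the 20% rule. This is an illustration under simple assumptions (only A, C, G, T are counted, N is ignored); the function names are hypothetical.

```python
from collections import Counter


def per_base_content(sequences):
    """Fraction of A, C, G and T at each read position."""
    counts = []
    for sequence in sequences:
        for position, base in enumerate(sequence.upper()):
            if position == len(counts):
                counts.append(Counter())  # first read reaching this position
            counts[position][base] += 1
    fractions = []
    for counter in counts:
        total = sum(counter[base] for base in "ACGT")
        fractions.append({b: (counter[b] / total if total else 0.0) for b in "ACGT"})
    return fractions


def flag_biased_positions(fractions, threshold=0.20):
    """Positions where A vs T or G vs C differ by more than the threshold."""
    return [
        position
        for position, f in enumerate(fractions)
        if abs(f["A"] - f["T"]) > threshold or abs(f["G"] - f["C"]) > threshold
    ]


fractions = per_base_content(["ACGT", "AGCT", "ATGC", "AACG"])  # made-up reads
print(flag_biased_positions(fractions))
```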
Per sequence GC content - FastQC

• Distribution of the average GC content over all reads

• We expect to see a roughly normal distribution of GC content
– The peak corresponds to the overall GC content of the underlying genome

• Bi-modal/unusual distribution
– Contaminated/biased subset, e.g. adaptor dimers, rRNA etc.
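
A GC distribution of this kind can be computed per read and binned. A minimal sketch, assuming 5% bins and ignoring bases other than A, C, G, T; the function names and example reads are hypothetical.

```python
from collections import Counter


def gc_fraction(sequence):
    """Fraction of G and C among the called bases (A, C, G, T) of one read."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / called


def gc_histogram(sequences, bin_width=5):
    """Count reads per GC% bin; keys are the lower edge of each bin."""
    histogram = Counter()
    for sequence in sequences:
        percent = 100 * gc_fraction(sequence)
        # reads with 100% GC are placed in the last bin
        bin_start = min(int(percent // bin_width) * bin_width, 100 - bin_width)
        histogram[bin_start] += 1
    return histogram


histogram = gc_histogram(["ATGC", "ATGC", "GGCC", "AATT"])
for bin_start in sorted(histogram):
    print(f"{bin_start}-{bin_start + 5}% GC: {histogram[bin_start]} read(s)")
```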
Sequence duplication - FastQC

• Relative number of sequences with different degrees of duplication

• Low level duplication is expected for a diverse library

• High level duplication: enrichment bias, saturated sequencing depth
– Normal for RNA-seq (high sequencing depth) and ChIP-seq (enriched libraries)
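
Duplication levels can be illustrated by counting identical reads. This sketch is a simplification that compares whole reads exactly and holds everything in memory; it is not FastQC's implementation, and the function names are hypothetical.

```python
from collections import Counter


def duplication_levels(sequences):
    """Map each copy number to the number of distinct sequences seen that often."""
    copies = Counter(sequences)
    return Counter(copies.values())


def percent_remaining_after_dedup(sequences):
    """Share of reads left if every sequence were kept only once."""
    return 100 * len(set(sequences)) / len(sequences)


reads = ["AAA", "AAA", "AAA", "CCC", "GGG", "GGG"]  # made-up reads
levels = duplication_levels(reads)
for copy_number in sorted(levels):
    print(f"{levels[copy_number]} distinct sequence(s) seen {copy_number}x")
print(f"{percent_remaining_after_dedup(reads):.1f}% of reads remain after deduplication")
```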
Overrepresented sequences - FastQC

• Sequences that make up > 0.1% of the total

• Compare them with a contamination database to find contamination (e.g. adaptor dimers)

• Can be normal and biologically meaningful
– Highly expressed transcripts
– High copy number repeats
– Less diverse library (amplicons)
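
The 0.1% rule can be sketched by counting exact sequences and reporting those above the threshold. This is an illustration only: the reads are made up, and matching hits against a contamination database is left out.

```python
from collections import Counter


def overrepresented_sequences(sequences, min_fraction=0.001):
    """Return (sequence, fraction) pairs making up more than min_fraction of all reads."""
    counts = Counter(sequences)
    total = len(sequences)
    return [
        (sequence, count / total)
        for sequence, count in counts.most_common()
        if count / total > min_fraction
    ]


# Tiny made-up data set, so a much higher threshold than 0.1% is used here
reads = ["AGATCGGAAGAGC"] * 3 + ["ACGTACGT", "TTGACCAT"]
hits = overrepresented_sequences(reads, min_fraction=0.5)
print(hits)
```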
Adapter Content - FastQC

Example: a 100 bp read from a 75 bp insert (DNA/cDNA) contains 75 bp of insert followed by 25 bp of Adaptor2
Comparative heatmap of per base Phred score
• Reads in one sample / average of reads in all samples
• Sample with better quality than average
• Sample with lower quality than average
FastqScreen – check for sample contamination
• Compare sequencing reads to databases of known sequences
• Report top matches
• In a clonal sample, uniquely mapped reads should come from only a single organism
• Databases: frequently sequenced organisms, rRNA genes (Silva), frequent contaminants
Contamination Check
Sequencing data pre-processing tasks

• Trimming: remove bases from read end(s)
– Adaptor sequence
– Low quality bases

• Filtering: remove reads
– Low quality reads
– Contaminating sequences
– Low complexity reads (repeats)
– Short (<20bp) reads: they slow down mapping software
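Quality trimming and length filtering can be sketched together. The following is a minimal illustration, not the algorithm of any particular tool: it assumes an offset of 33 and a Q20 cut-off for 3' trimming, and takes the 20 bp minimum length from the list above.

```python
def trim_low_quality_3prime(sequence, quality, min_quality=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def preprocess(records, min_quality=20, min_length=20):
    """Trim each (read_id, sequence, quality) record, then drop reads that are too short."""
    kept = []
    for read_id, sequence, quality in records:
        sequence, quality = trim_low_quality_3prime(sequence, quality, min_quality)
        if len(sequence) >= min_length:
            kept.append((read_id, sequence, quality))
    return kept


# Two made-up 25 bp reads: one loses 3 bases, the other 15
records = [
    ("read1", "A" * 25, "I" * 22 + "!" * 3),
    ("read2", "C" * 25, "I" * 10 + "!" * 15),
]
for read_id, sequence, quality in preprocess(records):
    print(read_id, len(sequence))
```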
Tools for pre-processing sequencing data

• PRINSEQ
– http://prinseq.sourceforge.net/
– Quality/hard trimming, quality filtering, reformat, ...

• FASTX
– http://hannonlab.cshl.edu/fastx_toolkit/
– Reformat, stats, collapse duplicated reads, trim, filter, reverse complement

• Trimmomatic
– http://www.usadellab.org/cms/?page=trimmomatic
– Adaptor trimming, quality trimming & filtering, ...

• FlexBar (FAR)
– http://sourceforge.net/projects/theflexibleadap/
– Flexible barcode detection and adapter removal
Summary

• Always generate quality plots for all data sets

• Interpreting the plots requires knowledge about the samples and libraries

• Trim and/or filter data if needed
– Always trim and filter away low quality data for variant analysis
Questions ?!

• lucy.poveda@fgcz.ethz.ch
