RNA-seq Bioinformatics: Format and QC of Short Reads

Lucy Poveda, PhD

lucy.poveda@fgcz.ethz.ch
BIO675
Zurich, October 2020
Basic workflow of RNA-seq
[Figure: Total RNA -> Enrich for mRNA -> Fragmented mRNA -> Convert mRNA to cDNA -> Sequencing adapters -> Library -> Denature library (dsDNA -> ssDNA) -> Next Generation Sequencer -> Reads / fastq, e.g. AGCTAGCGGCTGAAACTTGCAGCATAC]
Next Generation sequencers
Detection | 2nd generation sequencers (short reads, high number of reads) | Single molecule sequencers (long reads, lower number of reads)
Optical | Illumina | PacBio
Non-optical | Ion Torrent | Oxford Nanopore
[Figure: RNA-seq analysis workflow. Experimental Design -> RNA -> Sequencing (fastq) -> Data Quality Control (fastq) -> Read Mapping against the Reference Genome (fasta), giving SAM/BAM -> Read Quantification with the Reference Transcriptome (GFF/GTF), giving counts -> Differential Expression Analysis]
Topics covered in this part of the workflow
• Short read file format
• Quality control of short reads
Properties of sequencing data
Technology | Read length | Accuracy | Major error type
Sanger | 400 to 900 bp | 99.9% | Mismatch
Illumina | 50 to 300 bp | 98% | Mismatch
ONT | Limited only by the DNA molecules presented | 65%-88% | Indel
PacBio | 10 kb to > 40 kb | 99.9999% circular consensus; 87% subread | Indel
Single-end versus paired-end sequencing
• Single-end (SE) reads: each library fragment is sequenced from one end only
– Cheaper and faster
– Cluster 1 > WT1_R1: TCAGTT…
• Paired-end (PE) reads: each library fragment is sequenced from both ends
– Improves the assignment of a fragment to a position in the reference genome, especially in areas close to repeat regions
– Cluster 1 > WT1_R1: TCAGTT… + Cluster 1 > WT1_R2: CTATCG…
Sequence data – fastq files
• Data delivery (fastq files)
– http://fgcz-gstore.uzh.ch/projects/pXXXX/NextSeq500_20160413_NS38_o2430/
• File names, single end
– 20160413.A-APC_mut_1_R1.fastq.gz
– 20160413.A-APC_mut_2_R1.fastq.gz
• File names, paired end
– 20160413.A-APC_mut_1_R1.fastq.gz
– 20160413.A-APC_mut_1_R2.fastq.gz
– 20160413.A-APC_mut_2_R1.fastq.gz
– 20160413.A-APC_mut_2_R2.fastq.gz
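The naming convention above can be used to tell single-end from paired-end deliveries. The following is a minimal sketch, assuming every file name ends in _R1.fastq.gz or _R2.fastq.gz as in the examples; the function names and the regular expression are illustrative, not part of any delivery tool.

```python
import re
from collections import defaultdict

# Assumed pattern: <sample>_R1.fastq.gz or <sample>_R2.fastq.gz
FASTQ_NAME = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq\.gz$")


def group_by_sample(file_names):
    """Map each sample name to its R1/R2 files."""
    samples = defaultdict(dict)
    for name in file_names:
        match = FASTQ_NAME.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        samples[match["sample"]]["R" + match["mate"]] = name
    return dict(samples)


def layout(mates):
    """Return 'paired-end' if both R1 and R2 are present, else 'single-end'."""
    return "paired-end" if {"R1", "R2"}.issubset(mates) else "single-end"


file_names = [
    "20160413.A-APC_mut_1_R1.fastq.gz",
    "20160413.A-APC_mut_1_R2.fastq.gz",
    "20160413.A-APC_mut_2_R1.fastq.gz",
]
samples = group_by_sample(file_names)
for sample, mates in samples.items():
    print(sample, layout(mates))
```

In this example the first sample has both mates and the second has only R1.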
Fastq file format
1. Header line for Read (starts with “@” and the sequence ID)
2. Sequence
3. Header line for Qualities (starts with “+”)
4. Quality score
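The four-line layout can be read with a few lines of Python. This is a minimal sketch, assuming uncompressed text input and strictly four lines per read (no wrapped sequences); the record type and function name are illustrative.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str   # header line without the leading "@"
    sequence: str  # base calls
    quality: str   # ASCII-encoded quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines of a fastq file."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, for illustration only
example = io.StringIO("@read1\nTCAGTT\n+\nIIIIII\n@read2\nCTATCG\n+\nIIII#I\n")
records = list(read_fastq(example))
for record in records:
    print(record.read_id, record.sequence, record.quality)
```

For gzip-compressed deliveries, the same function can be given a handle opened with gzip.open(path, "rt").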
Phred scores
• Measure base calling accuracy: the accuracy of assigning bases (nucleobases) to signal peaks
• P: error probability of a given base call
• Q = -10 log10(P)
• Assigned to each base
• Range from 0 to 41 for Illumina sequencing
Ewing B, Green P. 1998. Genome Res. 8(3):186-194.
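The relation between Q and P can be checked numerically. A minimal sketch of the formula above; the function names are illustrative.

```python
import math


def phred_from_error_probability(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)


def error_probability_from_phred(q: float) -> float:
    """Inverse of the formula above: P = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


for q in (10, 20, 30, 40):
    p = error_probability_from_phred(q)
    print(f"Q{q}: P = {p:g}, accuracy = {100 * (1 - p):.2f}%")
```

Each step of 10 in Q corresponds to a tenfold drop in the error probability, so Q20 means 1 error in 100 base calls and Q30 means 1 in 1000.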

How are Phred scores generated?
• Inputs to the Phred score
– Signal intensity
– Signal resolution (i.e. signal/noise)
– Base position and composition in the read
• These parameters are measured while sequencing real samples, for which the exact sequences are not known
• Scores are assigned by searching the look-up table
Phred algorithm (Ewing and Green, 1998)

Phred scores can be ASCII encoded
• Add an offset to the score and convert the sum to an ASCII character
• Current format: Illumina 1.9 (i.e. Sanger format)
– Phred scores: 0-41
– Offset: 33
– Highest score: 41 + 33 = 74 (J)
– Used by all current sequencers
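The encoding step can be sketched in a few lines, assuming the offset of 33 given above; the function names are illustrative.

```python
OFFSET = 33  # Sanger / Illumina 1.9 offset, as stated above


def encode_quality(scores):
    """Turn a list of Phred scores into a fastq quality string."""
    return "".join(chr(q + OFFSET) for q in scores)


def decode_quality(quality_string):
    """Turn a fastq quality string back into a list of Phred scores."""
    return [ord(char) - OFFSET for char in quality_string]


print(encode_quality([0, 20, 41]))
print(decode_quality("I#"))
```

The score 41 becomes the character with ASCII code 74, which is "J", matching the arithmetic on the slide.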
Millions of reads
– FastQC
– FastqScreen
FastQC
• Can be embedded in workflows as a data analysis module
• Can also be run as an independent application with a GUI on a laptop
– http://www.youtube.com/watch?v=bz93ReOv87Y
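One way to embed FastQC in a workflow is to call its command line from a script. This is a minimal sketch, assuming the fastqc executable is installed and on the PATH; the helper names, the output directory and the thread count are illustrative choices, and only the --outdir and --threads options of FastQC are used.

```python
import shutil
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_files, outdir, threads=1):
    """Assemble the FastQC command line without running it."""
    return ["fastqc", "--outdir", str(outdir), "--threads", str(threads),
            *[str(path) for path in fastq_files]]


def run_fastqc(fastq_files, outdir, threads=1):
    """Run FastQC; the reports are written into outdir."""
    if shutil.which("fastqc") is None:
        raise RuntimeError("fastqc executable not found on PATH")
    Path(outdir).mkdir(parents=True, exist_ok=True)  # FastQC expects an existing directory
    subprocess.run(build_fastqc_command(fastq_files, outdir, threads), check=True)


command = build_fastqc_command(
    ["20160413.A-APC_mut_1_R1.fastq.gz"], "fastqc_out", threads=2
)
print(" ".join(command))
```

Separating command construction from execution keeps the command easy to log and inspect before a workflow actually runs it.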
Boxplots, histograms, heatmaps
• FastQC uses boxplots and histograms for the analysis of Phred scores and GC content
• Heatmaps can also be used. They are more visual, and colors can represent different information
Different scenarios
Library fragment: Adaptor1 | Insert DNA/cDNA | Adaptor2
Insert length | Read length | Content of the read
300bp | 100bp | 100bp insert DNA/cDNA
100bp | 100bp | 100bp insert DNA/cDNA
75bp | 100bp | 75bp insert DNA/cDNA + 25bp Adaptor2
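The three scenarios follow from simple arithmetic on insert and read length. A minimal sketch, assuming the read starts at the first base of the insert and that Adaptor2 is at least as long as the read-through; the function name is illustrative.

```python
def read_composition(insert_length: int, read_length: int):
    """Return (insert bases, adaptor bases) contained in one read."""
    insert_bases = min(insert_length, read_length)
    adaptor_bases = read_length - insert_bases  # read-through into Adaptor2
    return insert_bases, adaptor_bases


for insert_length in (300, 100, 75):
    insert_bases, adaptor_bases = read_composition(insert_length, read_length=100)
    print(f"insert {insert_length}bp: {insert_bases}bp insert + {adaptor_bases}bp adaptor")
```

Only inserts shorter than the read length lead to adaptor sequence in the read, which is what adaptor trimming removes.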
Bias and errors
• Library construction could introduce bias
– Fragmentation, ligation, amplification
– GC bias
– Over-amplification
– Contamination
• Sequencing errors
– Chemical, optical, computational
Per base sequence quality - FastQC
• Range of quality values across all bases at each position
• Plot background: green: > Q28, good; orange: > Q20, reasonable; red: < Q20, poor
• Median > Q25
– High and relatively consistent quality along the reads
– Quality degrading with increasing read length is normal (quality trimming)
• Median < Q20
– Poor quality at the beginning (see per tile sequence quality)
– Large variance (see per sequence quality scores)
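The per-position summary behind this plot can be approximated as follows. This is a minimal sketch, not FastQC's own implementation: it assumes offset-33 quality strings, computes only the median per position, and treats scores of exactly 28 and 20 as belonging to the higher band (an assumption, since the slide only gives > and <).

```python
from statistics import median

OFFSET = 33  # assumed Sanger / Illumina 1.9 encoding


def per_base_median_quality(quality_strings):
    """Median Phred score at each read position across all reads."""
    longest = max(len(q) for q in quality_strings)
    medians = []
    for position in range(longest):
        scores = [ord(q[position]) - OFFSET
                  for q in quality_strings if len(q) > position]
        medians.append(median(scores))
    return medians


def classify(score):
    """Colour bands from the plot background."""
    if score >= 28:
        return "good"
    if score >= 20:
        return "reasonable"
    return "poor"


# Made-up quality strings whose quality drops towards the 3' end
qualities = ["IIIIF5", "IIIF5#", "IIII5#"]
medians = per_base_median_quality(qualities)
labels = [classify(m) for m in medians]
print(list(zip(medians, labels)))
```

In this toy input the last two positions fall into the reasonable and poor bands, the typical pattern that quality trimming addresses.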
Per sequence quality scores - FastQC
• Shows whether a subset of sequences has universally low quality values
• Good
– Single sharp peak
– Mean > Q27
• Problematic
– Bi-modal distribution (see per tile sequence quality)
– Mean < Q20
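The distribution shown in this module can be sketched by averaging the scores of each read. A minimal sketch, assuming offset-33 quality strings and a plain arithmetic mean of the Phred scores, binned by flooring to an integer; the names are illustrative.

```python
from collections import Counter

OFFSET = 33  # assumed Sanger / Illumina 1.9 encoding


def mean_quality(quality_string):
    """Arithmetic mean of the Phred scores of one read."""
    scores = [ord(char) - OFFSET for char in quality_string]
    return sum(scores) / len(scores)


def mean_quality_histogram(quality_strings):
    """Count reads per integer mean quality (floored)."""
    return Counter(int(mean_quality(q)) for q in quality_strings)


# Made-up reads: two good, one uniformly poor, one mixed
qualities = ["IIII", "IIII", "####", "II##"]
histogram = mean_quality_histogram(qualities)
for score in sorted(histogram):
    print(f"mean Q{score}: {histogram[score]} read(s)")
```

A second peak at low mean quality, as produced by the "####" read here, is the bi-modal pattern the slide warns about.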
Per base sequence content - FastQC
• The proportion of A, T, G and C at each position
• Expected
– A=T, G=C
– GC content of the sample
– Smooth over length
• Flagged
– A and T (or G and C) differ by more than 20%
– Biased composition at the read beginning: expected with biased priming protocols, e.g. RNA-seq
– Expected with biased composition libraries, e.g. bisulfite sequencing
Treatment of DNA with bisulfite converts cytosine to uracil, but leaves methylated cytosine unaffected. Therefore, DNA that has been treated with bisulfite retains only methylated cytosines.
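The 20% rule can be applied per position as follows. A minimal sketch, assuming percentages are computed over called bases (A, C, G, T) only; the function names and the toy reads are illustrative.

```python
from collections import Counter


def per_base_content(sequences):
    """Percentage of A, C, G and T at each read position."""
    longest = max(len(s) for s in sequences)
    content = []
    for position in range(longest):
        counts = Counter(s[position] for s in sequences if len(s) > position)
        total = sum(counts[base] for base in "ACGT")
        content.append({base: (100.0 * counts[base] / total if total else 0.0)
                        for base in "ACGT"})
    return content


def flagged_positions(content, threshold=20.0):
    """Positions where A vs T or G vs C differ by more than the threshold."""
    return [position for position, pct in enumerate(content)
            if abs(pct["A"] - pct["T"]) > threshold
            or abs(pct["G"] - pct["C"]) > threshold]


# Made-up reads with a biased first base and balanced remaining positions
reads = ["AACG", "ATGC", "ATCG", "AAGC"]
content = per_base_content(reads)
flagged = flagged_positions(content)
print(flagged)
```

Only the first position is flagged here, mimicking the biased composition at the read beginning that is expected for RNA-seq libraries.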
Per sequence GC content - FastQC
• Distribution of the mean GC content across all reads
• Expected: a roughly normal distribution of GC content
– The peak corresponds to the overall GC content of the underlying genome
• Bi-modal/unusual distribution
– Contaminated/biased subset, e.g. adaptor dimers, rRNA etc.
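The GC distribution can be computed read by read. A minimal sketch, assuming GC content is expressed as a percentage of called bases and binned to the nearest integer; the function names and reads are illustrative.

```python
from collections import Counter


def gc_percent(sequence):
    """GC content of one read as a percentage of its A/C/G/T bases."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences):
    """Count reads per integer GC percentage."""
    return Counter(round(gc_percent(s)) for s in sequences)


reads = ["GGCC", "ATAT", "ACGT", "AGCT"]  # made-up reads
histogram = gc_histogram(reads)
for gc in sorted(histogram):
    print(f"{gc}% GC: {histogram[gc]} read(s)")
```

With real data, a second peak away from the genome-wide GC content would point to a contaminated or biased subset.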
Sequence duplication - FastQC
• Relative number of sequences with different degrees of duplication
• Low level duplication is expected for a diverse library
• High level duplication: enrichment bias, saturated sequencing depth
– Normal for RNA-seq (high sequencing depth) and ChIP-seq (enriched libraries)
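Duplication levels can be counted directly from the read sequences. A minimal sketch, assuming exact full-length matches over all reads; FastQC's own implementation differs in detail, and the function names are illustrative.

```python
from collections import Counter


def duplication_levels(sequences):
    """Map duplication level (copies per sequence) to number of distinct sequences."""
    copies_per_sequence = Counter(sequences)
    return Counter(copies_per_sequence.values())


def percent_remaining_after_dedup(sequences):
    """Share of reads left if every sequence were kept only once."""
    return 100.0 * len(set(sequences)) / len(sequences)


reads = ["AAAA", "AAAA", "AAAA", "CCCC", "GGGG", "GGGG"]  # made-up reads
levels = duplication_levels(reads)
remaining = percent_remaining_after_dedup(reads)
print(dict(levels), remaining)
```

A diverse library keeps most reads at duplication level 1; here half of the reads would remain after deduplication.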
Overrepresented sequences - FastQC
• Sequences that make up > 0.1% of the total
• Compare them with a contamination database to find contaminants (e.g. adaptor dimers)
• Can be normal and biologically meaningful
– Highly expressed transcripts
– High copy number repeats
– Less diverse library (amplicons)
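The 0.1% threshold can be applied with a simple count. A minimal sketch, assuming exact matching of whole reads; the function name and the placeholder sequence are illustrative and do not come from any contamination database.

```python
from collections import Counter
from itertools import product


def overrepresented(sequences, min_percent=0.1):
    """Return (sequence, count, percent) for sequences above the threshold."""
    total = len(sequences)
    hits = []
    for sequence, count in Counter(sequences).most_common():
        percent = 100.0 * count / total
        if percent > min_percent:
            hits.append((sequence, count, percent))
    return hits


# 1990 distinct 6-mers plus 10 copies of an arbitrary placeholder sequence
unique_reads = ["".join(bases) for bases in product("ACGT", repeat=6)][:1990]
reads = unique_reads + ["ACGTACGTACGT"] * 10
hits = overrepresented(reads)
print(hits)
```

Each distinct 6-mer accounts for 0.05% of the reads and stays below the threshold, while the repeated sequence reaches 0.5% and is reported.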
Adapter Content - FastQC
• Example: a 75bp insert (DNA/cDNA) sequenced with a 100bp read gives 75bp of insert followed by 25bp of Adaptor2
Comparative heatmap of per base phred score
• Reads in one sample / average of reads in all samples
• Shown for a sample with better quality than average and for a sample with lower quality than average
Fastqscreen – check for sample contamination
• Compare sequencing reads to databases of known sequences
• Report top matches
• In a clonal sample, uniquely mapped reads should come from only a single organism
• Databases: frequently sequenced organisms, rRNA genes (Silva), frequent contamination
Contamination Check
Sequencing data pre-processing tasks
• Trimming: remove bases from read end(s)
– Adaptor sequence
– Low quality bases
• Filtering: remove reads
– Low quality reads
– Contaminating sequences
– Low complexity reads (repeats)
– Short (<20bp) reads, which slow down mapping software
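Trimming and filtering can be illustrated with a deliberately simple rule. This is a minimal sketch, not the algorithm of any of the tools listed in this deck: it removes trailing bases below a quality cutoff and then drops reads shorter than 20bp. The cutoff of Q20 and the function names are assumptions.

```python
OFFSET = 33  # assumed Sanger / Illumina 1.9 encoding


def trim_low_quality_3prime(sequence, quality, min_quality=20):
    """Remove bases from the 3' end while their Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - OFFSET < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def passes_length_filter(sequence, min_length=20):
    """Keep only reads of at least min_length bases."""
    return len(sequence) >= min_length


sequence = "ACGT" * 6  # made-up 24bp read
kept_seq, kept_qual = trim_low_quality_3prime(sequence, "I" * 21 + "###")
short_seq, short_qual = trim_low_quality_3prime(sequence, "I" * 10 + "#" * 14)
print(len(kept_seq), passes_length_filter(kept_seq))
print(len(short_seq), passes_length_filter(short_seq))
```

The first read loses three low quality bases and is kept; the second is trimmed to 10bp and would be filtered out.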
Tools for pre-processing sequencing data
• PRINSEQ
– http://prinseq.sourceforge.net/
– Quality/hard trimming, quality filtering, reformat, ...
• FASTX
– http://hannonlab.cshl.edu/fastx_toolkit/
– Reformat, stats, collapse duplicated reads, trim, filter, reverse complement
• Trimmomatic
– http://www.usadellab.org/cms/?page=trimmomatic
– Adaptor trimming, quality trimming & filtering, ...
• FlexBar (FAR)
– http://sourceforge.net/projects/theflexibleadap/
– Flexible barcode detection and adapter removal
Summary
• Always generate quality plots for all data sets
• Interpreting the plots requires knowledge about the samples and libraries
• Trim and/or filter data if needed
– Always trim and filter away low quality data for variant analysis
Questions?
lucy.poveda@fgcz.ethz.ch
