TGCATAGGCTGAAGGGCTGCAGCAGCA
GGCATAGGATGCAGCATAAGCAGTTA
TTAGCATAGATGCAGCTTGGCAAGTA
GTAGCATAGATGCAGCTGGGCAACGAA
Next Generation sequencers
Illumina PacBio
Non-optical
Experimental Design
RNA
Sequencing
fastq
Differential Expression Analysis
Experimental Design
RNA
Sequencing
fastq
• Short read file format
• Quality control of short reads
Differential Expression Analysis
Properties of sequencing data
Single-end versus paired-end sequencing
• Single-end (SE) reads: Each library fragment is sequenced for one end only
– Cheaper and faster
• Paired-end (PE) reads: Each library fragment is sequenced from both ends
– Improves the mapping of a fragment to the reference genome, especially in
regions close to repeats
Cluster 1 > WT1_R1: TCAGTT…
+ Cluster 1 > WT1_R2: CTATCG…
Sequence data – fastq files
• File names
• Single end
– 20160413.A-APC_mut_1_R1.fastq.gz
– 20160413.A-APC_mut_2_R1.fastq.gz
• Paired end
– 20160413.A-APC_mut_1_R1.fastq.gz
– 20160413.A-APC_mut_1_R2.fastq.gz
– 20160413.A-APC_mut_2_R1.fastq.gz
– 20160413.A-APC_mut_2_R2.fastq.gz
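The naming scheme above can be used to work out which files belong together. The following is a minimal sketch, assuming every file ends in `_R1.fastq.gz` or `_R2.fastq.gz` as in the examples; the function names `group_by_sample` and `layout` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed convention (taken from the example names): <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
MATE_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq\.gz$")


def group_by_sample(file_names):
    """Map each sample name to the mate files found for it."""
    samples = defaultdict(dict)
    for name in file_names:
        match = MATE_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        samples[match["sample"]]["R" + match["mate"]] = name
    return dict(samples)


def layout(mates):
    """Return 'paired-end' when both R1 and R2 are present, else 'single-end'."""
    return "paired-end" if {"R1", "R2"}.issubset(mates) else "single-end"


files = [
    "20160413.A-APC_mut_1_R1.fastq.gz",
    "20160413.A-APC_mut_1_R2.fastq.gz",
    "20160413.A-APC_mut_2_R1.fastq.gz",
]
for sample, mates in group_by_sample(files).items():
    print(sample, layout(mates))
```

A sample with only an R1 file is reported as single-end; a missing R2 in a paired-end project would show up the same way, so the result is worth checking against the sequencing order.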
Fastq file format
1. Header line for Read (starts with “@” and the sequence ID)
2. Sequence
3. Header line for Qualities (starts with “+”)
4. Quality score
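The four-line layout can be read with a few lines of code. This is a minimal sketch, assuming each record occupies exactly four lines (no wrapped sequences); the record in `example` is made up for illustration and `read_fastq` is a hypothetical helper name.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a stream with 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or an empty line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Made-up record: header, sequence, "+" line, quality string
example = "@read1\nTCAGTT\n+\nIIIIJJ\n"
for read_id, sequence, quality in read_fastq(io.StringIO(example)):
    print(read_id, sequence, quality)
```

For a compressed file, the same function can be given a handle opened with `gzip.open(path, "rt")`.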
Phred scores
• Measure base calling accuracy
– Accuracy of assigning bases (nucleobases) to signal peaks
• P: error probability of a given base call
• Q = -10 · log10(P)
[Figure: Phred score in relation to signal intensity and signal resolution (i.e. signal/noise)]
• Current format
• Illumina 1.9 ( i.e. Sanger format)
• Phred scoring: 0-41
• Offset: 33
• Highest score: 41 + 33 = 74 (ASCII character "J")
• All current sequencers
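The relation between Q, P and the characters in the quality line can be checked with a short sketch. It assumes the offset of 33 stated above; the function names and the quality string are made up for illustration.

```python
import math

OFFSET = 33  # Sanger-style encoding, as stated on the slide


def phred_to_error_probability(q):
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=OFFSET):
    """Turn the quality line of a fastq record into a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]


print(decode_quality("!5?IJ"))  # → [0, 20, 30, 40, 41]
print(chr(41 + OFFSET))         # → J
for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

Q20 corresponds to an error probability of 1 in 100 and Q30 to 1 in 1000, which is why Q20 is often used as a threshold for acceptable quality.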
Millions of reads
– FastQC
– FastqScreen
FastQC
Boxplots, histograms, heatmaps
• FastQC uses boxplots and histograms to summarize Phred scores and GC
content
• Heatmaps can also be used. They are more visual, and colors can encode
additional information
Different scenarios
Library fragment: Adaptor1 | Insert DNA/cDNA | Adaptor2
• Read within the insert: 100bp (Insert DNA/cDNA)
• Read running into the adaptor: 75bp (Insert DNA/cDNA) + 25bp (Adaptor2)
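The second scenario, where a 100bp read runs past a 75bp insert into Adaptor2, can be illustrated with a small sketch. It assumes exact matching only (real trimming tools tolerate mismatches); the adaptor and insert sequences are placeholders chosen for illustration, and `trim_adaptor` is a hypothetical helper.

```python
def trim_adaptor(read, adaptor, min_overlap=3):
    """Cut the read where the adaptor starts (exact matches only)."""
    # Case 1: the complete adaptor lies inside the read
    start = read.find(adaptor)
    if start != -1:
        return read[:start]
    # Case 2: the read ends with the first bases of the adaptor
    longest = min(len(adaptor), len(read)) - 1
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adaptor[:overlap]):
            return read[:-overlap]
    return read  # no adaptor found, read is returned unchanged


ADAPTOR2 = "AGATCGGAAGAGCACACGTCTGAACTCCAG"  # placeholder sequence, 30 bases
insert_75 = "ACGTTGCAGT" * 7 + "ACGTT"        # made-up 75bp insert
insert_only = "ACGTTGCAGT" * 10               # made-up 100bp read within a long insert

# A 100bp read from the short fragment: 75bp insert followed by 25bp of adaptor
read_through = (insert_75 + ADAPTOR2)[:100]

print(len(trim_adaptor(read_through, ADAPTOR2)))  # → 75
print(len(trim_adaptor(insert_only, ADAPTOR2)))   # → 100
```

The `min_overlap` parameter is a design choice in this sketch: very short matches at the read end occur by chance, so they are ignored.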
Bias and errors
• Poor quality at the beginning – per tile sequence quality (median < Q20)
• Large variance – per sequence quality scores
Per sequence quality scores - FastQC
• Bi-modal/unusual distribution
• Contaminated/biased subset, e.g. adaptor dimers, rRNA etc.
Sequence duplication - FastQC
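The idea behind a duplication check can be sketched by counting identical sequences. This is an illustration only, not how FastQC computes its plot (FastQC estimates duplication from a subset of the file, so its numbers will differ); the reads are made up and `duplication_summary` is a hypothetical helper.

```python
from collections import Counter


def duplication_summary(sequences):
    """Count total and distinct sequences and the share of duplicated reads."""
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    percent_duplicated = 100 * (total - distinct) / total if total else 0.0
    return {
        "total": total,
        "distinct": distinct,
        "percent_duplicated": percent_duplicated,
    }


reads = ["ACGT", "ACGT", "ACGT", "TTGA", "GGCC"]  # made-up reads
print(duplication_summary(reads))
# → {'total': 5, 'distinct': 3, 'percent_duplicated': 40.0}
```

In RNA-seq, highly expressed transcripts produce many identical reads, so a high duplication level needs to be interpreted with the library type in mind.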
Comparative heatmap of per base Phred score
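The numbers behind such a heatmap are one row per sample and one column per read position. A minimal sketch of that matrix, assuming an offset of 33 and using made-up quality strings and sample names; positions with a median below Q20 are flagged.

```python
from statistics import median


def per_base_median(quality_strings, offset=33):
    """Median Phred score at each read position across the reads of one sample."""
    length = max(len(quality) for quality in quality_strings)
    medians = []
    for position in range(length):
        scores = [
            ord(quality[position]) - offset
            for quality in quality_strings
            if len(quality) > position  # shorter reads do not contribute here
        ]
        medians.append(median(scores))
    return medians


samples = {  # made-up quality lines, three reads per sample
    "sample_A": ["IIIII", "IIII5", "III55"],
    "sample_B": ["+5III", "+IIII", "55III"],
}
matrix = {name: per_base_median(qualities) for name, qualities in samples.items()}
for name, row in matrix.items():
    low = [position for position, score in enumerate(row) if score < 20]
    print(name, row, "positions below Q20:", low)
```

Each row of `matrix` could then be passed to a plotting library as one line of the heatmap.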
FastqScreen – check for sample contamination
• Compare sequencing reads to databases of known sequences
• Report top matches
• In a clonal sample, uniquely mapped reads should come from a single
organism
Contamination Check
Sequencing data pre-processing tasks
Tools for pre-processing sequencing data
• PRINSEQ
– http://prinseq.sourceforge.net/
– Quality/hard trimming, quality filtering, reformat, ...
• FASTX
– http://hannonlab.cshl.edu/fastx_toolkit/
– Reformat, stats, collapse duplicated reads, trim, filter, reverse
complement
• Trimmomatic
– http://www.usadellab.org/cms/?page=trimmomatic
– Adaptor trimming, quality trimming & filtering, ...
• FlexBar (FAR)
– http://sourceforge.net/projects/theflexibleadap/
– Flexible barcode detection and adapter removal
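What quality trimming and filtering do to a single read can be shown with a small sketch. It is not the implementation of any of the tools listed above; it assumes an offset of 33, and the thresholds, the read and the function names are made up for illustration.

```python
def trim_trailing(sequence, quality, min_q=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence, min_length=4):
    """Length filter: discard reads that became too short after trimming."""
    return len(sequence) >= min_length


sequence, quality = "ACGTACGT", "IIII5+##"  # made-up read, scores 40,40,40,40,20,10,2,2
trimmed_sequence, trimmed_quality = trim_trailing(sequence, quality)
print(trimmed_sequence, trimmed_quality)  # → ACGTA IIII5
print(keep_read(trimmed_sequence))        # → True
```

Sequence and quality are always cut at the same position so that the record stays valid. For paired-end data, a read removed by the filter also has to be removed from (or tracked against) its mate file.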
Summary
• Interpreting the plots requires knowledge of the samples and libraries
Questions?
• lucy.poveda@fgcz.ethz.ch