Read Mapping
Mapping locates reads in the reference genome
Mapping reads to the reference genome
• Mapping outcomes:
Introduction Read Mapping
Mapping millions of reads is challenging
Smith-Waterman algorithm
Smith-Waterman algorithm example: Where is GATTACA?
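Only the title of this example survives, so the following is a minimal sketch of Smith-Waterman local alignment rather than the slide's own worked matrix. The scoring scheme (match +2, mismatch -1, linear gap -1) and the reference string are assumptions chosen for illustration.

```python
def smith_waterman(query, ref, match=2, mismatch=-1, gap=-1):
    """Return (best_score, ref_start, ref_end) of the best local alignment.

    Coordinates are 0-based and end-exclusive on ref.
    The scores are illustrative assumptions, not values from the slides.
    """
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]  # scoring matrix, first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if query[i - 1] == ref[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a 0 cell or the matrix border is reached
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if query[i - 1] == ref[j - 1] else mismatch)
        if H[i][j] == diag:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


# Hypothetical reference containing one exact copy of the read
score, start, end = smith_waterman("GATTACA", "CCGGATTACATTG")
print(score, start, end)  # 14 3 10
```

The matrix has len(query) × len(ref) cells, so the cost grows with the product of read length and genome length.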
Smith-Waterman algorithm: limitation
How to map millions of short reads - index
• Index
– An alphabetical list of names, subjects, etc., with references to the places where they occur
– Sorted and structured to allow fast search
Hash table (seed) mappers
• Pros:
– Tolerant to mismatches
– Better at InDel detection
– More tolerant to sequence differences (cross-species mapping)
• Cons:
– High RAM needed for large genomes (50 GB for human)
– Slower
• Tools:
– Eland, SOAP, MAQ, SHRiMP, GSNAP, …
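The seed idea behind hash-table mappers can be sketched as follows. This is a simplified illustration, not how any of the listed tools is implemented: the function names, the seed length k = 4, and the toy sequences are assumptions.

```python
from collections import defaultdict


def build_kmer_index(ref, k):
    """Hash table: every k-mer of the reference -> list of start positions."""
    index = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        index[ref[pos:pos + k]].append(pos)
    return index


def seed_candidates(read, index, k):
    """Look up every k-mer (seed) of the read and vote for read start positions.

    Returns (candidate_start, votes) pairs, best supported first.
    """
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1  # where the read would start on the reference
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


ref = "ATCGTGTAACCTGACTAGTTAA"   # toy reference
read = "GTGTAAGCTGAC"            # ref[3:15] with one mismatch (C -> G)
index = build_kmer_index(ref, k=4)
print(seed_candidates(read, index, k=4))  # [(3, 5)]
```

Seeds that overlap the mismatch find nothing, but the remaining seeds still agree on position 3, which is why this approach tolerates mismatches. The index stores a position for every k-mer of the genome, which is where the memory cost comes from.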
Burrows-Wheeler Transform
• Sort the genome and index it (BWT)
• Align the read base by base to find its positions in the genome
At each step:
• The search suffix grows by one character
• The range in the BWT shrinks or stays the same
• Cons:
– Performance decreases with distant matches
– Reads with a higher error rate
– Samples too distant from the reference genome
• Tools:
– STAR, BWA, bowtie, bowtie2, SOAP2, …
Choosing an Aligner
Un-spliced vs. spliced mappers
Spliced mappers: Exon first
“… there is no tool that outperforms all of the others in all the tests. Therefore, the end user should clearly specify his needs in order to choose the tool that provides the best results.”
- Hatem et al., BMC Bioinformatics 2013, 14:184
Sequence Alignment/Map (SAM) files
SAM/BAM Header
SAM/BAM read information
SAM Format Read information
SAM Format FLAG Values
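The FLAG field is a sum of bit values, so a single number encodes several yes/no properties of a read. The sketch below decodes a FLAG using the bit assignments from the SAM specification; the short descriptions and the function name are my own wording.

```python
# Bit values as defined in the SAM specification; descriptions are paraphrased
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


for flag in (99, 147, 4):
    print(flag, decode_flag(flag))
```

For example, 99 = 1 + 2 + 32 + 64: a paired read, mapped in a proper pair, mate on the reverse strand, first in pair.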
Mapping quality score
• Phred scaled
– mapP = probability that a read is mapped incorrectly
– mapQ = -10 log10(mapP)
– Q30: 1 in 1000 chance that the mapping is incorrect
– Ranges from 0 to 254
– The higher, the better
– 255: mapping quality not available (STAR, however, assigns 255 to uniquely mapped reads)
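The Phred scaling can be checked with a few lines of code. This is only the conversion formula; each aligner estimates the underlying probability with its own heuristics, and the rounding and the cap at 254 used here are assumptions for illustration.

```python
import math


def mapq_from_error_prob(p_wrong, cap=254):
    """Phred-scale the probability that the mapping is wrong: -10 * log10(p)."""
    if p_wrong <= 0:
        return cap  # a probability of 0 would give an infinite score
    return min(cap, round(-10 * math.log10(p_wrong)))


def error_prob_from_mapq(mapq):
    """Inverse conversion: probability that the mapping is wrong."""
    return 10 ** (-mapq / 10)


for p in (0.1, 0.01, 0.001):
    print(p, mapq_from_error_prob(p))   # 10, 20, 30
print(error_prob_from_mapq(30))          # about 0.001, i.e. 1 in 1000
```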
Meaning of mapQ30
• The read has only one or a few 'good' hits on the reference -> the best alignment can be easily identified
• The best alignment has few mismatches -> these are actual mutations or sequencing errors
Causes of poor mapQ
• Poor-quality reads (low base quality -> low mapping quality)
• Single-end instead of paired-end reads (reads mapped in pairs -> more likely to be correct)
• Poor choice of mapping software (an algorithm with low sensitivity -> more mapping errors)
• Improper alignment parameters
Why QC mapping results
Mapping QC metrics: Summary stats
• Summary statistics
– % reads with no alignment
– % reads with unique alignment
– % reads with multiple alignments
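These three percentages can be computed directly from SAM records. The sketch below is one possible way to do it and rests on two assumptions: the aligner writes the optional NH:i tag (number of reported alignments for the read), and a mapped record without NH is counted as unique. Secondary and supplementary records are skipped so that each read is counted once; with paired-end data each mate counts as one read.

```python
def mapping_summary(sam_lines):
    """Percentage of unmapped, uniquely mapped and multi-mapped reads."""
    counts = {"unmapped": 0, "unique": 0, "multiple": 0}
    for line in sam_lines:
        if line.startswith("@"):            # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:    # secondary or supplementary record
            continue
        if flag & 0x4:                      # read unmapped
            counts["unmapped"] += 1
            continue
        hits = 1                            # assumption: no NH tag means unique
        for tag in fields[11:]:             # optional fields follow the 11 mandatory ones
            if tag.startswith("NH:i:"):
                hits = int(tag[5:])
        counts["unique" if hits == 1 else "multiple"] += 1
    total = sum(counts.values())
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}


# Hypothetical records, for illustration only
records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",
    "r3\t256\tchr2\t300\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",
    "r4\t16\tchr1\t400\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
]
print(mapping_summary(records))
```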
Sample degradation check: Transcript Coverage Bias
(Figure panels: Good vs. Suboptimal; Good vs. Bad)
RNA-seq mapping QC tools
• RNA-SeQC
– https://confluence.broadinstitute.org/display/CGATools/RNA-SeQC
• EVER-seq (RSeQC)
– http://code.google.com/p/rseqc/
• Qualimap
– http://qualimap.bioinfo.cipf.es/
How to visualize mapping results
Mapped reads in IGV
(Figure labels: reference track, loaded reads, BAM file, annotation track, splice junctions)
SNPs
Single base deletions
Single base insertions
Sashimi plots
(Figure labels: number of split reads; different colors differentiate samples)
Interactive mapping QC using IGV
Read mapping: Summary
Extra Slides, Read Mapping
Burrows-Wheeler Transform (BWT) mappers: Index
reference: acaaca$
Rotations of the reference:
acaaca$
caaca$a
aaca$ac
aca$aca
ca$acaa
a$acaac
$acaaca
Sorted rotations:
$acaaca
a$acaac
aaca$ac
aca$aca
acaaca$
ca$acaa
caaca$a
index (last column of the sorted rotations): acca$aa
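The construction shown on these slides (list all rotations, sort them, keep the last column) can be written directly. This naive sketch is for illustration on short strings only; real mappers build the index with suffix arrays and compressed structures instead of materializing every rotation. The function names are my own.

```python
def bwt_matrix(text):
    """All rotations of text, sorted. text must end with the sentinel '$'."""
    if not text.endswith("$"):
        raise ValueError("text must end with the sentinel '$'")
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)  # '$' sorts before the letters


def bwt(text):
    """The BWT is the last column of the sorted rotation matrix."""
    return "".join(row[-1] for row in bwt_matrix(text))


for row in bwt_matrix("acaaca$"):
    print(row)
print(bwt("acaaca$"))  # acca$aa
```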
BWT mappers: Find T knowing BWT(T)
• We give each character a rank, equal to the number of times the character has appeared so far
a1 c1 a2 a3 c2 a4 $
BWT matrix    F    L
$acaaca       $    a1
a$acaac       a1   c1
aaca$ac       a2   c2
aca$aca       a3   a2
acaaca$       a4   $
ca$acaa       c1   a3
caaca$a       c2   a4
Find T knowing BWT(T), one character per step:
T = $
T = a$
T = ca$
T = aca$
T = aaca$
T = caaca$
T = acaaca$
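The step-by-step recovery of T can be sketched as a walk between the L and F columns: the character with rank r in L is the same text character as the one with rank r in F. The code uses 0-based ranks, whereas the slides count from 1; the function name is an assumption.

```python
def inverse_bwt(last):
    """Rebuild T from its BWT (the L column), assuming a single '$' sentinel."""
    # Rank of each character in L: how many times it occurred earlier in L
    seen = {}
    ranks = []
    for ch in last:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # F is L sorted, so each character occupies one contiguous block of rows
    first_row = {}
    row = 0
    for ch in sorted(seen):
        first_row[ch] = row
        row += seen[ch]

    # Start in row 0 (the rotation beginning with '$') and prepend L each step
    text = "$"
    row = 0
    while last[row] != "$":
        ch = last[row]
        text = ch + text
        row = first_row[ch] + ranks[row]  # jump to the same character in F
    return text


print(inverse_bwt("acca$aa"))  # acaaca$
```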
BWT mappers: Example
reference: acaaca
read: aca
F                 L
$    a c a a c    a4
a4   $ a c a a    c2
a2   a c a $ a    c1
a3   c a $ a c    a2    <- the read position is found
a1   c a a c a    $     <- the read position is found
c2   a $ a c a    a3
c1   a a c a $    a1
read: cca
• The rows starting with ca (c2 and c1) have a3 and a1 in L: no c! The read is not found.
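The search in this example processes the read from its last character to its first and keeps a range of rows in the sorted matrix. The sketch below follows that idea with a deliberately simple index: occurrence counts are recomputed by scanning L, whereas real BWT mappers precompute them. All names are assumptions for illustration.

```python
def fm_index(text):
    """Suffix array, L column and C table for a text ending in '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last = [text[i - 1] for i in sa]       # character preceding each suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0                       # C[ch]: first row starting with ch
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    return sa, last, C


def backward_search(read, text):
    """Return the sorted 0-based positions of read in text, or [] if absent."""
    sa, last, C = fm_index(text)
    lo, hi = 0, len(text)                  # half-open row range [lo, hi)
    for ch in reversed(read):              # the search suffix grows by one character
        if ch not in C:
            return []
        lo = C[ch] + last[:lo].count(ch)
        hi = C[ch] + last[:hi].count(ch)
        if lo >= hi:                       # empty range: read not found
            return []
    return sorted(sa[lo:hi])


print(backward_search("aca", "acaaca$"))  # [0, 3]
print(backward_search("cca", "acaaca$"))  # []
```

For aca the range covers 4 rows, then 2, then 2. For cca the last step finds no c in L within the ca rows, so the range becomes empty.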
BWT mappers: Bowtie
• A mismatch in the search suffix -> empty BWT range/failed index lookup
• Sequencing error (Illumina: about 1 in 1000 bases)
• True mutation (A. thaliana: 7 × 10^-9 per site per generation)
• Mismatches are not rare events (at least 10% of reads longer than 100 nt contain one)
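The 10% figure follows from the per-base error rate if errors are assumed to be independent between bases (a simplifying assumption made here, not stated on the slide): the chance that a read of length n contains at least one error is 1 - (1 - e)^n.

```python
def prob_read_has_error(read_length, per_base_error=1e-3):
    """Probability of at least one sequencing error, assuming independent bases."""
    return 1.0 - (1.0 - per_base_error) ** read_length


for n in (50, 100, 150):
    print(n, round(prob_read_has_error(n), 3))
# 100 nt at 1/1000 per base gives roughly 0.095, i.e. close to 10% of reads
```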
Slides courtesy of Ben Langmead (langmead@umiacs.umd.edu)
BWT mappers: Bowtie
• GSNAP, PALMA
REF:   ATCGTGTAACCTGACTAGTTAA
READ:  gggGTGTAACC-GACTAGgggg
CIGAR: 3H8M1D6M4H
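A CIGAR string is a run-length encoding of the alignment columns. The sketch below derives one from the gapped REF/READ pair above and computes how many reference bases the alignment covers. It assumes the slide's notation: lowercase read bases are clipped (encoded as H, as on the slide), '-' in the read is a deletion, and '-' in the reference is an insertion. Function names are my own.

```python
import re
from itertools import groupby


def cigar_from_alignment(ref, read, clip_op="H"):
    """Run-length encode a gapped pairwise alignment into a CIGAR string."""
    ops = []
    for r, q in zip(ref, read):
        if q == "-":
            ops.append("D")        # base present in the reference only
        elif r == "-":
            ops.append("I")        # base present in the read only
        elif q.islower():
            ops.append(clip_op)    # clipped read base (slide convention)
        else:
            ops.append("M")        # aligned column (match or mismatch)
    return "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))


def reference_span(cigar):
    """Number of reference bases covered: only M, D, N, = and X consume the reference."""
    parts = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in parts if op in "MDN=X")


REF = "ATCGTGTAACCTGACTAGTTAA"
READ = "gggGTGTAACC-GACTAGgggg"
cigar = cigar_from_alignment(REF, READ)
print(cigar, reference_span(cigar))
```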
Mapping QC metrics: Transcript coverage
Mapping QC metrics: Transcript coverage bias
Mapping QC metrics: Junction saturation
Splice junction track
• Computed dynamically from alignment data (must be from a spliced aligner)
(Figure labels: junctions from the - strand; junctions from the + strand)
Large indels and inter-chromosomal rearrangements
Inversion, duplication, translocation
(Figure panels: normal, inversion, duplication/translocation)