3 RNAseq-Mapping LO

RNA-seq Bioinformatics:
Read Mapping
Lennart Opitz, FGCZ

RNA-Seq Course 2020
Introduction Read Mapping
• General idea of mapping
• Different strategies for mapping
• Mapping file formats
• Mapping quality control
• Visualization of mapping data

Workflow: Experimental Design -> RNA Sequencing (fastq) -> Data Quality Control (fastq) -> Read Mapping (Reference Genome, fasta; output SAM/BAM) -> Read Quantification (Reference Transcriptome, GFF/GTF; output counts) -> Differential Expression Analysis
Mapping is to locate reads in the reference genome
• Essential step when a reference genome is used

• Many NGS applications:
– Quantity – expression quantification, differential analysis (RNA-Seq)
– Base differences – SNPs, SVs (DNA-Seq)
– Patterns – protein binding sites, methylation (ChIP-Seq)
3
Mapping reads to the reference genome
• Mapping outcomes:
• Mapping outcomes: Spliced alignments
6
• Mapping outcomes: Uniquely mapped reads vs multi-reads
– Uniquely mapped reads
– Multi-reads
- Repeats, paralogs
- Shared exons in transcriptome

Mapping of RNA-seq reads
• Reference (fasta format)

– Genome
– Transcriptome
– Junction library
• When mapped to the genome, RNA-Seq reads can span junctions

– Spliced alignments
– Example:
- Mouse retina 60 nt paired reads
- 41 of 91 million reads map to junctions

Mapping millions of reads is challenging
• Quantity
– 20-30 M reads per sample
• Span junctions
– Exon-exon
– Structural variations
• Mapping uncertainties
– Short reads
– Sequencing errors, SNPs, structural variations
– No exact reference
Ø Mapping algorithm must be
– Fast
– Able to handle SNPs, indels, sequencing errors
– Able to allow for introns when aligning to the reference genome
Smith-Waterman algorithm
• Exhaustive search to find the best local alignment
11
Smith-Waterman algorithm example: Where is GATTACA?
12
13
14
15
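The exhaustive search can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any mapper: the scoring values (match 2, mismatch -1, gap -2) and the example reference string are assumptions chosen for the demonstration.

```python
def smith_waterman(query, ref, match=2, mismatch=-1, gap=-2):
    """Return (best_score, ref_start, ref_end) of the best local alignment.

    Scores are illustrative assumptions; ref[ref_start:ref_end] is the
    reference segment covered by the alignment (0-based, end exclusive).
    """
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]  # dynamic programming matrix
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # match or mismatch
                          H[i - 1][j] + gap,    # gap in the reference
                          H[i][j - 1] + gap)    # gap in the query
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if query[i - 1] == ref[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


# Hypothetical reference containing GATTACA once.
score, start, end = smith_waterman("GATTACA", "TTTGATTACATTT")
print(score, start, end)
```

Every cell of the query-by-reference matrix is filled, which is why the approach is sensitive but does not scale to millions of reads against a genome.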
Smith-Waterman algorithm: limitation
• Too slow: one CPU day per million reads
• High sensitivity not needed

– Only exact and close to exact matches
16
How to map millions of short reads - index
• Index
– An alphabetical list of names, subjects, etc., with references to the places where they occur
– Sorted, structured, allows fast search
• NGS mapping software

– Index reference file
– Index lookup, not direct sequence comparison
– Two main categories:
- Hash table mappers
- Burrows-Wheeler Transform (BWT) mappers
17
Hash table (seed) mappers
• Cut reference into small “seeds” (k-mers)

• Store seeds and their positions in a lookup table (hash index)
• Take part of the read and look up the hash table
• Extend seeds to full alignments

– Smith-Waterman
– Sensitive but slow
• Shorter seeds -> higher sensitivity, but slower
Flicek P. and Birney E. 2009. Nature Methods Supplement
Li H. and Homer N. 2010. Briefings in Bioinformatics
Garber M. et al. 2011. Nature Methods
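The seed lookup idea can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the toy reference and k = 4 are hypothetical, and the extension step (Smith-Waterman around each candidate) is left out.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Hash index: every k-mer of the reference -> list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Look up each k-mer of the read and vote for the implied read start.

    Returns (start, votes) pairs, best supported first. A real mapper would
    then extend the top candidates to full alignments.
    """
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy reference (assumption for illustration only).
reference = "TTGACCATGGCATAAGCT"
index = build_seed_index(reference, k=4)
print(candidate_positions("CATGGCAT", index, k=4))  # exact read
print(candidate_positions("CATGGCAA", index, k=4))  # read with one mismatch
```

A mismatch only removes the seeds that overlap it, so the remaining seeds still point to the right location. Shorter seeds survive more mismatches but hit more places in the reference, which is the sensitivity versus speed trade-off named above.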
18
Hash table (seed) mappers: Features and tools
• Pros:
Ø Tolerant to mismatches
Ø Better at InDel detection
Ø More tolerant to sequence differences (Cross species mapping)
• Cons:
Ø High RAM needed for large genomes (50 GB for human)
Ø Slower
• Tools:
– Eland, SOAP, MAQ, SHRiMP, GSNAP, …
19
Burrows-Wheeler Transform
• Sort genome and index (BWT)
• Align read base by base to find positions in the genome
Trapnell & Salzberg (2009) Nature Biotech 27, 455-457
20
BWT mappers: Bowtie
Slides courtesy of Ben Langmead (langmead@umiacs.umd.edu)

BWT mappers: Bowtie
• Backward search exact matching

• Searched suffix appears consecutively in BWT

BWT mappers: Bowtie
At each step:
• Searching suffix grows by one character
• The size of the range in BWT shrinks/remains the same

BWT mappers: Features and tools
• Pros:
Ø Use BWT(ref) as the index
Ø Little RAM is needed (2-8 GB for human genome)
Ø Fast for searching perfect or close matches
• Cons:
Ø Performance decreases with distant matches
– Reads with higher error rate
– Samples that are too distant from the reference genome
• Most NGS mapping software uses this indexing method
• Tools:
– STAR, BWA, bowtie, bowtie2, SOAP2, …
Choosing an Aligner
• For RNA-Seq applications, the BWT approach is mainly used
Un-spliced vs. spliced mappers
• Un-spliced mappers (DNA-Seq)

– Align continuous reads (not containing gaps as a result of splicing)
– When encountering an intron, the aligner stops and trims the rest of the read
– BWA, Bowtie2, BLAST
– Splice junctions are impossible to detect
• Spliced mappers (RNA-Seq)

– Aware of the presence of introns
– When encountering an intron, the aligner does not stop to trim the rest of the read but continues to find the next exon
– TopHat, STAR, GMAP, BLAT
29
Spliced mappers: Exon first
• Exonic reads are aligned first, the remaining reads are divided into smaller pieces and then mapped to the genome
• Fast, requires fewer computational resources
• Bias towards un-spliced alignments
• MapSplice, SpliceMap, TopHat
Garber M. et al. 2011. Nature Methods 8(6):469-477

Spliced mappers: STAR
• Ultra fast (does BWT as much as possible)
• Starts with BWT until a mismatch is encountered, then stores the possible alignment candidates and starts a new BWT search from scratch instead of doing a Smith-Waterman search
• MMP = Maximal Mappable Prefix
• Error tolerant (sensitive)
• Both short and long reads

Dobin A. et al. 2012. Bioinformatics. doi: 10.1093/bioinformatics/bts635
Choosing an alignment tool…
• Default options may not be best
“… there is no tool that outperforms all of the others in all the tests. Therefore, the end user should clearly specify his needs in order to choose the tool that provides the best results.” - Hatem et al. BMC Bioinformatics 2013, 14:184
32

Sequence / Alignment (SAM) files
• SAM (Sequence Alignment/Map)

– Single unified format for storing read alignments to a reference genome
– Large plain text file
– RNA-Seq: >10 GB
– Exome: >50 GB, Whole Genome: 0.8-1 TB
• BAM (Binary Alignment/Map)

– Binary equivalent of SAM
– Compressed data plus index (bai)
– Developed for fast processing/indexing
– RNA-Seq: >2 GB
– Exome: 2-10 GB, Whole Genome: 100-300 GB
34
SAM/BAM Header
35
SAM/BAM read information
36
SAM Format Read information
37
SAM Format Read Information
38
SAM Format FLAG Values
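The FLAG field is a bitwise combination of the values defined in the SAM specification. The small decoder below is a sketch for reading a FLAG by hand; the function name and the wording of the descriptions are my own, only the bit values come from the specification.

```python
# Bit values as defined in the SAM specification; descriptions paraphrased.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


for flag in (99, 147, 4):
    print(flag, decode_flag(flag))
```

For example, 99 = 1 + 2 + 32 + 64, so it describes the first read of a properly mapped pair whose mate lies on the reverse strand.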
39
Mapping quality score
• A score that indicates how well the read is aligned

• Probability that a read is mapped incorrectly
• Phred scaled
– mapP = probability that a read is mapped incorrectly
– mapQ = -10 · log10(mapP)
– Q30: 1 in 1000 incorrect
– Ranges from 0 to 254
– The higher, the better
– 255: mapping quality not available; STAR uses it for unique alignments
• Mapper dependent, difficult to compute
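The Phred scaling can be checked with a short conversion in both directions. This is only the arithmetic of the definition above; how a mapper estimates the error probability in the first place is tool specific and not shown. The function names are hypothetical.

```python
import math


def mapq_from_error_prob(p_wrong):
    """Phred-scale a mapping error probability: mapQ = -10 * log10(p)."""
    return round(-10 * math.log10(p_wrong))


def error_prob_from_mapq(mapq):
    """Inverse conversion: probability that the mapping is wrong."""
    return 10 ** (-mapq / 10)


print(mapq_from_error_prob(0.001))  # Q30 corresponds to 1 in 1000 incorrect
print(error_prob_from_mapq(20))     # Q20 corresponds to 1 in 100 incorrect
```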
40
Meaning of mapQ30
• The overall base quality of the read is good.
• The read has few or just one 'good' hit on the reference -> best alignment can be easily identified
• The best alignment has few mismatches -> actual mutations or sequencing errors.
• mapQ30 is usually required for SNP calling algorithms and detection of structural variations
41
Cause of poor mapQ
• Poor quality reads (Low base quality -> low mapping quality)
• Paired end reads or not (Reads mapped in pairs -> more likely to be correct)
• Reference with poor quality

• High divergence between the sequenced population and reference
• Repeats (Reads in repetitive regions -> very low mapping quality)
• Poor choice of the mapping software (An algorithm with low sensitivity -> more mapping errors)
• Improper alignment parameters
42

Why QC mapping results
• To check whether the data pre-processing and mapping were done properly
– e.g. an extremely low mapping rate for within-species read mapping indicates a problem
• To check whether the alignment results fit expected outcomes
– Abundance of genomic features
• To capture technical issues
– Sample degradation
– Over-amplification issues (duplication rate)
44
Mapping QC metrics: Summary stats
• How well did reads align to the reference?
• Summary statistics
– % reads with no alignment
– % reads with unique alignment
– % reads with multiple alignments
• We aim for >70% unique alignments
• Samples with lower mapping rates should be investigated (contamination vs. quality issues)
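The summary statistics are simple percentages over the three read classes. The sketch below assumes the three counts have already been taken from the mapper's log; the function name, the dictionary keys and the example counts are hypothetical, and only the 70% guideline comes from the slide.

```python
def mapping_summary(n_unique, n_multi, n_unmapped, min_unique_pct=70.0):
    """Percentages of unique, multi-mapped and unmapped reads for one sample.

    'needs_review' is True when the unique rate does not exceed the
    guideline (70% by default).
    """
    total = n_unique + n_multi + n_unmapped
    if total == 0:
        raise ValueError("no reads counted")
    summary = {
        "pct_unique": 100.0 * n_unique / total,
        "pct_multi": 100.0 * n_multi / total,
        "pct_unmapped": 100.0 * n_unmapped / total,
    }
    summary["needs_review"] = summary["pct_unique"] <= min_unique_pct
    return summary


# Made-up counts for two samples.
print(mapping_summary(8000, 1500, 500))
print(mapping_summary(5000, 3000, 2000))
```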
Read position specific error rate
Good Suboptimal
Ø Hard trimming of first 2-3 bases

46
Mapping QC metrics: Abundance of genomic features
• Relative abundance of annotation features: intron, exon, up/down stream, unannotated
• Are there samples with different abundance of certain features?
47
Sample degradation check: Transcript Coverage Bias
Good Suboptimal
Ø Exclude degraded sample

48
Overamplification check: Duplication rate QC
Good Bad
Ø Repeat library prep

49
RNA-seq mapping QC tools
• RNA-SeQC
– https://confluence.broadinstitute.org/display/CGATools/RNA-SeQC
• EVER-seq (RSeQC)
– http://code.google.com/p/rseqc/
• Qualimap
– http://qualimap.bioinfo.cipf.es/
50

How to visualize mapping results
• Using a genome browser

– Software that enables users to browse multiple data types and annotations in the context of the genome
• UCSC and Ensembl browser

– Reference genome should be hosted on the UCSC/Ensembl server
– Upload customized data files or import via URL
• Integrative Genomics Viewer (IGV) (Broad institute)

– Open source
– Well maintained, actively developed
– Platform independent, easy to use
52
Mapped reads in IGV
[Figure: IGV screenshot with labels: reference track, loaded BAM file, reads, splice junctions, annotation track]
SNPs
54
Single base deletions
55
Single base insertions
56
Sashimi plots
• Quantitatively visualize splice junctions
[Figure: sashimi plot showing coverage and splice junctions labelled with the number of split reads; different colors differentiate samples]
Interactive mapping QC using IGV
• Are my data behaving as expected

– No reads mapped to the knock-out site
58
Read mapping: Summary
• Different mapping tasks
– No splicing: general mappers
– Splicing: spliced mappers
• General mappers
– Index lookup instead of direct sequence comparison
– Hash table indexing – seed methods
– Burrows-Wheeler transform methods (Suffix/prefix trees)
• Spliced mappers
– Build upon general mappers
• Explore the mapping results to spot unexpected trends early
– Mapping QC tools
– IGV
Extra Slides, Read Mapping
60
Burrows-Wheeler Transform (BWT) mappers: Index
• Reference with end marker: acaaca$
• All rotations of the reference:
acaaca$
caaca$a
aaca$ac
aca$aca
ca$acaa
a$acaac
$acaaca
• Sorted rotations:
$acaaca
a$acaac
aaca$ac
aca$aca
acaaca$
ca$acaa
caaca$a
• Index = last column of the sorted rotations: acca$aa
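The index construction for the toy reference can be reproduced with a naive sketch. Sorting all rotations is fine for a 7-character string but is not how genome-scale indexes are built; the function name is hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive, small inputs only).

    The text must end with the unique end marker '$', which sorts before
    the letters a, c, g, t.
    """
    if not text.endswith("$") or text.count("$") != 1:
        raise ValueError("text must end with a single '$' end marker")
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)  # last column


print(bwt("acaaca$"))
```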
BWT mappers: Find T knowing BWT(T)
• We give each character a rank, equal to the number of times the character has appeared so far
a1 c1 a2 a3 c2 a4 $
• Transform produces a BWT matrix
• F = first column, L = last column; in the table below the ranks count occurrences from top to bottom within each column

BWT matrix   F   L
$acaaca      $   a1
a$acaac      a1  c1
aaca$ac      a2  c2
aca$aca      a3  a2
acaaca$      a4  $
ca$acaa      c1  a3
caaca$a      c2  a4

BWT matrices have a property called the Last First (LF) Mapping: the ith occurrence of character c in the last column corresponds to the same text character as the ith occurrence of c in the first column.

Find T knowing BWT(T): start in the row whose F is $, prepend its L character to T, jump to the row whose F carries the same character and rank, and repeat until L is $.
T= $
T= a$
T= ca$
T= aca$
T= aaca$
T= caaca$
T= acaaca$
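The reconstruction of T from BWT(T) follows directly from the LF mapping. The sketch below ranks characters by their order of occurrence in the last column and walks from the row starting with $; names are hypothetical and the code is meant for small examples only.

```python
def inverse_bwt(last):
    """Rebuild the text (including '$') from its BWT using the LF mapping."""
    # Rank of each character within the last column (0-based occurrence count).
    seen, ranks = {}, []
    for ch in last:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Row at which each character's block starts in the sorted first column.
    first_row, row = {}, 0
    for ch in sorted(seen):
        first_row[ch] = row
        row += seen[ch]
    # Start in the row whose first character is '$' (row 0) and walk backwards.
    text, row = "$", 0
    while last[row] != "$":
        ch = last[row]
        text = ch + text                   # prepend the preceding character
        row = first_row[ch] + ranks[row]   # LF mapping: same rank in column F
    return text


print(inverse_bwt("acca$aa"))
```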
BWT mappers: Example
reference: acaaca
BWT matrix (F = first column, L = last column; the ranks give each character's occurrence in the reference):
F            L
$  a c a a c a4
a4 $ a c a a c2
a2 a c a $ a c1
a3 c a $ a c a2
a1 c a a c a $
c2 a $ a c a a3
c1 a a c a $ a1

read: aca (matched backwards, one character per step)
• a: rows with F = a4, a2, a3, a1
• c: of these, rows a4 and a2 have L = c2, c1 -> rows with F = c2, c1
• a: rows c2 and c1 have L = a3, a1 -> rows with F = a3, a1
• The read position is found: the read starts at a3 and at a1 in the reference

read: cca
• a: rows with F = a4, a2, a3, a1
• c: rows with F = c2, c1
• c: rows c2 and c1 have L = a3, a1 -> no C! The range is empty and the lookup fails
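The backward search from the example can be written compactly. This is a teaching sketch under simplifying assumptions: it builds a naive suffix array to report positions and counts characters by scanning the BWT, whereas real BWT mappers use precomputed, compressed tables. The function name is hypothetical.

```python
def find_exact(reference, read):
    """Return the sorted 0-based start positions of exact matches of read."""
    text = reference + "$"                                 # '$' sorts first
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    last = "".join(text[i - 1] for i in sa)                # BWT (last column)
    # Row at which each character's block starts in the first column.
    first_row, row = {}, 0
    for ch in sorted(set(text)):
        first_row[ch] = row
        row += text.count(ch)
    lo, hi = 0, len(text)               # current row range, hi exclusive
    for ch in reversed(read):           # the search suffix grows by one character
        if ch not in first_row:
            return []
        lo = first_row[ch] + last[:lo].count(ch)
        hi = first_row[ch] + last[:hi].count(ch)
        if lo >= hi:                    # empty BWT range: lookup failed
            return []
    return sorted(sa[lo:hi])


print(find_exact("acaaca", "aca"))  # read found twice
print(find_exact("acaaca", "cca"))  # no C precedes "ca": empty range
```

At each step the row range can only shrink or stay the same, which is what makes the lookup fast for exact and near-exact matches.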
BWT mappers: Bowtie
• A mismatch in the search suffix -> empty BWT range/failed index lookup
• Sequencing error (Illumina 1/1000)
• True mutation (A. thaliana 7 × 10^-9 per site per generation)
• Mismatches are not rare events (at least 10% of >100 nt reads)
BWT mappers: Bowtie
• Empty BWT range activates backtracking
• All different bases are tried at the mismatched position
• Chop reads into short segments (seeds), align those mismatch-free, stitch seed alignments together
Spliced mappers: Seed extend
• Each read is divided into k-mers which are mapped to the genome using table lookup. Mapped k-mers are extended into larger alignments which may include gaps flanked by splice sites.
• Little bias, best placement of each read
• More tolerant to sequence differences
• GSNAP, PALMA
Garber M. et al. 2011. Nature Methods 8(6):469-477

An example of read mapping
Slide modified from Heng Li’s presentation 88

Corresponding SAM file
Columns: QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS ISIZE SEQ QUAL
Slide modified from Heng Li’s presentation
CIGAR string - compact representation of an alignment
• M - match or mismatch
• I - insertion
• D - deletion
• S - soft clip
– Clipped sequences stored in SAM
• H - hard clip
– Clipped sequences not stored in SAM
• N - skipped reference bases, splicing

Match/mismatch, indels
Ref:   ACGCAGTG--GT
Read:  ATGCA-TGCAGT
Cigar: 5M1D2M2I2M

Soft clipping
REF:   ATCGTGTAACCTGACTAGTTAA
READ:  gggGTGTAACC-GACTAGgggg
Cigar: 3S8M1D6M4S

Hard clipping
REF:   ATCGTGTAACCTGACTAGTTAA
READ:  gggGTGTAACC-GACTAGgggg
Cigar: 3H8M1D6M4H
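A CIGAR string can be taken apart with a regular expression, which makes it easy to check how many reference and read bases an alignment covers. The helper names below are hypothetical; which operations consume reference or read bases follows the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Read bases stored in SEQ: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("5M1D2M2I2M"))
print(reference_span("5M1D2M2I2M"), read_length("5M1D2M2I2M"))
print(reference_span("50M1000N50M"), read_length("50M1000N50M"))  # spliced read
```

For 5M1D2M2I2M the span is 10 reference bases for an 11-base read, matching the match/mismatch example above (reference ACGCAGTGGT, read ATGCATGCAGT).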
Mapping QC metrics: Transcript coverage
• How many transcripts are expressed/unexpressed

• Length coverage of expressed transcripts
• A third mode in this distribution would indicate that there are genes covered only partially
Mapping QC metrics: Transcript coverage bias
• Is any part of the transcripts under-covered?
[Figure panels: 3’ bias, no bias, 5’ bias]
Mapping QC metrics: Junction saturation
• Depth needed for alternative splicing analysis

• All annotated splice junctions are detected - a saturated RNA-seq dataset
94
Splice junction track
• Computed dynamically from the alignment data (must be from a spliced aligner)
• A splicing event is drawn when at least a single read splits across two exons
• The height and thickness of the arc are proportional to the coverage depth
[Figure labels: junctions from the – strand, junctions from the + strand]
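How a junction track can be derived from spliced alignments is sketched below: every N operation in a CIGAR string marks an intron, and the number of reads supporting the same intron gives the depth of the arc. This is an illustration, not IGV's implementation; the function names and the three example alignments are made up.

```python
import re
from collections import Counter


def junctions(pos, cigar):
    """Yield (intron_start, intron_end), 1-based inclusive, for each N operation.

    pos is the 1-based leftmost mapping position from the SAM POS column.
    """
    ref = pos
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op == "N":
            yield (ref, ref + n - 1)
        if op in "MDN=X":          # operations that advance along the reference
            ref += n


def count_junction_reads(alignments):
    """Count split reads per junction from (chrom, pos, cigar) tuples."""
    counts = Counter()
    for chrom, pos, cigar in alignments:
        for start, end in junctions(pos, cigar):
            counts[(chrom, start, end)] += 1
    return counts


# Hypothetical alignments: two reads split across the same intron, one unspliced.
reads = [("chr1", 100, "20M500N30M"),
         ("chr1", 110, "10M500N40M"),
         ("chr1", 300, "50M")]
print(count_junction_reads(reads))
```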
Large indels and inter-chromosomal rearrangements
• Color alignment by insert size
97
Inversion, duplication, translocation
• Color alignment by read orientation
[Figure legend: normal, inversion, inversion, duplication/translocation]
