
RNA-seq Bioinformatics:

Read Mapping

Lennart Opitz, FGCZ


RNA-Seq Course 2020
Introduction Read Mapping

• General idea of mapping
• Different strategies for mapping
• Mapping file formats
• Mapping quality control
• Visualization of mapping data

Workflow: Experimental Design -> RNA Sequencing (fastq) -> Data Quality Control (fastq) -> Read Mapping (SAM/BAM) -> Read Quantification (counts) -> Differential Expression Analysis
Inputs: Reference Genome (fasta) for Read Mapping; Reference Transcriptome (GFF/GTF) for Read Quantification

2
Mapping is to locate reads in the reference genome

• Essential step when a reference genome is used


• Many NGS applications:
– Quantity – expression quantification, differential analysis (RNA-Seq)
– Base differences – SNPs, SVs (DNA-Seq)
– Patterns – protein binding sites, methylation (ChIP-Seq)

3
Mapping reads to the reference genome

• Mapping outcomes:

4
Mapping reads to the reference genome

• Mapping outcomes: Spliced alignments

6
Mapping reads to the reference genome

• Mapping outcomes: Uniquely mapped reads vs multi-reads

– Multi-reads are caused by:
- Repeats, paralogs
- Shared exons in transcriptome


Mapping of RNA-seq reads

• Reference (fasta format)


– Genome
– Transcriptome
– Junction library

• When mapped to the genome, RNA-Seq reads can span junctions


– Spliced alignments
– Example:
- Mouse retina, 60 nt paired reads
- 41 of 91 million reads map to junctions

8
Mapping millions of reads is challenging

• Quantity
– 20-30 M reads per sample

• Mapping uncertainties
– Short reads
– Sequencing errors, SNPs, structural variations
– No exact reference

• Span junctions
– Exon-exon
– Structural variations

-> Mapping algorithm must be
– Fast
– Able to handle SNPs, indels, sequencing errors
– Able to allow for introns in reference genome alignment

10
Smith-Waterman algorithm

• Exhaustive search to find the best local alignment

11
Smith-Waterman algorithm example: Where is GATTACA?
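
The GATTACA search can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any mapper: the scoring values (match +2, mismatch -1, gap -2), the function name `smith_waterman` and the toy reference string are assumptions chosen for the example.

```python
def smith_waterman(ref, read, match=2, mismatch=-1, gap=-2):
    """Return (best score, 0-based start in ref, exclusive end in ref)."""
    rows, cols = len(read) + 1, len(ref) + 1
    # H[i][j] = best local alignment score ending at read[i-1], ref[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # match or mismatch
                          H[i - 1][j] + gap,      # gap in the reference
                          H[i][j - 1] + gap)      # gap in the read
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if read[i - 1] == ref[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, best_pos[1]


# Toy reference (hypothetical); GATTACA sits at positions 4..10
score, start, end = smith_waterman("CCTGGATTACAGT", "GATTACA")
print(score, start, end)
```

Every cell of the read x reference matrix is filled, which is why the exhaustive search becomes too slow for millions of reads against a whole genome.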

12
Smith-Waterman algorithm: limitation

• Too slow: one CPU day per million reads

• High sensitivity not needed


– Only exact and close to exact matches

16
How to map millions of short reads - index

• Index
– An alphabetical list of names, subjects, etc., with references to the places where they occur
– Sorted, structured, allows fast search

• NGS mapping software

– Index reference file
– Index lookup, not direct sequence comparison
– Two main categories:
- Hash table mappers
- Burrows-Wheeler Transform (BWT) mappers

17
Hash table (seed) mappers

• Cut reference into small “seeds” (k-mers)


• Store seeds and their positions in a lookup table (hash index)
• Take part of the read and look up the hash table

• Extend seeds to full alignments


– Smith-Waterman
– Sensitive but slow

• Shorter seeds -> higher sensitivity, but slower
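
The seed-and-extend steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the toy reference and `k=4` are hypothetical, and the extension step counts mismatches over the full read (Hamming distance) instead of running Smith-Waterman, so it cannot place indels.

```python
from collections import defaultdict


def build_kmer_index(ref, k):
    """Hash index: every k-mer (seed) of the reference -> list of positions."""
    index = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        index[ref[pos:pos + k]].append(pos)
    return index


def map_read(read, ref, index, k, max_mismatches=1):
    """Look up each seed of the read, then verify the implied placement."""
    hits = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(ref):
                continue
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


ref = "ACGTACGTTAGCCGATTACAGGCTA"  # toy reference (hypothetical)
index = build_kmer_index(ref, k=4)
print(map_read("GATTACAGG", ref, index, k=4))  # exact match
print(map_read("GATTTCAGG", ref, index, k=4))  # one mismatch
```

A smaller `k` produces more seed hits per read, so more candidate placements are checked: higher sensitivity at the cost of speed.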

Flicek P. and Birney E. 2009. Nature Methods Supplement
Li H. and Homer N. 2010. Briefings in Bioinformatics
Garber M. et al. 2011. Nature Methods
18
Hash table (seed) mappers: Features and tools

• Pros:
– Tolerant to mismatches
– Better at InDel detection
– More tolerant to sequence differences (cross-species mapping)

• Cons:
– High RAM needed for large genomes (50 GB for human)
– Slower

• Tools:
– Eland, SOAP, MAQ, SHRiMP, GSNAP, …

19
Burrows-Wheeler Transform
• Sort genome and index (BWT)
• Align read base by base to find positions in the genome

Trapnell & Salzberg (2009) Nature Biotechnology 27, 455-457
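
As a minimal sketch of the "sort and index" step, the BWT of a short reference can be built by sorting all rotations and reading off the last column. The function name `bwt` is hypothetical, and the rotation sort is only practical for toy strings; real mappers build the index from a suffix array without materializing rotations.

```python
def bwt(reference):
    """Burrows-Wheeler transform via sorted rotations (toy-sized input only)."""
    text = reference + "$"  # '$' marks the end and sorts before a, c, g, t
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("acaaca"))
```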


20
BWT mappers: Bowtie

Slides courtesy of Ben Langmead (langmead@umiacs.umd.edu) 21


BWT mappers: Bowtie

• Backward search exact matching


• Searched suffix appears consecutively in BWT



BWT mappers: Bowtie

At each step:
• Searching suffix grows by one character
• The size of the range in BWT shrinks/remains the same
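
The backward search can be sketched as below. This is an illustrative toy under stated assumptions, not Bowtie's implementation: the function names are hypothetical, the suffix array is built by naive sorting, and occurrence counts are recomputed with `str.count` instead of being read from a precomputed table.

```python
def build_fm_index(reference):
    """Return (suffix array, BWT string, first-row table) for a toy reference."""
    text = reference + "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character of a row is the one preceding its suffix in the text
    bwt = "".join(text[i - 1] for i in suffix_array)
    first = {}  # first row of each character in the sorted first column
    for row, char in enumerate(sorted(text)):
        first.setdefault(char, row)
    return suffix_array, bwt, first


def backward_search(read, suffix_array, bwt, first):
    """Exact matching: extend the searched suffix by one character per step."""
    lo, hi = 0, len(bwt)  # current BWT range [lo, hi)
    for char in reversed(read):
        if char not in first:
            return []
        lo = first[char] + bwt[:lo].count(char)
        hi = first[char] + bwt[:hi].count(char)
        if lo >= hi:  # empty range: the suffix does not occur
            return []
    return sorted(suffix_array[lo:hi])


suffix_array, bwt, first = build_fm_index("acaaca")
print(backward_search("aca", suffix_array, bwt, first))
print(backward_search("cca", suffix_array, bwt, first))
```

At every step the range either shrinks or stays the same size, and the rows inside it are consecutive because the rotations are sorted.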



BWT mappers: Features and tools
• Pros:
– BWT mappers use the BWT of the reference, BWT(ref), as an index
– Little RAM is needed (2-8 GB for human genome)
– Fast for searching perfect or close matches

• Cons:
– Performance decreases with distant matches
- Reads with higher error rate
- Samples that are too distant from the reference genome

• Most NGS mapping software uses this indexing method

• Tools:
– BWA, bowtie, bowtie2, SOAP2, STAR (STAR's index is an uncompressed suffix array, a closely related structure), …

27
Choosing an Aligner

• For RNA-Seq applications, the BWT approach is mainly used

28
Un-spliced vs. spliced mappers

• Un-spliced mappers (DNA-Seq)

– Align continuous reads (not containing gaps as a result of splicing)
– When encountering an intron, the aligner stops and trims the rest of the read
– BWA, Bowtie2, BLAST
– Splice junctions are impossible to detect

• Spliced mappers (RNA-Seq)

– Aware of the presence of introns
– When encountering an intron, the aligner does not stop and trim the rest of the read but continues to find the next exon
– TopHat, STAR, GMAP, BLAT

29
Spliced mappers: Exon first

• Exonic reads are aligned first, the remaining reads are divided into smaller pieces and then mapped to the genome

• Fast, require less computational resources

• Bias towards un-spliced alignments

• MapSplice, SpliceMap, TopHat

Garber M. et al. 2011. Nature Methods 8(6):469-477 30


Spliced mappers: STAR

• Ultra fast (does index search as much as possible; STAR's index is an uncompressed suffix array, closely related to the BWT)

• Start with an index search until a mismatch is encountered, then store the possible alignment candidates, and start a new search from scratch on the rest of the read instead of doing Smith-Waterman search.

• MMP = Maximal Mappable Prefix

• Error tolerant (sensitive)

• Both short and long reads

Dobin A. et al. 2012. Bioinformatics. doi: 10.1093/bioinformatics/bts635
31
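
The "search until it fails, then restart on the rest of the read" idea can be sketched as follows. This is a simplified illustration, not STAR's algorithm: it uses Python's `str.find` in place of a suffix array, handles exact matches only, and the function names and the toy exon/intron sequences are hypothetical.

```python
def maximal_mappable_prefix(read, ref):
    """Longest prefix of `read` that occurs exactly in `ref`: (length, position)."""
    length, pos = 0, -1
    while length < len(read):
        hit = ref.find(read[:length + 1])
        if hit < 0:
            break
        length, pos = length + 1, hit
    return length, pos


def mmp_segments(read, ref):
    """Split a read into consecutive maximal mappable pieces.

    Returns a list of (offset in read, position in ref, length).
    """
    segments, offset = [], 0
    while offset < len(read):
        length, pos = maximal_mappable_prefix(read[offset:], ref)
        if length == 0:  # base absent from the reference: skip it
            offset += 1
            continue
        segments.append((offset, pos, length))
        offset += length
    return segments


# Toy gene (hypothetical): exon1 + intron + exon2
exon1, intron, exon2 = "ACGTTGCAAC", "GTAAGTCCCCCCTTTCAG", "TGGATCCATG"
ref = exon1 + intron + exon2
read = exon1[-6:] + exon2[:6]  # read spanning the exon-exon junction
print(mmp_segments(read, ref))
```

The two pieces end and start 18 bases apart on the reference, which is the length of the toy intron, so together they describe one spliced alignment.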
Choosing an alignment tool…

• Default options may not be best

“… there is no tool that outperforms all of the others in all the tests. Therefore,
the end user should clearly specify his needs in order to choose the tool that
provides the best results.” - Hatem et al BMC Bioinformatics 2013, 14:184

32
Sequence / Alignment (SAM) files

• SAM (Sequence Alignment/Map)

– Single unified format for storing read alignments to a reference genome
– Large plain text file
– RNA-Seq: >10 GB
– Exome: >50 GB, Whole Genome: 0.8-1 TB

• BAM (Binary Alignment/Map)

– Binary equivalent of SAM
– Compressed data plus index (bai)
– Developed for fast processing/indexing
– RNA-Seq: >2 GB
– Exome: 2-10 GB, Whole Genome: 100-300 GB

34
SAM/BAM Header

35
SAM/BAM read information

36
SAM Format Read information

37
SAM Format Read Information

38
SAM Format FLAG Values
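
The FLAG field is a bitwise combination of the values defined in the SAM specification. A small sketch for decoding it is shown below; the bit values follow the specification, while the function name `decode_flag` and the short descriptions are our own wording.

```python
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties encoded in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))  # 99 = 1 + 2 + 32 + 64
```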

39
Mapping quality score

• A score that indicates how well the read is aligned


• Probability that a read is mapped incorrectly

• Phred scaled
– mapP = probability that a read is mapped incorrectly
– mapQ = -10 * log10(mapP)
– Q30: 1 in 1000 incorrect
– Ranges from 0 to 254
– The higher, the better
– 255: mapping quality not available; STAR uses 255 for unique alignments

• Mapper dependent, difficult to compute
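
The Phred scaling can be checked with a short calculation. This sketch only converts between the two scales; the function names are hypothetical, and it does not show how a mapper estimates the error probability in the first place.

```python
import math


def mapq_from_error_prob(map_p, cap=254):
    """Phred-scale a mapping error probability: mapQ = -10 * log10(mapP)."""
    if not 0.0 < map_p <= 1.0:
        raise ValueError("map_p must be in (0, 1]")
    return min(cap, round(-10 * math.log10(map_p)))


def error_prob_from_mapq(mapq):
    """Inverse conversion: mapP = 10 ** (-mapQ / 10)."""
    return 10 ** (-mapq / 10)


print(mapq_from_error_prob(0.001))  # 1 in 1000 incorrect
print(error_prob_from_mapq(20))     # Q20
```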

40
Meaning of mapQ30

• The overall base quality of the read is good.

• The read has few or just one `good' hit on the reference -> best alignment can be easily identified

• The best alignment has few mismatches -> actual mutations or sequencing errors.

• mapQ30 is usually required for SNP calling algorithms and detection of structural variations

41
Cause of poor mapQ

• Poor quality reads (Low base quality -> low mapping quality)
• Paired end reads or not (Reads mapped in pairs -> more likely to be correct)

• Reference with poor quality


• High divergence between the sequenced population and reference
• Repeats (Reads in repetitive regions -> very low mapping quality)

• Poor choice of the mapping software (An algorithm with low sensitivity ->
more mapping errors)
• Improper alignment parameters

42
Why QC mapping results

• To check whether the data pre-processing and mapping were done properly

– e.g. an extremely low mapping rate for within-species read mapping indicates a problem

• To check whether the alignment results fit expected outcomes

– Abundance of genomic features

• To capture technical issues

– Sample degradation
– Over-amplification issues (duplication rate)

44
Mapping QC metrics: Summary stats

• How well did reads align to the reference?

• Summary statistics
– % reads with no alignment
– % reads with unique alignment
– % reads with multiple alignments

• We aim for >70% unique alignments

• Samples with lower mapping rates should be investigated (contamination vs. quality issues)
45
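
These three percentages can be computed directly from SAM text, as in the sketch below. It rests on stated assumptions: single-end reads, and an aligner that writes the optional `NH:i:` tag (number of reported alignments); reads without the tag are counted as unique. The function name and the four records are made-up toy data for illustration, not real results.

```python
def mapping_summary(sam_lines):
    """Percent of reads that are unmapped, uniquely mapped or multi-mapped."""
    counts = {"unmapped": 0, "unique": 0, "multi": 0}
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:  # secondary or supplementary: read already counted
            continue
        if flag & 0x4:
            counts["unmapped"] += 1
            continue
        hits = 1
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("NH:i:"):
                hits = int(tag[5:])
        counts["unique" if hits == 1 else "multi"] += 1
    total = sum(counts.values())
    return {key: 100.0 * n / total for key, n in counts.items()} if total else counts


toy_sam = [  # made-up records
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1",
    "r2\t0\tchr1\t500\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:3",
    "r2\t256\tchr2\t900\t1\t10M\t*\t0\t0\t*\t*\tNH:i:3",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r4\t16\tchr3\t42\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1",
]
print(mapping_summary(toy_sam))
```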
Read position specific error rate

[Figure panels: Good, Suboptimal]

-> If suboptimal: hard trimming of first 2-3 bases


46
Mapping QC metrics: Abundance of genomic features

• Relative abundance of annotation features: intron, exon, up/down stream, unannotated
• Are there samples with different abundance of certain features?

47
Sample degradation check: Transcript Coverage Bias

[Figure panels: Good, Suboptimal]

-> If suboptimal: exclude degraded sample


48
Overamplification check: Duplication rate QC

[Figure panels: Good, Bad]

-> If bad: repeat library prep


49
RNA-seq mapping QC tools

• RNA-SeQC
– https://confluence.broadinstitute.org/display/CGATools/RNA-SeQC

• EVER-seq (RSeQC)
– http://code.google.com/p/rseqc/

• Qualimap
– http://qualimap.bioinfo.cipf.es/

50
How to visualize mapping results

• Using a genome browser

– Software that enables users to browse multiple data types and annotations in the context of the genome

• UCSC and Ensembl browser

– Reference genome should be hosted on the UCSC/Ensembl server
– Upload customized data files or import via URL

• Integrative Genomics Viewer (IGV) (Broad Institute)

– Open source
– Well maintained, actively developed
– Platform independent, easy to use

52
Mapped reads in IGV

[Figure labels: reference track, loaded BAM file, reads, splice junctions, annotation track]
53
SNPs

54
Single base deletions

55
Single base insertions

56
Sashimi plots

• Quantitatively visualize splice junctions

[Figure labels: coverage; splice junctions with number of split reads; different colors differentiate samples]

57
Interactive mapping QC using IGV

• Are my data behaving as expected?

– e.g. no reads mapped to the knock-out site

58
Read mapping: Summary

• Different mapping tasks
– No splicing: general mappers
– Splicing: spliced mappers

• General mappers
– Index lookup instead of direct sequence comparison
– Hash table indexing – seed methods
– Burrows-Wheeler transform methods (Suffix/prefix trees)

• Spliced mappers
– Build upon general mappers

• Explore the mapping results to spot unexpected trends early
– Mapping QC tools
– IGV

59
Extra Slides, Read Mapping

60
Burrows-Wheeler Transform (BWT) mappers: Index

reference: acaaca$

All rotations of the reference:
acaaca$
caaca$a
aaca$ac
aca$aca
ca$acaa
a$acaac
$acaaca

Sorted rotations:
$acaaca
a$acaac
aaca$ac
aca$aca
acaaca$
ca$acaa
caaca$a

index (last column of the sorted rotations): acca$aa
65
BWT mappers: Find T knowing BWT(T)

• We give each character a rank, equal to the number of times the character has appeared so far

a1 c1 a2 a3 c2 a4 $

• Transform produces a BWT matrix; F is its first column and L its last column (ranks in the table count occurrences within each column)

BWT matrix   F    L
$acaaca      $    a1
a$acaac      a1   c1
aaca$ac      a2   c2
aca$aca      a3   a2
acaaca$      a4   $
ca$acaa      c1   a3
caaca$a      c2   a4

• BWT matrices have a property called the Last First (LF) Mapping: the ith occurrence of character c in the last column corresponds to the same text character as the ith occurrence of c in the first column.

• Find T knowing BWT(T), one character per step, starting from $:

T = $
T = a$
T = ca$
T = aca$
T = aaca$
T = caaca$
T = acaaca$
75
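
The reconstruction of T from BWT(T) can be sketched with the LF mapping as below. This is a minimal illustration with a hypothetical function name; ranks are 0-based counts of earlier occurrences within the last column.

```python
def inverse_bwt(last_column):
    """Rebuild the text (ending in '$') from its BWT using the LF mapping."""
    # Rank of each character = number of earlier occurrences in L
    seen, ranks = {}, []
    for char in last_column:
        ranks.append(seen.get(char, 0))
        seen[char] = seen.get(char, 0) + 1
    # Row at which each character's block starts in the sorted first column F
    first, row = {}, 0
    for char in sorted(seen):
        first[char] = row
        row += seen[char]
    # Start in the row whose first character is '$' and walk backwards
    row, text = 0, "$"
    while last_column[row] != "$":
        char = last_column[row]
        text = char + text
        row = first[char] + ranks[row]  # LF mapping: same character, in F
    return text


print(inverse_bwt("acca$aa"))
```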
BWT mappers: Example

reference: acaaca
read: aca

F              L
$   a c a a c  a4
a4  $ a c a a  c2
a2  a c a $ a  c1
a3  c a $ a c  a2
a1  c a a c a  $
c2  a $ a c a  a3
c1  a a c a $  a1

• The read is matched from its last character to its first; after the final step the read position is found (rows a3 and a1)

reference: acaaca
read: cca

• After matching "ca" (rows c2 and c1), no row in the range has c in the L column (no C!), so the read does not occur in the reference

84
BWT mappers: Bowtie

• A mismatch in the search suffix -> empty BWT range/failed index lookup
• Sequencing error (Illumina 1/1000)
• True mutation (A. thaliana 7 × 10^-9 per site per generation)
• Mismatches are not rare events (at least 10% of >100 nt reads)
85
BWT mappers: Bowtie

Empty BWT range activates backtracking

• All different bases are tried at the mismatched position
• Chop reads in short segments (seeds), align those mismatch-free, stitch seed alignments together
86
Spliced mappers: Seed extend

• Each read is divided into k-mers which are mapped to the genome using table lookup. Mapped k-mers are extended into larger alignments which may include gaps flanked by splice sites.

• Little bias, best placement of each read

• More tolerant to sequence differences

• GSNAP, PALMA

Garber M. et al. 2011. Nature Methods 8(6):469-477
87


An example of read mapping

Slide modified from Heng Li’s presentation 88


Corresponding SAM file

SAM fields (in column order): QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS ISIZE SEQ QUAL
Slide modified from Heng Li’s presentation
89
CIGAR string - compact representation of an alignment

• M - match or mismatch
• I – insertion
• D – deletion
• S - soft clip
– Clipped sequences stored in SAM
• H - hard clip
– Clipped sequences not stored in SAM
• N – skipped reference bases, splicing

Match/mismatch, indels
Ref:   ACGCAGTG--GT
Read:  ATGCA-TGCAGT
Cigar: 5M1D2M2I2M

Soft clipping
REF:   ATCGTGTAACCTGACTAGTTAA
READ:  gggGTGTAACC-GACTAGgggg
Cigar: 3S8M1D6M4S

Hard clipping
REF:   ATCGTGTAACCTGACTAGTTAA
READ:  gggGTGTAACC-GACTAGgggg
Cigar: 3H8M1D6M4H
90
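
A CIGAR string can be split into (length, operation) pairs to work out how many reference and read bases an alignment covers. The sketch below follows the operation semantics of the SAM specification; the function names and the spliced example `50M1000N50M` are our own illustrations.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Read bases stored in SEQ: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


print(parse_cigar("5M1D2M2I2M"))
print(reference_span("5M1D2M2I2M"), read_length("5M1D2M2I2M"))
print(reference_span("50M1000N50M"), read_length("50M1000N50M"))  # spliced read
```

For the spliced example, a 100-base read covers 1100 reference bases because the `N` operation skips a 1000-base intron.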
Mapping QC metrics: Transcript coverage

• How many transcripts are expressed/unexpressed


• Length coverage of expressed transcripts

A third mode in this distribution would indicate that there are genes covered only partially

91
Mapping QC metrics: Transcript coverage bias

• Check whether any part of the transcripts is under-covered

[Figure panels: 3’ bias, no bias, 5’ bias]

93
Mapping QC metrics: Junction saturation

• Depth needed for alternative splicing analysis


• All annotated splice junctions are detected - a saturated RNA-seq dataset

94
Splice junction track

• Computed dynamically from alignment data (must be from spliced aligner)

• A splicing event is drawn when at least a single read splits across two exons

• The height and thickness of the arc are proportional to the coverage depth

[Figure labels: junctions from the – strand, junctions from the + strand]

96
Large indels and inter-chromosomal rearrangements

• Color alignment by insert size

97
Inversion, duplication, translocation

• Color alignment by read orientation

[Figure labels: normal, inversion, inversion, duplication/translocation]

98
