
Base Quality Score Recalibration

Assigning accurate confidence scores to each sequenced base

You are here in the GATK Best Practices workflow for germline variant discovery

Data Pre-processing >> Variant Discovery >> Callset Refinement

[Workflow diagram]
Data Pre-processing: Raw Reads -> Map to Reference (BWA mem, non-GATK) -> Mark Duplicates & Sort (Picard) -> Indel Realignment -> Base Recalibration -> Analysis-Ready Reads
Variant Discovery: Analysis-Ready Reads -> Var. Calling (HC in ERC mode) -> Genotype Likelihoods -> Joint Genotyping -> Raw Variants (SNPs, Indels) -> Variant Recalibration (separately per variant type) -> Analysis-Ready Variants
Callset Refinement: Analysis-Ready Variants (SNPs & Indels) -> Genotype Refinement, Variant Annotation, Variant Evaluation -> look good? -> troubleshoot / use in project
Real data is messy -> properly estimating the evidence is critical
Quality scores issued by sequencers are inaccurate and biased

•  Quality scores are critical for all downstream analysis

•  Systematic biases are a major contributor to bad calls

Example of bias: qualities reported depending on nucleotide context


[Figure: Empirical − Reported Quality by dinucleotide context (AA, AG, CA, CG, GA, GG, TA, TG). Left panel, original: RMSE = 4.188. Right panel, recalibrated: RMSE = 0.281.]
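The RMSE figures quoted on the plots summarize how far reported qualities sit from empirical ones. A minimal sketch of an unweighted RMSE follows; the slides do not say whether the plotted values are weighted by observation count, so treat this as an illustration only. The function name and the sample numbers are hypothetical.

```python
import math


def rmse(empirical, reported):
    """Root-mean-square of (empirical - reported) over paired quality values."""
    if len(empirical) != len(reported) or not empirical:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - r) ** 2 for e, r in zip(empirical, reported)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical per-context qualities, invented purely for illustration.
reported = [30.0, 30.0, 30.0, 30.0]
empirical = [26.0, 33.0, 30.0, 25.0]
print(round(rmse(empirical, reported), 3))
```

A value near zero means the reported scores track the observed error rates closely, which is what the recalibrated panels show.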
How do we identify the error modes in the data?

•  Systematic errors correlate with basecall features

•  Several relevant features:
–  Reported quality score
–  Position within the read (machine cycle)
–  Sequence context (sequencing chemistry effects)

[Figure repeated: Empirical − Reported Quality by dinucleotide context, original data, RMSE = 4.188]

•  Calculate error empirically and find patterns in how error varies with basecall features

•  Method is empowered by looking at entire lane of data (works per read group)
How do we calculate the empirical qualities?

•  Any sequence mismatch = error except known variants*!

•  Keep track of number of observations and number of errors as a function of various error covariates (lane, original quality score, machine cycle, and sequencing context)
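The bookkeeping described above can be sketched as a tally keyed by covariates. This is a simplified illustration, not GATK's implementation: it uses a single joint key for all covariates, and the record layout (`pos`, `read_group`, `reported_q`, `cycle`, `context`, `base`, `ref`) is a hypothetical one chosen for the example.

```python
from collections import defaultdict


def tally(observations, known_sites):
    """Count observed bases and mismatches per covariate combination.

    observations: iterable of dicts with the hypothetical keys
        'pos', 'read_group', 'reported_q', 'cycle', 'context', 'base', 'ref'.
    known_sites: set of reference positions to skip (known variants).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_observed, n_mismatches]
    for obs in observations:
        if obs["pos"] in known_sites:
            continue  # a mismatch here may be real variation, not an error
        key = (obs["read_group"], obs["reported_q"], obs["cycle"], obs["context"])
        counts[key][0] += 1
        if obs["base"] != obs["ref"]:
            counts[key][1] += 1
    return dict(counts)


# Invented observations: three bases in the same covariate bin, one at a known site.
example = [
    {"pos": 100, "read_group": "rg1", "reported_q": 30, "cycle": 1, "context": "AA", "base": "A", "ref": "A"},
    {"pos": 101, "read_group": "rg1", "reported_q": 30, "cycle": 1, "context": "AA", "base": "C", "ref": "A"},
    {"pos": 102, "read_group": "rg1", "reported_q": 30, "cycle": 1, "context": "AA", "base": "G", "ref": "A"},
]
result = tally(example, known_sites={101})
print(result)
```

The mismatch at position 101 is ignored because it falls on a known variant site; the mismatch at 102 is counted as an error.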

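$$Q_{\text{empirical}} = -10 \log_{10}\left(\frac{\#\text{ of reference mismatches} + 1}{\#\text{ of observed bases} + 2}\right)$$

(the smoothed mismatch rate, PHRED-scaled into a quality score)

* If you don't have known variation, bootstrap (see later on)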
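As a worked illustration of the ratio above, the sketch below turns a mismatch count and an observation count into a PHRED-scaled quality. The function name is hypothetical; the +1 and +2 terms are the ones shown on the slide, and the PHRED scale is the standard Q = -10 log10(p).

```python
import math


def empirical_quality(n_mismatches, n_observed):
    """PHRED-scaled quality from smoothed counts: (mismatches + 1) / (observed + 2)."""
    error_rate = (n_mismatches + 1) / (n_observed + 2)
    return -10 * math.log10(error_rate)


print(round(empirical_quality(0, 98), 2))   # smoothed error rate 1/100
print(round(empirical_quality(9, 998), 2))  # smoothed error rate 10/1000
print(round(empirical_quality(0, 0), 2))    # no data: smoothed error rate 1/2
```

The added constants keep the estimate finite when a bin has no mismatches and pull bins with very few observations toward a low-confidence value.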


Applying recalibration is simple

For each base in each read:

- is it in AA context? -> adjust by X points
- ...
- is it at 3rd position? -> adjust by Y points
- ...
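The per-base adjustment can be pictured as a lookup and a sum. The sketch below assumes a purely additive scheme with one table per covariate, which is a simplification of the slide's "adjust by X points" idea and not necessarily how GATK combines its covariates. The function name, the delta tables, and the clamping bounds are all hypothetical.

```python
def recalibrate(reported_q, context, cycle, context_delta, cycle_delta, max_q=60):
    """Shift a reported quality by per-context and per-cycle deltas, then clamp."""
    adjusted = reported_q + context_delta.get(context, 0) + cycle_delta.get(cycle, 0)
    return max(1, min(max_q, round(adjusted)))


# Invented adjustment tables: "AA context -> X points", "3rd position -> Y points".
context_delta = {"AA": -3}
cycle_delta = {3: -2}

print(recalibrate(30, "AA", 3, context_delta, cycle_delta))   # both adjustments apply
print(recalibrate(30, "CG", 10, context_delta, cycle_delta))  # no adjustment applies
```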
Different sequencing technologies have different error modes

Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!

[Figure: before/after recalibration plots for five sequencing technologies. Panel titles, left to right: SLX GA, 454, SOLiD, Complete Genomics, HiSeq. RMSE values below are listed in panel order.]

Empirical Quality vs. Reported Quality
  Original RMSE:     5.242, 2.556, 1.215, 4.479, 5.634
  Recalibrated RMSE: 0.196, 0.213, 0.756, 0.235, 0.135

Accuracy (Empirical − Reported Quality) vs. Machine Cycle (first-of-pair and second-of-pair reads)
  Original RMSE:     2.207, 1.784, 1.688, 2.679, 2.609
  Recalibrated RMSE: 0.186, 0.136, 0.213, 0.182, 0.089

Accuracy (Empirical − Reported Quality) vs. Dinucleotide (AA, AG, CA, CG, GA, GG, TA, TG)
  Original RMSE:     2.598, 2.169, 1.656, 3.503, 2.469
  Recalibrated RMSE: 0.052, 0.135, 0.088, 0.06, 0.083
Per-base indel error rates are higher with some sequencing technologies

[Figure: Empirical gap open penalty (y-axis, 0 to 50) by 3-base suffix (x-axis, GGG through TAT), panel title "AAAAA context", one series per read group (20FUK.1 through 20FUK.8), with HiSeq and PacBio data labeled]

Per-base indel error estimates are required for accurate indel calling on new technologies with an indel-rich error model, such as Pacific Biosciences.
Base Recalibration steps/tools

•  Model the error modes and recalibrate qualities
➔  BaseRecalibrator

•  Write the recalibrated data to file
➔  PrintReads

•  Make before/after plots
➔  AnalyzeCovariates

Two complementary paths: data processing and plotting

[Diagram: BaseRecalibrator (1) -> PrintReads -> RECALIBRATED BAM (data processing); BaseRecalibrator (1) -> BEFORE and BaseRecalibrator (2) -> AFTER, both -> AnalyzeCovariates -> PLOTS (plotting)]
Data processing path

Original BAM file + Known sites -> BaseRecalibrator (1) -> Recalibration table -> PrintReads -> Recalibrated BAM file


Processing step 1: build the model



TOOL TIPS
BaseRecalibrator

•  Builds recalibration model

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R human.fasta \
    -I realigned.bam \
    -knownSites dbsnp137.vcf \
    -knownSites gold.standard.indels.vcf \
    -o recal.table

The recalibration table contains the adjustment factors


Processing step 2: write the recalibrated BAM file



TOOL TIPS
PrintReads

•  General-use tool co-opted with the -BQSR flag and fed a recalibration report

java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R human.fasta \
    -I realigned.bam \
    -BQSR recal.table \
    -o recal.bam

•  Creates a new BAM file, using the recalibration table generated previously, with exquisitely accurate base substitution, insertion, and deletion quality scores

•  Original qualities retained with OQ tag
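For scripted runs, the two data-processing commands can be assembled programmatically. The sketch below only builds and prints the command lines, using the same flags and file names shown on the slides; the helper function names are hypothetical, and actually running the steps would require Java and the GATK jar to be available.

```python
import shlex

GATK_JAR = "GenomeAnalysisTK.jar"  # jar name as shown on the slides


def base_recalibrator_cmd(reference, bam, known_sites, out_table):
    """Step 1: build the recalibration model."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "BaseRecalibrator", "-R", reference, "-I", bam]
    for vcf in known_sites:
        cmd += ["-knownSites", vcf]
    return cmd + ["-o", out_table]


def print_reads_cmd(reference, bam, recal_table, out_bam):
    """Step 2: write the recalibrated BAM file."""
    return ["java", "-jar", GATK_JAR, "-T", "PrintReads",
            "-R", reference, "-I", bam, "-BQSR", recal_table, "-o", out_bam]


steps = [
    base_recalibrator_cmd("human.fasta", "realigned.bam",
                          ["dbsnp137.vcf", "gold.standard.indels.vcf"], "recal.table"),
    print_reads_cmd("human.fasta", "realigned.bam", "recal.table", "recal.bam"),
]
for step in steps:
    print(shlex.join(step))
    # To execute for real: import subprocess; subprocess.run(step, check=True)
```

Keeping the table name in one place makes it harder to feed PrintReads a table from a different run.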



The recalibrated BAM file is ready for downstream processing



This is what a recalibrated BAM record looks like

Recalibrated Base Qualities (QUAL field)

ACCTTCCCCCAGCCCCTACCCCCAGACAGGCCCCGGTGTGTTGTGTTCCCCT
CCCTCTGTCCATGTGTTCTCATTGTTCAACTCTCATTTATGAGTGAGAACAT
CGGGGGTTTGGTTTTCTGTTCTTGGATTAGTTTGGTGAGAATGATGG
<;<>==>=>>6>=>>>??+<>>>?3::*<>8=>>8?/=.
3/7;<<;>=???>???@=1=>=?+=>?=.<=A@;??,>?=;4:?>1>
+>=?:@=>?/;4??<@+??9<;+8/<-,?:<@>:@=/-.@>=@9/?)=6???
+:@=B=####### MC:Z:151M MD:Z:108T29C12
PG:Z:MarkDuplicates.4 RG:Z:H01PE.2 NM:i:2
MQ:i:0 OQ:Z:AAFFAFJFJJ<FFJJJJJ-AJJJJ7AA-
AJ<FJJJJ-F-7-<AAAAJFJJJFJJJJF-FFFJ-FFJF-FFJJAJJ-
FJAA7AAF-F-FFJAJAFF-A7FFAJ-FFFAA-<-A--F<AJF<FA---
AFAF<-F-A7FFF-<FAJA####### UQ:i:24 AS:i:141

Original Base Qualities (OQ tag)
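To read the record above numerically, the quality strings can be decoded character by character. The sketch assumes the usual SAM convention of Phred+33 ASCII encoding for both the QUAL field and the OQ tag, and it pairs the first seven characters of each string as they appear on the slide (the record is line-wrapped there, so the pairing by position is an assumption).

```python
def decode_phred33(qual_string):
    """Convert a Phred+33 encoded quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in qual_string]


# First seven characters of the recalibrated QUAL field and of the OQ tag.
recalibrated = decode_phred33("<;<>==>")
original = decode_phred33("AAFFAFJ")

for old_q, new_q in zip(original, recalibrated):
    print(old_q, "->", new_q)
```

In this excerpt the original scores sit in the 30s and low 40s, while the recalibrated scores for the same bases are in the high 20s.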


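We already did the first recalibration

Original BAM file + known sites -> BaseRecalibrator (1) -> Recalibration table (already done in data processing path)
Original BAM file + known sites + Recalibration table -> BaseRecalibrator (2) -> Recalibration table (2)
Recalibration table + Recalibration table (2) -> AnalyzeCovariates -> Plots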


Now we build the "after" model to evaluate remaining error


TOOL TIPS
BaseRecalibrator (2)

•  Second pass evaluates what the data looks like after recalibration

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R human.fasta \
    -I realigned.bam \
    -knownSites dbsnp137.vcf \
    -knownSites gold.standard.indels.vcf \
    -BQSR recal.table \
    -o after_recal.table

The second recalibration table contains the remaining error

The two recalibration tables are used to produce before/after plots


TOOL TIPS
AnalyzeCovariates

•  Makes plots based on before/after recalibration tables

java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates \
    -R human.fasta \
    -before recal.table \
    -after after_recal.table \
    -plots recal_plots.pdf

•  There is an option to keep the intermediate .csv file used for plotting, if you want to play with the plot data.
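The plotting path depends on the second-pass table being the one handed to AnalyzeCovariates. A small sketch of wiring the two commands together follows; it only constructs and prints the command lines from the flags shown on the slides, and the `gatk_cmd` helper is a hypothetical convenience, not part of GATK.

```python
import shlex


def gatk_cmd(tool, *args, jar="GenomeAnalysisTK.jar"):
    """Assemble a GATK command line as a list of arguments."""
    return ["java", "-jar", jar, "-T", tool, *args]


second_pass = gatk_cmd(
    "BaseRecalibrator",
    "-R", "human.fasta", "-I", "realigned.bam",
    "-knownSites", "dbsnp137.vcf", "-knownSites", "gold.standard.indels.vcf",
    "-BQSR", "recal.table", "-o", "after_recal.table",
)
plots = gatk_cmd(
    "AnalyzeCovariates",
    "-R", "human.fasta",
    "-before", "recal.table", "-after", "after_recal.table",
    "-plots", "recal_plots.pdf",
)

# The table written by the second pass should be the one plotted as "after".
after_table = second_pass[second_pass.index("-o") + 1]
print(plots[plots.index("-after") + 1] == after_table)
print(shlex.join(second_pass))
print(shlex.join(plots))
```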
The plots allow us to evaluate the extent and effectiveness of recalibration

Did the recalibration work properly?

Post-recalibration quality scores should fit the empirically-derived quality scores very well
=> no obvious systematic biases should remain

[Figure: background of this slide repeats the before/after plots by technology (Empirical vs. Reported Quality, Accuracy vs. Machine Cycle, Accuracy vs. Dinucleotide) with the same Original and Recalibrated RMSE values]

Further reading
http://www.broadinstitute.org/gatk/guide/best-practices

http://www.broadinstitute.org/gatk/guide/article?id=44

https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php

https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php

https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_AnalyzeCovariates.php
