
AI & Computer Vision
UFMFEV-30-M: Genome Sequencing
Overview
• High Throughput Sequencing Technologies
• 454
• IonTorrent
• Illumina
• PacBio
• Oxford Nanopore
• Sequencing Data and Assembly
• Reads
• De novo and reference-based assembly

2 08/19/2021 AI & Computer Vision: Application in Healthcare (UFMFEV-30-M)


What is Sequencing Used For?
• Whole genome sequencing
• Genomic analysis
• Targeted genomic resequencing
• Metagenomics
• Transcriptomics
• Transposon sequencing (Tn-Seq)



High-throughput sequencing: Overview and Approaches

High Throughput Sequencing
• The Human Genome Project (HGP) changed the DNA sequencing process
• 13-year (1990–2003), roughly $3 billion USD effort
• Draft sequence published in 2001; project declared complete in 2003
• Lowered cost to $100 million per genome
• Used Sanger Sequencing
• Spurred the development of cheaper sequencing
• Race for the development of the fastest, cheapest and most accurate
sequencers



Why is HTS important?
• Genetic material from HGP comes from only a few individuals
• We still don’t understand most of the genome
• Need to sequence thousands to millions of individuals to study genetic diseases
and function of genes
• Can’t do this with Sanger sequencing



Sequencing costs
• Sequencing costs have dropped dramatically

https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data



Library Preparation
• All NGS approaches rely on a library preparation
• Uses either native or amplified DNA
• Fragmentation
• Size selection
• Adaptor ligation



Roche (454)

• Pyrosequencing
  Template:       ACCTTGAGTACCATCTAGGA---------
  Growing strand:              AGATCCT---------
• Polymerase: incorporates a dNTP (dATP shown) and releases PPi
• ATP-sulfurylase: converts PPi to ATP
• Luciferase: uses ATP to produce light
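
The light signal in pyrosequencing scales with the number of identical bases incorporated in one nucleotide flow, which is why homopolymer runs are hard to count exactly. The sketch below is a toy model of an idealised, noise-free flowgram. The function name `flowgram` and the flow order "TACG" are assumptions made for illustration, not taken from the slides.

```python
def flowgram(strand, flow_order="TACG"):
    """Return (base, signal) pairs for an idealised, noise-free run.

    strand     : the sequence being synthesised (5' to 3')
    flow_order : order in which nucleotides are flowed (assumed here)
    The signal is the number of bases incorporated during that flow.
    """
    if not set(strand) <= set(flow_order):
        raise ValueError("strand contains a base that is never flowed")
    signals = []
    pos = 0
    while pos < len(strand):
        for base in flow_order:
            run = 0
            # A homopolymer run is incorporated within a single flow.
            while pos < len(strand) and strand[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
    return signals
```

In this model the run "TT" gives a single signal of 2 in the T flow. On a real instrument that signal is noisy, so long runs are easy to miscount.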



Roche (454)
• Library Preparation
• Emulsion PCR
• Loading
• Beads are loaded in 25 µm pico-wells



Roche (454)
• Sequencing by Synthesis
• Each bead contains copies of
a single DNA fragment that
is clonally amplified
• One bead = one read
• ~1 million reads/run
• Reads up to 700 bp



IonTorrent
• Does not make use of optical signals
• Exploits the release of H+ ions when a dNTP is added to a DNA polymer
• Semiconductor Sequencing



Illumina
• 90% of sequencing market
• Imaging-based method
• Many reads – millions to billions per run
• High fidelity: > 99.9% accuracy
• $1,000 human genome in 48 hours
• Sequencing by synthesis

HiSeq Flow Cell – 3 billion reads



Illumina

                  MiniSeq   MiSeq   NextSeq   HiSeq X   Sanger
Reads (millions)  16        50      260       13,000    0.0004
Gigabases/day     2.4       5.1     39        4000      0.001
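
To put the gigabases/day row in perspective, the sketch below estimates how long each platform would need to produce enough data for one human genome. The throughput figures are copied from the table. The genome size (3.1 Gb) and the 30x target depth are assumptions added for illustration.

```python
# Throughput figures from the table above (gigabases per day).
GIGABASES_PER_DAY = {
    "MiniSeq": 2.4,
    "MiSeq": 5.1,
    "NextSeq": 39,
    "HiSeq X": 4000,
    "Sanger": 0.001,
}

def days_to_sequence(genome_gb, depth, gigabases_per_day):
    """Days needed to generate genome_gb * depth gigabases of raw data."""
    return genome_gb * depth / gigabases_per_day

# Assumed values: 3.1 Gb human genome, 30x depth.
for platform, rate in GIGABASES_PER_DAY.items():
    print(f"{platform}: {days_to_sequence(3.1, 30, rate):,.2f} days")
```

Under these assumptions the Sanger figure comes to tens of thousands of days, which supports the earlier point that population-scale studies are not feasible with Sanger sequencing.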



Illumina: Sample Preparation
• DNA is fragmented (enzymatic or physical shear forces)
• Ligated to primer and adaptor sequences
• Index sequences can be added for multiplexing
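
Multiplexing works because each sample's library carries its own index sequence, so pooled reads can be sorted back into samples after the run. A minimal sketch, assuming the index has already been extracted for each read and must match exactly. The function name, the sample names and the "undetermined" bin are hypothetical.

```python
from collections import defaultdict

def demultiplex(reads, sample_by_index):
    """Group reads by sample using an exact match on the index sequence.

    reads           : iterable of (index_sequence, read_sequence) pairs
    sample_by_index : dict mapping index sequence -> sample name
    Reads with an unknown index go to the 'undetermined' bin.
    """
    bins = defaultdict(list)
    for index, sequence in reads:
        sample = sample_by_index.get(index, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)

# Hypothetical pooled run with two indexed samples.
pooled = [("ACGT", "GGGTTT"), ("TTAG", "CCCAAA"), ("ACGT", "GATTACA"), ("NNNN", "AAAA")]
print(demultiplex(pooled, {"ACGT": "sample_A", "TTAG": "sample_B"}))
```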

[Figure: library fragment with primer and adaptor sequences flanking the insert]



Illumina: Cluster Generation
• Clonal amplification by bridge PCR



Illumina:
Reversible Terminator Chemistry
• Polymerase bound to template incorporates a
single fluorescently modified nucleotide
• Incorporation terminates DNA synthesis
• Remaining nucleotides are washed away
• Imaging followed by a cleavage step to remove
the inhibiting group and fluorophore

Nat Rev Genet 11, 31–46


Paired end Sequencing
• Allows sequencing of both ends of a
fragment
• Because the distance between each paired
read is known, alignment algorithms can use
this information to map the reads over
repetitive regions more precisely

[Figure: the /1 read and the /2 read are sequenced from opposite ends of the same fragment]



Illumina



Long-read technologies
• Long-read technologies can generate continuous sequences of kb to Mb
• There are currently two long-read sequencing technologies
• PacBio
• Oxford Nanopore

• Why would long reads be advantageous for sequencing projects?



PacBio
• PacBio SMRT technology uses a circular DNA template
• Composed of a double-stranded DNA insert with single-stranded hairpin
adapters on either end

[Figure: template layout: hairpin adapter, insert, hairpin adapter]



PacBio
• ZMWs (Zero-mode waveguides)
• Picolitre (10^-12 L) volume wells
• Template and polymerase immobilised on the
bottom of the well
• Fluorescently-labelled dNTPs are briefly held
in the detection volume
• The phosphate-linked fluorophore is cleaved
from the nucleotide as part of the
incorporation of the base

Nature Reviews Genetics 21, pp 597–614


PacBio Read Types
• Continuous Long Reads (CLR)
• Inserts >30kb
• Single pass by Polymerase
• High Fidelity Reads
• Circular consensus sequencing (CCS)
• Smaller inserts 10-30kb
• Polymerase can make several passes
through the SMRTbell template
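
Several passes over the same insert improve accuracy because random errors in individual passes can be outvoted. The sketch below shows the idea with a simple per-position majority vote. It assumes the passes are already aligned and of equal length, which is a strong simplification (real passes contain insertions and deletions), and it is not PacBio's CCS algorithm. The function name `consensus` is hypothetical.

```python
from collections import Counter

def consensus(passes):
    """Per-position majority vote over equal-length, pre-aligned passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("need at least one pass, all of equal length")
    # zip(*passes) yields one column of bases per template position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))

# Five hypothetical passes over a 6-base insert, three with one error each.
passes = ["ACGTAC", "ACGAAC", "ACGTAC", "TCGTAC", "ACGTCC"]
print(consensus(passes))
```

Because each error here sits at a different position, every column still has a clear majority and the vote recovers the error-free sequence.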

Nature Reviews Genetics 21, pp 597–614


PacBio



Oxford Nanopore
• Direct sequencing – no synthesis!
• Protein nanopores
• In nature function as gateways between two systems
• R9 Pore
• CsgG from E. coli
• Nonameric lipoprotein
• Shape and dimensions ideal
• Heavily engineered (>700 mutants)



Oxford Nanopore: Principles
• Nanopores are set in an electrically resistant polymer membrane
• An ionic current is passed through the nanopore by setting a voltage across the
membrane
• If an analyte passes through the pore, this creates a characteristic disruption in
current



Oxford Nanopore



Pros & Cons
• Each of the sequencing technologies has advantages and disadvantages
• Illumina
• Accurate sequencing; short reads (150+ bp); billions of sequences; long run time
• 454
• Susceptible to homopolymeric errors; long reads (500+ bp); millions of sequences; 8 hour run
• Ion torrent
• Longer reads (500+ bp); millions of sequences; mate-pair libraries; fast 2 hr run
• PacBio
• Very long sequences (kbp to Mbp); single molecule sequencing; DNA often falls off the
polymerase; expensive
• Nanopore
• Long sequences (kbp to Mbp); minimal chemistry and sample prep; cheap; lower accuracy
(currently ~90%); fast; large DNA input requirements (µg)



Genome Sequencing: Data & Assembly

Read Analysis
• The first step in assembly is to examine the read data
• Sequencing data is often stored in FASTQ format
• Text-based format for storing a nucleotide sequence
and corresponding quality scores

@HWI-D00151:214:HYFTWADXX:1:1101:2002:2201 1:N:0:CAGAGAGGTATCCTCT
GCTCTACACGGTAGTAAACACGACGAGGCACACCCATCTTTTTTTCAGAG
+
8BB;FFFFFFFFFIFII@I=II…-&-&-*,,,,,IIIIIIIIIFFFFFFFFFFFFFFFF

Line 1: start symbol (@), sequence ID, adaptor sequence
Line 2: sequence
Line 3: separator line
Line 4: encoded quality values, one symbol per nucleotide
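
The four-line record layout can be read with a few lines of code. The sketch below is a minimal parser that assumes each record occupies exactly four lines with no wrapped sequences. The names `FastqRecord` and `parse_fastq` are hypothetical, and the demo uses a made-up record rather than the one above, whose quality string is elided.

```python
import io
from typing import NamedTuple

class FastqRecord(NamedTuple):
    read_id: str   # header text after the '@' start symbol
    sequence: str  # nucleotide sequence
    quality: str   # encoded quality values, one symbol per nucleotide

def parse_fastq(handle):
    """Yield FastqRecord objects from a text stream of four-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) ends parsing
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)

# Made-up two-read example.
demo = io.StringIO("@read1\nACGT\n+\nII5!\n@read2\nGG\n+\n??\n")
for record in parse_fastq(demo):
    print(record.read_id, record.sequence, record.quality)
```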



FASTQ Quality Encoding
• Letters and symbols are used to represent numbers

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
|         |         |         |         |
Q0        Q10       Q20       Q30       Q40
   Bad     Maybe OK     Good    Excellent
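
Decoding a quality symbol takes two steps: subtract an offset from its ASCII code to get Q, then convert Q to an error probability with p = 10^(-Q/10). The sketch assumes the offset is 33, which is what the scale above implies because "!" (ASCII 33) maps to Q0. The function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Convert an encoded quality string into a list of integer Q scores."""
    return [ord(symbol) - offset for symbol in quality]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

# One symbol from each tick mark on the scale above.
for symbol, q in zip("!+5?I", phred_scores("!+5?I")):
    print(symbol, q, error_probability(q))
```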



FastQC
• FastQC is used to assess the quality of sequence read data

[Figure: quality plots for a good run and a bad run; the y-axis is the quality score (higher is better)]



Quality Trimming & Adaptor Clipping
Quality Trimming
• Remove low quality sequences
• Q = 13 corresponds to a 5% error (p=0.05)
• Q = 0..13 encoded by: !"#$%&'()*+,-.
• Trim using window moving average
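
A moving-average trim can be sketched as follows: slide a fixed-size window along the read and cut at the first window whose mean quality drops below the threshold. The window size of 4, the Phred+33 offset and the function name are assumptions for illustration. Real trimming tools differ in details such as where exactly they cut.

```python
def sliding_window_trim(sequence, quality, window=4, min_mean_q=13, offset=33):
    """Cut the read at the start of the first window whose mean Q is too low.

    Returns the (sequence, quality) pair truncated at the cut point.
    """
    scores = [ord(symbol) - offset for symbol in quality]
    if len(scores) < window:
        # Too short for a full window: keep or drop the whole read.
        keep = bool(scores) and sum(scores) / len(scores) >= min_mean_q
        return (sequence, quality) if keep else ("", "")
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_mean_q:
            return sequence[:start], quality[:start]
    return sequence, quality

# Six good bases (Q40) followed by four bad ones (Q0).
print(sliding_window_trim("ACGTACGTAC", "IIIIII!!!!"))
```

Cutting at the start of the failing window is deliberately conservative: it can discard one or two good bases just before the quality drop.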

Adaptor Clipping
• Align 3’ and 5’ ends of reads against all adaptor sequences
• If a match is found, the read is trimmed
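
Adaptor clipping at the 3' end can be sketched with exact string matching: look for the full adaptor inside the read, then for a partial adaptor hanging off the read's end. Production tools use error-tolerant alignment and also check the 5' end, so this is only a simplified model. The function name, the `min_overlap` parameter and the adaptor string in the demo are illustrative assumptions.

```python
def clip_3prime_adaptor(read, adaptor, min_overlap=3):
    """Remove a 3' adaptor (exact matches only) and everything after it."""
    # Case 1: the whole adaptor occurs inside the read.
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends part-way through the adaptor. Find the
    # longest adaptor prefix that equals a suffix of the read.
    for k in range(min(len(adaptor) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:len(read) - k]
    return read

ADAPTOR = "AGATCGGAAGAGC"  # example adaptor string, chosen for illustration
print(clip_3prime_adaptor("ACGTAGATCGGAAGAGCTTT", ADAPTOR))  # full adaptor
print(clip_3prime_adaptor("ACGTACGTAGATCGGAAG", ADAPTOR))    # partial adaptor
```

The `min_overlap` threshold trades sensitivity against the risk of clipping genuine sequence that matches the first few adaptor bases by chance.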



Aims of Read Processing
• To remove sequences which do not belong to the genome
• Adaptors
• DNA from spike-in controls (e.g. PhiX174)
• To remove low quality reads
• To trim reads to remove low quality base calls at the 5’ or 3’ ends
• This process is important
• Garbage in = garbage out



Genome/Sequence Assembly
• The computational process of putting nucleotide sequences into the correct order
• Basic problem
• Genome: 2Mb
• Sequencing technology read length: 100 bp – 50kb
• How do we go from fragments to longer regions?

[Figure: genome, reads, contigs, scaffolds]

De novo assembly
• De novo assembly is the process of reconstructing the original DNA sequence
using only the read sequences
• Like a jigsaw puzzle
• Involves finding overlaps between reads
• Sequencing errors can impair our ability to do this
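
The overlap idea can be illustrated with a toy greedy assembler: repeatedly merge the pair of reads with the longest exact suffix-to-prefix overlap. This is a teaching sketch, not how production assemblers work, and it assumes error-free reads from one strand. The function names and the `min_length` parameter are hypothetical. The demo reads are cut from the 20-base template shown on the pyrosequencing slide.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_length=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_length)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

reads = ["ACCTTGAGTA", "GAGTACCATC", "CCATCTAGGA"]
print(greedy_assemble(reads))
```

A single sequencing error inside an overlap region would stop these exact matches from being found, which is the problem the last bullet describes.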



Reference-based assembly
• The reference-based assembly approach involves mapping each read to a
reference genome sequence
• Very useful for identification of genetic variations:
• SNPs, indels, copy number variants
• Needs a good quality a priori genome sequence
• Reads which do not map to the reference genome need to be examined



Reference-based assembly
• Read mapping is the process of
aligning reads to a reference
genome
• Mapping allows mismatches,
indels and clipping to some
degree
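
A brute-force version of read mapping makes the mismatch tolerance concrete: try every position in the reference and keep the placement with the fewest mismatches. The sketch handles mismatches only (no indels or clipping) and would be far too slow for real genomes, where indexed aligners are used instead. The function name and the `max_mismatches` parameter are hypothetical.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) for the best ungapped placement, or None.

    Positions are 0-based. Ties go to the leftmost position.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for r, g in zip(read, window) if r != g)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

reference = "ACCTTGAGTACCATCTAGGA"
print(map_read("GAGTACC", reference))  # exact match
print(map_read("GAGTTCC", reference))  # one mismatch, e.g. a SNP or an error
print(map_read("TTTTTTT", reference))  # unmapped read
```

Reads that return `None` here correspond to the unmapped reads that, as noted on the previous slide, need to be examined separately.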



Assembly Metrics
• Genome assembly
• Total length similar to genome size
• Aiming for fewer, larger contigs
• Assess correctness of contigs
• Metrics
• Coverage
• Fraction of the genome sequenced by at least one read
• Depth
• Average number of reads that cover any given region
• Maximum and minimum contig lengths
• N50

[Figure: reads aligned to the genome, showing regions of high coverage and low coverage]


Summary
• Advances in genome sequencing technologies continue apace
• These technologies can be broadly categorised as
• Short or long read
• Sequencing by synthesis or direct sequencing
• All HTS require significant work to create sequencing libraries
• HTS is the dominant method used in sequencing, but more than 40 years after its
introduction, Sanger sequencing is still useful
