
Mid sem total portion

Ppt 1
Intro to bio
Intelligence in Biological Systems - 3

19BIO201
L-T-P-C:1-2-0-3
Syllabus
• Assembling genome using graph algorithm
• The string reconstruction problem
• String reconstruction as a walk in the overlap graph
• Gluing graphs
• de Bruijn graph
• The seven bridges of Königsberg - Euler’s theorem
• From Euler’s theorem to an algorithm for finding an
Eulerian cycle
• Assembling genomes from read pairs
• Python programming for bioinformatics
Course Objective
• To introduce the basic concepts of
bioinformatics using computational
methods
• To introduce programming for
bioinformatics
• To explore the challenges and the
potential of artificial intelligence in
bioinformatics
Course Outcome
• To understand basics of assembling
genome
• To learn Python programming for
bioinformatics
• To explore potential challenges and
applications of computational
bioinformatics
Evaluation Pattern
• Assignments: 10 x 4.5= 45
• Quizzes: 5 x 5 = 25
• Project: 30
Gregor Mendel – Father of Genetics
• Law of Segregation

• Law of Independent
Assortment

https://www.youtube.com/watch?v=Mehz7tCxjSE
Genetics
• Each species has a blueprint of its life which is different from that of other species
• The individuals in a species are similar yet show differences
• The blueprint is inherited from one generation to the next
• Many traits are also influenced by the environment

Inherited traits are determined by the elements of heredity that are transmitted from parents to offspring in reproduction; these elements of heredity are called genes.
Can Disease be Inherited too?
• The Black Urine Disease
• Reported as early as
1649
• British physician
Archibald Garrod
realized that certain
heritable diseases
followed the rules of
transmission as
described by Mendel
Genetic Disease

Amyotrophic lateral sclerosis, or ALS
Where is the gene?

And if the sequence has 4.6 x 10^7
Role of Computer Scientist
Important Milestones
• DNA first isolated (as “nuclein”) 1869 – Johann Friedrich Miescher
• Genes on chromosomes are the discrete units of
heredity 1911 – Thomas Hunt Morgan
• Genes make proteins 1941 – George Beadle and
Edward Tatum
• Cytosine complements Guanine and Adenine
complements Thymine 1950 – Edwin Chargaff
• Double helical structure of DNA 1952-1953 – James D. Watson, Francis H. C. Crick & Rosalind Franklin
Genome Sequencing
• Isolation of the first restriction enzyme 1970 – Hamilton Smith
• Discovery of reverse transcriptase 1970 – Howard Temin and David Baltimore

• First human chromosome (number 22) sequenced - 1999

• Human Genome Project Completed. April 2003


Course Objective
To understand
computational
techniques
involved in
genome
sequencing
The Bio aspects of the course
Deals with
• The structure of genetic material – DNA, Gene
• Terms involved
• Replication
• Mutation
• Experimental sequencing techniques
• Polymerase Chain Reaction.
Genome
• Coined by Hans Winkler 1920
• Refers to “the Complete Genetic Material of
an organism”
• Complete DNA sequence of one set of chromosomes
• Encompasses both coding and the non-coding
sequence of DNA

https://www.nature.com/scitable/topicpage/dna-sequencing-technologies-key-to-the-human-828/
(extra reading material)
DNA
Nitrogenous Bases
Molecular basis of DNA Structure
• Polynucleotide chain – sugar-phosphate backbone with a nitrogenous base attached to each sugar.
• A nucleotide has three elements – phosphate, pentose sugar, nitrogenous base
• Pairing between nitrogenous bases is not chemical….
• Erwin Chargaff: A pairs with T (2 H bonds) & C pairs with G (3 H bonds)
• Base composition differs among species but is constant in all cells of an organism and within a species.
Human being – A 29.8, T 31.8, G 20.2, C 18.2, G+C 38.4 (base composition as % of total bases)
• C – G pairing is stronger than A – T (the amount of heat required to separate the DNA strands increases with G + C content)
• A + G = C + T, that is, purines = pyrimidines
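The base composition figures above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `base_composition` and `gc_content` and the 10-base example strand are assumptions chosen for illustration. The last lines only re-add the human percentages quoted above.

```python
from collections import Counter

def base_composition(seq):
    """Return each base (A, T, G, C) as a percentage of the total bases."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ATGC")
    if total == 0:
        raise ValueError("sequence contains no A, T, G or C")
    return {b: 100.0 * counts[b] / total for b in "ATGC"}

def gc_content(seq):
    """Return G + C as a percentage of the total bases."""
    comp = base_composition(seq)
    return comp["G"] + comp["C"]

# Hypothetical 10-base strand, used only to exercise the functions.
example = "ATGCGCATTA"
print(base_composition(example))
print(gc_content(example))

# Human percentages quoted above: purines (A + G) versus pyrimidines (C + T).
human = {"A": 29.8, "T": 31.8, "G": 20.2, "C": 18.2}
print(round(human["A"] + human["G"], 1), round(human["C"] + human["T"], 1))
print(round(human["G"] + human["C"], 1))
```

Note that purines = pyrimidines is a property of double-stranded DNA; a single arbitrary strand such as the example need not satisfy it.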
DNA …..
• Read in the 5’ to 3’ direction. These labels indicate the free carbon on the sugar-phosphate backbone (the 5’ end has a terminal phosphate group & the 3’ end has a free OH)
• Length is measured in base pair units (bp units)
• The 3.2 billion base pair human genome has been sequenced
The Double Helix Structure
• The helix is twisted in the right-hand direction
• Each turn measures 34 angstrom
• Bases are spaced 3.4 angstrom apart
• There are ten base pairs in each helical turn
• Bases are perpendicular to the sugar-phosphate backbone but stacked parallel to each other
• Two grooves, the major and the minor groove, appear on the helix. These provide binding sites for proteins.
https://www.youtube.com/watch?v=ThG_02miq-4
Ppt 2
DNA structure
DNA - Code for Life
DNA
Nitrogenous Bases
Molecular basis of DNA Structure
• Polynucleotide chain – sugar phosphate back bone
having nitrogenous base attached to it.
• Phosphodiester Bond – the backbone of DNA
• Nucleotide has three elements – phosphate,
pentose sugar, nitrogenous base
• Pairing between Nitrogenous base is not chemical….
• Base Stacking – Allows millions of base pairs to lie one above the other
Chargaff Rule
• Erwin Chargaff: A pairs with T & C pairs with G
• C – G pairing is stronger than A – T (the amount of heat required to separate the DNA strands increases with G + C content)
• Base composition – % of G + C in terms of % of total bases; differs among species but is constant in all cells of an organism and within a species.
Human being – A 29.8, T 31.8, G 20.2, C 18.2, G+C 38.4
• A + G = C + T, that is, purines = pyrimidines
Evaluate yourself
1. The base sequence for complete set of
chromosomes --------
2. Nucleotide is made of --------, ----------- & -------
3. Phosphodiester bond is the bond between ----
4. Purines and pyrimidines are ------------
5. Number of hydrogen bonds between adenine and thymine -----------
6. Arrangement of base pairs in DNA ---------
7. The base composition for a species remains --------
Answers
1. Genome
2. Pentose sugar, phosphate group, nitrogenous
base
3. Pentose sugar & phosphate group
4. Nitrogenous bases
5. 2
6. Base stacking
7. constant
Watson Crick Model
The Double Helix Structure
• Right hand twisted helix
• Each turn measures 34 Å
• Bases are spaced at 3.4 Å
• Ten base pairs in each helical turn
• Bases are perpendicular to the sugar
phosphate backbone but stacked parallel to
each other
• Grooves - The major & the minor groove,
provide binding site for the proteins.
• Read in the 5’ to 3’ direction. The labels indicate the free carbon on the sugar-phosphate backbone (5’ has a terminal PO4 & 3’ has a free OH)
• DNA length is measured in base pair (bp) units
• 1 kb = 1000 bp
• Shortest human DNA: 4.6 x 10^7 bp
• The sugars lie above and below the plane containing the base pair
Some facts…..
• DNA is a very dynamic molecule
• Satisfies the criteria for genetic material
- can make a copy of itself
- should code for life
- allow for changes in progeny
Points to ponder…..

• An important part of understanding biology is learning its language.
• Biologists, like many scientists, use technical
terms in order to be precise about reference.
• Getting a grasp on this terminology makes a
great deal of the biological literature
accessible to the non-specialist
Terms learnt so far…..
• Genome
• Nucleotide
• Nitrogenous bases
• Purines & pyrimidines
• Base pair
• Base stacking
• Single strand
• Double strand
• Helix
• Groove
• Progeny
• Reverse and forward strand
• 5’ & 3’
Ppt 3
Replication and intro to sequencing
Previous concepts
• The building blocks of DNA are nucleotides
• Double helix
• The two strands are antiparallel
• Millions of base pairs lie one over the other
• DNA is read from 5’ to 3’
• Length is measured in base pair units
• 1 kb = 1000 base pairs
Organization of DNA
• A DNA strand is longer than the nucleus
• The smallest DNA is 14000 μm long.
• The average size of a nucleus is 6 μm
• DNA is packed as chromosomes
• Packing ratio – length of DNA / length of chromosome
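The length figures on this slide follow from the helix geometry given earlier (bases spaced 3.4 angstrom apart). The sketch below is an illustration under assumptions: the function names are invented here, and the 2 μm chromosome length used for the packing ratio is a hypothetical value, since the slides do not give one. For the shortest human DNA (4.6 x 10^7 bp) the estimate comes out at roughly 15600 μm, the same order as the 14000 μm quoted above.

```python
RISE_PER_BP_ANGSTROM = 3.4      # spacing between stacked bases (from the slides)
ANGSTROM_TO_MICROMETRE = 1e-4   # 1 angstrom = 1e-10 m = 1e-4 micrometre

def dna_length_um(n_bp):
    """Length in micrometres of a stretched double helix with n_bp base pairs."""
    return n_bp * RISE_PER_BP_ANGSTROM * ANGSTROM_TO_MICROMETRE

def packing_ratio(dna_length, chromosome_length):
    """Packing ratio = length of DNA / length of chromosome (same units)."""
    return dna_length / chromosome_length

shortest = dna_length_um(4.6e7)   # shortest human DNA, 4.6 x 10^7 bp
print(round(shortest), "micrometres of DNA versus a 6 micrometre nucleus")

# Hypothetical chromosome length of 2 micrometres, for illustration only.
print(round(packing_ratio(shortest, 2.0)))
```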
Replication
• Biological process of
producing two identical
replicas of DNA from one
original molecule.
• DNA makes a copy of itself
• Each strand acts as a template
• Can also be performed in vitro
• One double-stranded molecule is converted into two identical double-stranded DNA molecules
Essentials
• A parent strand as template
• Nucleotides containing bases adenine,
guanine, cytosine & thymine
• RNA primer – oligonucleotide containing up to 30 bp
• DNA polymerase
• Some proteins and enzymes like helicase,
ligase
Primer
Replication begins at the 3’ end of each template strand
Bases are added one at a time
The process continues till the strand is completed
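The copying rule described above (start at the 3' end of the template, add one complementary base at a time) can be sketched as follows. The function name `replicate` is an assumption made for illustration; primers and enzymes are not modelled.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def replicate(template_5to3):
    """Return the newly synthesized strand, written 5'->3'.

    The template is read from its 3' end, and one complementary
    base is added to the growing strand at each step.
    """
    new_strand = []
    for base in reversed(template_5to3.upper()):   # walk the template 3'->5'
        new_strand.append(COMPLEMENT[base])        # add one base at a time
    return "".join(new_strand)

template = "ATGCC"           # hypothetical template, written 5'->3'
print(replicate(template))   # the reverse complement of the template
```

Replicating the new strand again gives back the original template, which is why each strand can act as a template for the other.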
Questions
• The process of replication begins from ----------
end of DNA
• The enzyme which helps in extension of strand
is -----------
• Primer is essential to ------------------
• Nucleotides are of ----------- types
DNA Sequencing
• Process to find base pair sequence in a given
DNA
• The DNA present in every cell of an organism is the same
• Genome sequence is the sequence of all DNA
present in an organism’s cell
• Generally the germ cell (egg or sperm) is
considered because it is haploid
Why sequencing
• DNA sequences required for
• Basic biological research
• Research in evolutionary pattern
• Applied fields such as medical diagnosis, biotechnology, forensics, virology
Methods
• Classical methods – Maxam-Gilbert method, Sanger method
• Accurate
• Slow, read short fragments, expensive, use harmful chemicals.
• De novo or next-gen sequencing – shotgun method, Illumina sequencing methods
• Faster, longer fragments can be read, lower accuracy, cheaper
• Isolation of DNA/specific portion
• Shear into pieces/fragments
• Amplify
• Labeling
• Sequencing fragments
• Generate reads
• Assemble reads in order
Challenges
• Read size
• Time
• Accuracy
• Cost
Ppt 4
genome sequencing
Genome Sequencing
Timeline of large-scale genomic DNA Sequencing
• Isolation of DNA/specific portion
• Shear into pieces/fragments
• Amplify
• Labeling
• Sequencing fragments
• Generate reads
• Assemble reads in order
Isolation of DNA
• Lyse – Breaking the cell
membrane
• Bind – Binding of nucleic
acid to silica gel membrane
• Wash - washing the nucleic
acid bound to the silica gel
membrane to remove
impurities
• Elute – Removing the DNA
from silica gel membrane
https://www.youtube.com/watch?v=qfa0hi6s35E
Fragmentation
• Longer sequences are subdivided into smaller fragments.
• Sequencing can only be performed for fairly short strands (100 to 5000 base pairs)
• Quality of the base identification decreases with length.
• Three methods – physical, chemical and enzyme-assisted.
Physical method
• Physical methods such as ultrasonication and acoustic shearing use sound waves of different frequencies to shear DNA.
• The fragments obtained are nonspecific.
• Hydrogen bonds in the double helix as well as oxygen-carbon bonds are broken.
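Nonspecific shearing can be imitated by cutting a sequence at random positions. This is a rough sketch under assumptions: the function name `shear`, the length bounds and the example sequence are hypothetical, and a seeded random generator stands in for the physical process.

```python
import random

def shear(seq, min_len, max_len, seed=0):
    """Cut seq into consecutive fragments of random length (nonspecific ends)."""
    if min_len < 1 or max_len < min_len:
        raise ValueError("need 1 <= min_len <= max_len")
    rng = random.Random(seed)          # seeded so that runs are repeatable
    fragments = []
    start = 0
    while start < len(seq):
        size = rng.randint(min_len, max_len)
        fragments.append(seq[start:start + size])
        start += size
    return fragments

dna = "ATGCCGTAAGTCGGATCCATTGACCGTA"   # hypothetical sequence
pieces = shear(dna, min_len=4, max_len=8)
print(pieces)
```

Because the cut positions are random, the fragment ends are not known in advance, which is the contrast with enzyme-assisted fragmentation.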
Enzymatic Fragmentation
• Enzyme-assisted fragmentation uses endonuclease enzymes.
• These enzymes are involved in bacterial defense mechanisms.
• Extracted from several bacteria.
• These enzymes are site specific, so the sequence at the ends of the fragments is known.
• Can cut both strands at the same position or leave sticky ends
• The DNA fragments produced are called restriction fragments
• The human genome produces millions of fragments
• Fragments are separated by electrophoresis
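Site-specific cutting can be sketched as a search for a recognition sequence. The example uses EcoRI, which recognizes GAATTC and cuts after the G; the function name `digest`, its parameters and the example sequence are assumptions for illustration, and only one strand is modelled (the offset cut on the opposite strand is what leaves the sticky end).

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Cut seq at every occurrence of site.

    cut_offset is how many bases into the site the cut falls
    (1 for EcoRI: G^AATTC). Returns the restriction fragments in order.
    """
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

dna = "AAGAATTCGGGAATTCTT"   # hypothetical sequence with two EcoRI sites
print(digest(dna))
```

Every internal fragment here starts with AATTC and ends with G, so the ends of the fragments are known, as the slide states.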
Enzymes and fragmentation
Amplification
• Process to increase the number of fragments
• Helps in isolation & identification of fragments
by gel electrophoresis
• Polymerase chain reaction (PCR)
• Cloning (recombinant DNA technology)
PCR
• The method for selective amplification is called
the polymerase chain reaction (PCR).
• Kary B. Mullis received Nobel Prize in 1993.
• PCR amplification requires DNA polymerase, a
pair of short, synthetic primers & nucleotides
• DNA polymerase obtained from heat resistant
bacteria
• Primers complementary in sequence for each
strand of the fragment.
• The three steps are denaturation, annealing and elongation
• Denaturation is the unzipping of DNA. Temperature 95 °C
• Annealing of primers at a temperature of 50-60 °C
• Elongation of the chain in the 5' to 3' direction by addition of nucleotides
• After the first cycle there is a pair of parent strands and a pair of synthetic strands
• At the end of the 25th cycle there are 3.4 x 10^7 fragments.
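The figure for the 25th cycle follows from doubling: each cycle at most doubles the number of copies, so n cycles give 2^n copies per starting molecule. A minimal sketch, assuming perfect doubling unless a lower efficiency is passed in (the `efficiency` parameter is an assumption added here, not something from the slides):

```python
def pcr_copies(cycles, starting_molecules=1, efficiency=1.0):
    """Copies after a number of PCR cycles.

    efficiency = 1.0 means every molecule is duplicated in every cycle.
    """
    return starting_molecules * (1 + efficiency) ** cycles

copies = pcr_copies(25)
print(f"{copies:.1e}")   # about 3.4 x 10^7, as quoted on the slide
```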
https://www.youtube.com/watch?v=ThG_02miq-4
DNA Cloning
• Plasmids are special DNA molecules found in certain bacteria
• A selected fragment of DNA can be inserted into the plasmid by a cut and paste mechanism
• Uses restriction enzymes.
• The recombinant DNA is amplified when inserted into bacteria.
Amplification and isolation

• The fragments of DNA are isolated by lysis with restriction enzymes.
• The fragments obtained are used for further analysis.
Additional information……
Sequencing
• Identification of base sequences in the fragments
• Methods like Sanger sequencing, pyrosequencing, Illumina sequencing etc. are employed
• The Sanger method was used to sequence the first genome
Ppt 5
Repeats
Review….
• The genome is the complete DNA sequence of one set of chromosomes
• Contains both the coding and the non-coding sequence of DNA
• DNA is a polymer called a polynucleotide
• A nucleotide unit is made of a phosphate group, a pentose sugar (deoxyribose) and a nitrogenous base
• The phosphate can attach to the sugar at the 3’ or 5’ position
• The nitrogenous base is always at the 1’ position
• Two polynucleotide chains make a DNA molecule (double helix structure)
• The two chains are antiparallel – the 5’ end of one chain pairs with the 3’ end of the other
• Chargaff rule – A pairs with T & G pairs with C
• Base composition – % G + C in a genome is fixed for a species
• Base stacking – allows millions of base pairs to lie one above the other
• DNA length is measured in base pair (bp) units (1 kb = 1000 bp)
• Both the reverse and the forward strand are read from the 5’ end
• DNA is a very dynamic molecule
• Satisfies the criteria for genetic material - makes a copy of itself, codes for life, allows for changes
• DNA is packed as chromosomes
• Packing ratio – length of DNA / length of chromosome
• Replication – biological process by which DNA makes a copy of itself
• Each strand acts as a template
• Can also be performed in vitro
Essentials for replication
• A parent strand as template
• Nucleotides containing bases adenine, guanine,
cytosine & thymine
• RNA primer – oligonucleotide containing up to 30 bp
• In vitro synthesis - DNA primer is used
• DNA polymerase
• Some proteins and enzymes like helicase, ligase
What does a genome look like?
Coding Regions
• Called Genes
• Roughly 20K in number
• Make up 5% of the total genome
• Eukaryotic genes contain interspersed non-coding sequences – introns
Repeated Sequences
• Function largely unknown or poorly understood – labeled as junk DNA
• Repeats can be tandem or interspersed
Tandem Repeats

• Regions include a large number of copies of a repeated DNA sequence family
• Arrays can be simple or complex
• Highly repetitive
• Known as satellite DNAs
Telomere & Centromere

• Telomeres make up the ends of chromosomes
• Base sequence (T/A)xGy
• Human telomere repeat: TTAGGG
• Usually repeated about 3,000 times and can reach up to 15,000 base pairs.
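A tandem repeat such as a telomere can be located by counting back-to-back copies of a repeat unit. This is a small sketch under assumptions: the function name `longest_tandem_run` and the toy sequence are hypothetical, and the example unit is TTAGGG, the repeat found at human telomeres.

```python
def longest_tandem_run(seq, unit):
    """Return the largest number of back-to-back copies of unit found in seq."""
    if not unit:
        raise ValueError("unit must not be empty")
    best = 0
    for start in range(len(seq)):
        run = 0
        pos = start
        while seq.startswith(unit, pos):   # count consecutive copies from here
            run += 1
            pos += len(unit)
        best = max(best, run)
    return best

# Toy chromosome end: 4 copies of the unit flanked by unrelated bases.
toy = "ACGT" + "TTAGGG" * 4 + "CCA"
print(longest_tandem_run(toy, "TTAGGG"))
```

Trying every start position is slow for long sequences but keeps the logic easy to follow.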
Interspersed Repeated Sequence
• Identified by Barbara McClintock 1951
• Sequences dispersed throughout the genome
• Linked to transposable elements in the genome
• Can move within the genome
• Two types – transposons & retrotransposons
• Transposons move from one place to another by a cut and paste mechanism
• Retrotransposons move by a copy and paste mechanism
Alu Family
5' - Part A - A5TACA6 - Part B - PolyA Tail - 3'
• Alu sequences are repetitive DNA elements
• An estimated frequency of 500 000 to 1 million copies per genome.
• Primate-specific
• Interspersed
• May serve as functional genes
• Retrotransposon-mediated reinsertion throughout the genome over 65 million years of primate evolution
Some Interesting Facts
Ppt 6
Sequencing technologies
From fragments to reads
Replication
• Replication can be achieved in
vitro if template DNA, Primer,
Polymerase enzyme and
nucleotides are available.
• Nucleotides are added
one at a time
Fragmentation and amplification
• Isolated DNA can be sheared into fragments
either with known ends or unknown ends
• The fragments can be separated into libraries
• Amplified into millions of copies by cloning or
polymerase chain reaction (PCR)
• PCR involves Denaturation, Annealing and
Elongation
Genetic map
A type of chromosome map that shows the relative locations of genes and other important features.
Knowledge of it ensures that the sequencing process can be tailored.
Genomic library
A genomic library is usually stored as a set of bacteria, each carrying a different fragment of human (or any other species) DNA.
Hybridization

Single-stranded DNA can pair up with any complementary strand. This forms the basis of sequencing techniques like Southern blotting, Illumina and DNA arrays.
Sanger Sequencing
• Dideoxy or chain termination method
• Fred Sanger, 1977, φX174 virus, 5386 bp
• Small genomes like that of a virus were successfully sequenced
• Based on in vitro DNA synthesis performed in
the presence of chain-terminating nucleotides
(dideoxy nucleoside triphosphates)
• ddNTPs: ddATP, ddGTP, ddCTP, and ddTTP
dideoxy nucleoside triphosphates
Sanger Sequencing Steps
• DNA fragments are separated into single strands
• A primer is attached to the sequence
• Polymerase solution, four types of dNTPs (nucleotides) & ddNTPs are added.
• Replication of a fragment stops as soon as one of the ddNTPs attaches.
• Fragments are separated by gel electrophoresis and the sequence is determined.
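The steps above can be imitated in a few lines: synthesis stops wherever a ddNTP happens to be incorporated, so the tube ends up holding every possible prefix of the new strand, and sorting those fragments by length lets the terminal bases be read off in order. This is a simplified sketch; the function name `sanger_read` is an assumption, and the primer and the four separate labels are not modelled.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sanger_read(template, seed=0):
    """Recover the newly synthesized strand (5'->3') from terminated fragments."""
    # The new strand grows 5'->3' while the template is read 3'->5'.
    new_strand = "".join(COMPLEMENT[b] for b in reversed(template.upper()))
    # A ddNTP can stop synthesis after any base, so every prefix occurs.
    fragments = [new_strand[:n] for n in range(1, len(new_strand) + 1)]
    random.Random(seed).shuffle(fragments)   # fragments are mixed in the tube
    # Gel electrophoresis orders the fragments from shortest to longest.
    fragments.sort(key=len)
    # The labelled terminal base of each fragment spells the new strand.
    return "".join(fragment[-1] for fragment in fragments)

print(sanger_read("ATGCC"))   # hypothetical template, written 5'->3'
```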
Fragment and ddNTP
• Very accurate, 99.99% base accuracy
• Considered “gold standard” for validating
DNA sequences
• Smaller genomes can be sequenced cost effectively
• Cheap: $3 per cycle
• Most widely used sequencing method for approximately 40 years
• Maximum read length up to 500 bp
• Slow, becomes expensive for larger and more complex genomes
DNA arrays
• Radoje Drmanac, Andrey Mirzabekov, and
Edwin Southern, 1988
• Goal of cheaply generating a genome’s k-mer
composition
• A smaller read length, k approx. 10 bp
Working DNA arrays
• First synthesize all 4^k possible DNA k-mers. For k-mers of length 10 bp, the number of possible k-mers to be synthesized is 4^10 = 1,048,576
• Attach them to a DNA array, which is a grid on which each k-mer is assigned a unique position
• A solution containing the unknown single-stranded DNA fragment, carrying a fluorescent label, is applied to the DNA array.
• The unknown fragment hybridizes with the complementary k-mers on the array
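The count of array positions can be checked directly: with four bases and k positions there are 4^k possible k-mers. The sketch below enumerates them with `itertools.product`; the function name `all_kmers` is an assumption, and a small k is used for printing because the k = 10 array has over a million entries.

```python
from itertools import product

def all_kmers(k):
    """Return every DNA string of length k, in alphabetical order."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]

print(all_kmers(2))         # the 16 possible 2-mers
print(len(all_kmers(3)))    # 4^3
print(4 ** 10)              # number of spots needed for k = 10
```

Each k-mer's index in this list could serve as its unique position on the grid.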
Reading the array
• Fluorescence is analyzed with spectroscopy
• The reverse complements of the k-mers corresponding to the fluorescing sites belong to the (unknown) DNA fragment
• The k-mers on the array thus reveal the sequence of the DNA fragment
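Reading the array amounts to two small operations: the fragment lights up the spots holding the reverse complements of its k-mers, and taking reverse complements of the lit spots recovers the fragment's k-mers. The function names and the example fragment below are assumptions for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def kmer_composition(fragment, k):
    """All k-mers of the fragment (with repeats), sorted alphabetically."""
    return sorted(fragment[i:i + k] for i in range(len(fragment) - k + 1))

def lit_spots(fragment, k):
    """Array k-mers that hybridize with the fragment and therefore fluoresce."""
    return {reverse_complement(kmer) for kmer in kmer_composition(fragment, k)}

def kmers_from_spots(spots):
    """k-mers of the unknown fragment, recovered from the fluorescing spots."""
    return {reverse_complement(spot) for spot in spots}

fragment = "TATGGGGTGC"   # hypothetical unknown fragment
print(kmer_composition(fragment, 3))
print(sorted(kmers_from_spots(lit_spots(fragment, 3))))
```

A spot either fluoresces or not, so a k-mer that occurs twice in the fragment (GGG here) is recovered only once; putting the k-mers back in order is the string reconstruction problem listed in the syllabus.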
Array technology application
• Could not be applied to genome sequencing because the fidelity of DNA hybridization with the array was too low and the value of k was too small
New Application
• Arrays are used to measure gene expression
• Analyze genetic variations.
• DNA arrays are a multi-billion dollar industry that included Hyseq, founded by Radoje Drmanac (one of the original inventors of DNA arrays).
NGS – Next generation sequencing
• Illumina
• Nanopore techniques
Sample preparation
1. Breaking and adaptor attachment
2. Fragment with adaptor
3. Attachment of bridging group
Cluster Generation
• Clustering is an amplification process in the flow cell
• Each fragment is first attached to the glass channels of a flow cell and then amplified into millions of copies
Sequencing
• Begins by adding one nucleotide at a time, each of which generates a signal
• Reads are generated for the forward as well as the reverse strand
• Illumina generates paired reads, that is, two reads for each fragment
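A paired read can be pictured as one read taken from each end of the same fragment: the forward read from the 5' end of the forward strand, and the reverse read from the 5' end of the reverse strand. The sketch below is an idealized illustration; the function name `paired_reads`, the read length and the fragment are assumptions, and sequencing errors are ignored.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def paired_reads(fragment, read_len):
    """Return (forward_read, reverse_read) for one fragment, both written 5'->3'."""
    forward = fragment[:read_len]                       # start of the forward strand
    reverse = reverse_complement(fragment)[:read_len]   # start of the reverse strand
    return forward, reverse

fragment = "ATGCCGTAAGTC"   # hypothetical fragment, forward strand 5'->3'
print(paired_reads(fragment, 4))
```

The reverse read is the reverse complement of the last `read_len` bases of the fragment, so the two reads bracket a stretch of unread sequence whose approximate length is known from the fragment size.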
Data Analysis
• Preliminary data analysis is done
• Data is locally clustered based on indices given to
each cluster
• Contiguous sequences (contigs) are prepared
• Contigs are aligned to a reference genome for verification and identification.
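The last two steps can be illustrated on a toy scale: overlapping reads are merged into a contig, and the contig is then located in a reference. This is a deliberately naive sketch, not how production assemblers or aligners work: the function names are invented here, the reads are assumed to be error-free and already in the right order, and exact string search stands in for alignment.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def build_contig(ordered_reads):
    """Merge reads, given in left-to-right order, into one contig."""
    contig = ordered_reads[0]
    for read in ordered_reads[1:]:
        contig += read[overlap(contig, read):]   # append only the new bases
    return contig

reads = ["ATGCCG", "CCGTAA", "TAAGTC"]   # hypothetical error-free reads
reference = "GGATGCCGTAAGTCAA"           # hypothetical reference genome

contig = build_contig(reads)
print(contig)
print(reference.find(contig))            # position of the contig, or -1 if absent
```

Finding the right order of the reads in the first place is the assembly problem that the overlap graph and de Bruijn graph topics in the syllabus address.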
Nanopore Sequencing
• Unique & scalable technology
• Enables direct, real-time analysis of long DNA or RNA
fragments.
• It works by monitoring changes to an electrical current
as nucleic acids are passed through a protein
nanopore.
• The resulting signal is decoded to provide the specific
DNA or RNA sequence.
• Advantages: fast and cost effective; disadvantage: errors in reads
