Wk2 L6 Microbial Bioinf Tanaka-Handout

Microbiology
Microbial bioinformatics

1. What is bioinformatics? Why does microbiology need bioinformatics?
2. Bioinformatics and molecular evolution: comparing genomes of bacterial species
3. Bioinformatics and molecular epidemiology: variation within species

Prof Mark Tanaka
School of Biotechnology & Biomolecular Sciences

1. What is bioinformatics?
Wikipedia says:
"Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data."
...so, using computers to analyse biological data
Why do we need computers to analyse data?

Microbial genetics in 1980
Nature 1980
(in 1977)

Fast forward to the 21st century
All single nucleotide polymorphisms (SNPs) in the genomes of all 14 isolates
Sometimes the SNPs are in genes and sometimes they alter the translated protein

Here's a newer one with 90 whole genome sequences

Here's an even newer one with 815 whole genome sequences
They use a computational technique for summarising the large data set which has a high number of dimensions

Number of bacterial and archaeal genomes sequenced and submitted to NCBI

Explosions and floods in biology
"Most biologists are drowning in too much data, and in desperate need for tools to help them make sense of their massive amounts of sequences."
"Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. This deluge of genomic information has, in turn, led to an absolute requirement for computerized databases to store, organize, and index the data and for specialized tools to view and analyze the data."
– National Center for Biotechnology Information USA (NCBI)
Land et al 2015 Funct Integr Genomics (2015) 15:141–161

The field of microbiology has evolved so far that we can no longer approach the data without computers and sophisticated methods for data analysis
Even the design of new research buildings reflects the need for increased computational work
[Figure: research building; labels in the extracted text: "office", "labs"]

Areas to which bioinformatics has contributed
General
● Curating, storing, retrieving biological data
Proteins
● predicting and visualising protein structure using data and algorithms
● predicting and visualising interaction between proteins
● predicting and visualising interaction between proteins and other molecules
Park et al 2008 Nature 454, 183-187

Nucleotide sequences
● assembling sequence fragments to obtain the whole genome sequence
● finding genes and motifs in sequences using prediction algorithms
● annotating genomes
● comparing sequences to identify and measure variation
● analysing variation data using computational models
E.g. understanding microbial variation using computational models of population genetics and evolution

Gene and protein expression and regulation
● analysing data from transcriptome experiments
● analysing data from high-throughput mass spectrometry
● modelling gene regulation in the cell as control circuits
Other stuff
● analysing images
● mining text
Facciotti et al 2007 PNAS 104:4630–4635

Bioinformatics
● Supports molecular and cellular biology with computational tools for data analysis
● Large field that covers most of molecular and cellular biology (bioinformaticians usually don't know everything about all these topics)
● Rapidly evolving field
Here's the table of contents from the Wikipedia site on bioinformatics

Biology is complex and Bioinformatics is full of complex computational tools
Sometimes, multiple tools are used in a pipeline
Can we be confident that each step does what it is supposed to do?
Be a critical thinker!

2. Bioinformatics and molecular evolution: comparing genomes of bacterial species
https://www.flickr.com/photos/arkadyevna/227697075

Sequence data → assembled genome → annotation
A lot of information even in a single genome
Perna et al 2001 Nature 409: 529-533

Coding sequences and intergenic regions
On average 87% of bacterial genomes are protein-coding
Typical range: 85-90%
Some genomes are undergoing decay and have a large number of pseudogenes
Perna et al 2001 Nature 409: 529-533

How many genes do we find across genomes from the same species?
Pan-genome: all the genes you can find across genomes in a species
Core genome: genes shared by all genomes in a species
Land et al 2015 Funct Integr Genomics (2015) 15:141–161

How can we use genetic information to establish evolutionary relationships among species?
Salmonella enterica, Mycobacterium tuberculosis, Escherichia coli
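
The pan-genome and core genome definitions given earlier map directly onto set union and set intersection. The following is a minimal sketch under stated assumptions: the isolate names and gene names are hypothetical placeholders, and real analyses first have to decide which genes in different genomes count as "the same gene".

```python
# Hypothetical gene content for three genomes of one species.
# The isolate and gene names are placeholders, not real annotations.
genomes = {
    "isolate_1": {"geneA", "geneB", "geneC", "geneD"},
    "isolate_2": {"geneA", "geneB", "geneC", "geneE"},
    "isolate_3": {"geneA", "geneB", "geneF"},
}

# Pan-genome: every gene found in at least one genome (set union).
pan_genome = set().union(*genomes.values())

# Core genome: genes present in every genome (set intersection).
core_genome = set.intersection(*genomes.values())

print(len(pan_genome), "genes in the pan-genome")
print(sorted(core_genome), "form the core genome")
```

Adding more genomes can only grow the pan-genome or leave it unchanged, and can only shrink the core genome or leave it unchanged.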

A phylogeny is an evolutionary tree showing the relationships among taxa (often species)
Use genetic information to infer the phylogeny

Statistical calculations can be used for phylogenetic inference
Which tree shows the true relationships among the three taxa?
[Figure: three candidate trees, with tips ordered Se Mt Ec, Mt Se Ec and Ec Mt Se]
Using an evolution model, compute the likelihood of each scenario
For a given tree, the likelihood is the probability of observing the genetic sequences according to a model of evolution
[Figure: the three candidate trees with likelihoods L1, L2 and L3]
Select the tree with the highest likelihood. This is called the maximum likelihood tree
Finding high likelihood trees can be computationally challenging if there are many species.
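
One way to see why the search gets hard with many species is to count the candidate trees. The sketch below uses a standard combinatorial result that is not stated in the handout: for n labelled taxa there are (2n - 3)!! = 1 × 3 × 5 × ... × (2n - 3) distinct rooted, bifurcating trees. The function name is an illustrative choice.

```python
def count_rooted_trees(n_taxa: int) -> int:
    """Number of distinct rooted, bifurcating trees for n_taxa labelled tips.

    Uses the formula (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3).
    """
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (an empty product for n <= 2).
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count


for n in (3, 4, 5, 10, 20):
    print(f"{n} taxa: {count_rooted_trees(n)} possible rooted trees")
```

For three taxa this gives the three candidate trees considered in the slide; for ten taxa it is already more than 34 million, so evaluating the likelihood of every tree quickly becomes impractical.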

Can we use genetic sequences to establish evolutionary relationships among all cellular lifeforms?
Many genes are not shared by all bacteria
Is there a genetic sequence shared by all cellular lifeforms that evolves slowly enough to be used to infer a tree for all cellular life?
16S ribosomal RNA is often used to reconstruct phylogenies

Phylogeny of bacteria using genome data
5591 sites in 31 proteins
This one was made using the maximum likelihood method. There are other methods.
Wu and Eisen 2008 Genome Biology

Distance-based methods
[Figure: aligned sequences A, B, C, D and E]
A vs B: 3 differences
A vs C: 4 differences
A vs D: 3 differences
A vs E: 4 differences
B vs C: 1 difference
B vs D: 2 differences
B vs E: 3 differences
…

Distance matrix:
     A   B   C   D   E
A
B    3
C    4   1
D    3   2   3
E    4   3   4   1
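
The pairwise counts in the distance matrix can be computed by comparing aligned sequences site by site. This is a minimal sketch: the handout does not show the underlying sequences, so the five sequences below are hypothetical, constructed only so that their differences reproduce the numbers in the slide.

```python
from itertools import combinations

# Hypothetical aligned sequences, built only so that their pairwise
# differences match the distance matrix in the slide. Not from the lecture.
sequences = {
    "A": "AGGTTG",
    "B": "ACGTAC",
    "C": "TCGTAC",
    "D": "AGTTAC",
    "E": "AGTAAC",
}


def count_differences(seq1: str, seq2: str) -> int:
    """Count the sites at which two aligned, equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


# Pairwise distances, keyed by a pair of sequence names.
distances = {
    (x, y): count_differences(sequences[x], sequences[y])
    for x, y in combinations(sequences, 2)
}

for (x, y), d in distances.items():
    print(f"{x} vs {y}: {d} difference{'s' if d != 1 else ''}")

# The closest pairs are natural candidates to be grouped first in a tree.
smallest = min(distances.values())
closest_pairs = [pair for pair, d in distances.items() if d == smallest]
print("Closest pairs:", closest_pairs)
```

Here B and C, and D and E, differ at a single site, which is why a distance-based method would tend to group those pairs together.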

Where is the root of the tree?
[Figure: trees of A, B, C, D and E, one drawn with an outgroup]

3. Bioinformatics and molecular epidemiology: Variation within species
[Figure: aligned sites in sequences from two species; labels: Substitution, Polymorphic sites]
Species 1: C A, C A, C G, C G, C G, C G
Species 2: A T, A T, A T, A T, A G, A G

Single nucleotide polymorphisms (SNPs)
Sequencing machine produces thousands of raw reads (sequence fragments)
We want to have so many sequences that each site in the genome is sequenced many times in multiple fragments

Single nucleotide polymorphisms (SNPs)
Bioinformatics software maps reads to a reference genome
[Figure: raw reads from sequencing machine aligned to the reference genome (a closely related genome that has already been sequenced and assembled)]
Some single nucleotides differ consistently over many reads that map to the same region (likely to be SNPs)
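
The idea that a SNP is a base that differs consistently across many mapped reads can be sketched as follows. This is a toy illustration, not how any particular mapping or variant-calling tool works: the reference, the reads, their mapping positions, and the thresholds `min_depth` and `min_fraction` are all hypothetical, and the reads are assumed to be already mapped without gaps.

```python
from collections import Counter

# Hypothetical reference and reads. Each read is (start position, sequence)
# and is assumed to be already mapped to the reference without gaps.
reference = "ACGTACGTAC"
mapped_reads = [
    (0, "ACGTAT"),
    (0, "ACCTAT"),  # carries an isolated mismatch at position 2
    (2, "GTATGT"),
    (3, "TATGTA"),
    (4, "ATGTAC"),
]


def call_snps(reference, mapped_reads, min_depth=3, min_fraction=0.8):
    """Return (position, reference base, read base) for likely SNPs."""
    # pileup[i] counts the bases that reads place at reference position i.
    pileup = [Counter() for _ in reference]
    for start, read in mapped_reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads cover this site to say anything
        base, n = counts.most_common(1)[0]
        # Report a SNP only if most reads agree on a non-reference base.
        if base != reference[pos] and n / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


print(call_snps(reference, mapped_reads))
```

The single mismatching read at position 2 is outvoted by the other reads covering that site, while every read covering position 5 carries the same non-reference base, so only position 5 is reported.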

Single nucleotide polymorphisms (SNPs)
Some polymorphisms in genomes are insertions, deletions or rearrangements. Others are SNPs
Some SNPs are in coding genes and others are in intergenic regions.
Perna et al 2001 Nature 409: 529-533

SNPs in coding sequences can be
Nonsynonymous (the translated protein is different)
or
Synonymous (translated protein is the same)
Nonsynonymous SNPs may be under natural selection: they may be positively or negatively selected
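
Whether a SNP in a coding sequence is synonymous or nonsynonymous can be checked by translating the affected codon before and after the change. The sketch below assumes the standard genetic code and an in-frame coding sequence; the example sequence and the function name `classify_snp` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(coding_seq: str, position: int, new_base: str) -> str:
    """Classify a single-base change in an in-frame coding sequence.

    position is 0-based and counted from the first base of the start codon.
    """
    codon_start = position - position % 3
    old_codon = coding_seq[codon_start:codon_start + 3]
    offset = position - codon_start
    new_codon = old_codon[:offset] + new_base + old_codon[offset + 1:]
    if CODON_TABLE[old_codon] == CODON_TABLE[new_codon]:
        return "synonymous"
    return "nonsynonymous"


# Hypothetical coding sequence: ATG CTT GAA (Met, Leu, Glu).
coding_seq = "ATGCTTGAA"
print(classify_snp(coding_seq, 5, "C"))  # CTT -> CTC, still leucine
print(classify_snp(coding_seq, 4, "C"))  # CTT -> CCT, leucine -> proline
```

Changes at the third position of a codon are often synonymous, as in the first call, whereas the second call changes the amino acid.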

A challenge: can we identify what forces lead to the patterns of variation we see in nature?
Phylogenetic methods can help to identify the strains that are in a patient but they don't model natural selection explicitly
Computational models of bacterial populations evolving can explore alternative hypotheses and examine which ones can explain the data

Recent work has attempted to integrate epidemiological and genomic data in a single analysis
SNPs from Mycobacterium tuberculosis
[Figure panels: Phylogeny-like tree constructed using SNPs; Network of transmission events inferred using genetic data; Network of transmission events inferred using genetic data and epidemiological information]
Didelot et al 2014 Mol. Biol. Evol. 31(7):1869–1879

Computer simulation models can check the accuracy of methods of inference using genomic data
A simulation model of an epidemic with genome evolution of pathogens within hosts
Another simulation model including contact network among hosts
Worby et al 2014 PLoS Comput Biol 10(3): e1003549
Worby CJ, Read TD (2015) PLoS ONE 10(6): e0129745
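
The logic of using simulation to check an inference method can be shown with a toy example: simulate an outbreak in which the true infector of every case is known, apply a simple inference rule to the simulated genomes, and measure how often the rule is right. This sketch is not the model of Worby et al; the outbreak model, the nearest-genome rule and all parameter names are simplifying assumptions made for illustration.

```python
import random


def simulate_outbreak(n_cases, p_mutation, rng, max_new_mutations=3):
    """Simulate who infected whom, and the genome carried by each case.

    A genome is represented as the set of mutation IDs it carries.
    """
    infector = {0: None}
    genomes = {0: frozenset()}
    next_id = 0
    for case in range(1, n_cases):
        source = rng.randrange(case)  # any earlier case may be the infector
        # Each of max_new_mutations potential mutations occurs with
        # probability p_mutation at this transmission.
        n_new = sum(rng.random() < p_mutation for _ in range(max_new_mutations))
        new_ids = range(next_id, next_id + n_new)
        next_id += n_new
        infector[case] = source
        genomes[case] = genomes[source] | frozenset(new_ids)
    return infector, genomes


def infer_infectors(genomes):
    """Guess each case's infector as the earlier case with the closest genome."""
    guesses = {}
    for case in sorted(genomes):
        earlier = [c for c in sorted(genomes) if c < case]
        if earlier:
            # Size of the symmetric difference = number of differing sites.
            guesses[case] = min(
                earlier, key=lambda c: len(genomes[case] ^ genomes[c])
            )
    return guesses


def accuracy(infector, guesses):
    """Fraction of cases whose inferred infector matches the simulated truth."""
    correct = sum(guesses[case] == infector[case] for case in guesses)
    return correct / len(guesses)


rng = random.Random(1)
for p in (0.0, 0.2, 1.0):
    truth, genomes = simulate_outbreak(30, p, rng)
    print(f"p_mutation={p}: accuracy {accuracy(truth, infer_infectors(genomes)):.2f}")
```

In this toy model the rule is always right when every transmission adds new mutations, and it degrades when genomes are often identical, because ties cannot be resolved from genetic distance alone.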

Large volumes of genome sequences present an opportunity to understand microbial evolution and epidemiology with greater resolution
A challenge for bioinformatics: building computational models for genomic epidemiology that are realistic but efficient enough to be used for inference
[Figure: a spectrum of models, from "Simple, tractable, easy to interpret / Unrealistic, simplistic" to "Complex, confusing, computationally slow / Realistic, based on known mechanisms"]

References
● Arai N et al (2018) Phylogenetic characterization of Salmonella enterica serovar Typhimurium and its monophasic variant isolated from food animals in Japan revealed replacement of major epidemic clones in the last four decades. J Clin Microbiol. 2018 Feb 28, in press.
● Bedouelle (1980) Mutations which alter the function of the signal sequence of the maltose binding protein of Escherichia coli. Nature 285:78-81.
● Land M et al (2015) Insights from 20 years of bacterial genome sequencing. Funct Integr Genomics 15:141–161.
● Pallen MJ (2016) Microbial bioinformatics 2020. Microb Biotechnol 9(5):681-6.
● Octavia et al (2017) Genomic heterogeneity of Salmonella enterica serovar Typhimurium bacteriuria from chronic infection. Infect Genet Evol 51:17-20.
● Carroll et al (2017) Whole-genome sequencing of drug-resistant Salmonella enterica isolated from dairy cattle and humans in New York and Washington States reveals source and geographic associations. Appl Environ Microbiol, in press.
● Wikipedia article on Bioinformatics, retrieved 15 May 2017.
● Didelot et al (2014) Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol Biol Evol 31(7):1869–1879.
● Worby et al (2014) Within-host bacterial diversity hinders accurate reconstruction of transmission networks from genomic distance data. PLoS Comput Biol 10(3): e1003549.
● Worby et al (2015) 'SEEDY' (Simulation of Evolutionary and Epidemiological Dynamics): An R Package to follow accumulation of within-host mutation in pathogens. PLoS ONE 10(6): e0129745.
