
Microbiology

Microbial bioinformatics

Prof Mark Tanaka
School of Biotechnology & Biomolecular Sciences

1. What is bioinformatics? Why does microbiology need bioinformatics?
2. Bioinformatics and molecular evolution: comparing genomes of bacterial species
3. Bioinformatics and molecular epidemiology: variation within species

1. What is bioinformatics?

Wikipedia says:
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data.

...so, using computers to analyse biological data

Why do we need computers to analyse data?

Microbial genetics in 1980 Nature 1980

(in 1977)

Fast forward to the 21st century


All single nucleotide polymorphisms (SNPs) in the genomes of all 14 isolates

Sometimes the SNPs are in genes and sometimes they alter the translated protein

Here's a newer one with 90 whole genome sequences

Here's an even newer one with 815 whole genome sequences

They use a computational technique for summarising the large data set which has a high number of dimensions

Number of bacterial and archaeal genomes sequenced and submitted to NCBI

Explosions and floods in biology

"Most biologists are drowning in too much data, and in desperate need for tools to help them make sense of their massive amounts of sequences."

Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. This deluge of genomic information has, in turn, led to an absolute requirement for computerized databases to store, organize, and index the data and for specialized tools to view and analyze the data.

– National Center for Biotechnology Information USA (NCBI)

Land et al 2015 Funct Integr Genomics (2015) 15:141–161


The field of microbiology has evolved so far that we can no longer approach the data without computers and sophisticated methods for data analysis

Even the design of new research buildings reflects the need for increased computational work

[Figure labels: office, labs]

Park et al 2008 Nature 454, 183-187

Areas to which bioinformatics has contributed

General
● Curating, storing, retrieving biological data

Proteins
● predicting and visualising protein structure using data and algorithms
● predicting and visualising interaction between proteins
● predicting and visualising interaction between proteins and other molecules

Nucleotide sequences
● assembling sequence fragments to obtain the whole genome sequence
● finding genes and motifs in sequences using prediction algorithms
● annotating genomes
● comparing sequences to identify and measure variation
● analysing variation data using computational models

E.g. understanding microbial variation using computational models of population genetics and evolution

Gene and protein expression and regulation
● analysing data from transcriptome experiments
● analysing data from high-throughput mass spectrometry
● modelling gene regulation in the cell as control circuits

Facciotti et al 2007 PNAS 104:4630–4635

Other stuff
● analysing images
● mining text

Here's the table of contents from the Wikipedia site on bioinformatics

Bioinformatics
● Supports molecular and cellular biology with computational tools for data analysis
● Large field that covers most of molecular and cellular biology (bioinformaticians usually don't know everything about all these topics)
● Rapidly evolving field

Biology is complex and bioinformatics is full of complex computational tools
● Sometimes, multiple tools are used in a pipeline
● Can we be confident that each step does what it is supposed to do?

Be a critical thinker!
2. Bioinformatics and molecular evolution: comparing genomes of bacterial species

Sequence data → assembled genome → annotation

A lot of information even in a single genome

https://www.flickr.com/photos/arkadyevna/227697075
Perna et al 2001 Nature 409: 529-533

Coding sequences and intergenic regions
● On average 87% of bacterial genomes are protein-coding
● Typical range: 85-90%
● Some genomes are undergoing decay and have a large number of pseudogenes

Perna et al 2001 Nature 409: 529-533

How many genes do we find across genomes from the same species?

Pan-genome: all the genes you can find across genomes in a species

Core genome: genes shared by all genomes in a species

[Figure panels: Salmonella enterica, Mycobacterium tuberculosis, Escherichia coli]

Land et al 2015 Funct Integr Genomics (2015) 15:141–161

How can we use genetic information to establish evolutionary relationships among species?
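The pan-genome and core genome definitions can be sketched with set operations. This is only an illustration: the isolate names and gene names are made up, and real analyses first have to decide which genes in different genomes count as "the same" gene.

```python
# Hypothetical gene content of three genomes from one species
# (isolate and gene names are invented for illustration).
genomes = {
    "isolate_1": {"geneA", "geneB", "geneC", "geneD"},
    "isolate_2": {"geneA", "geneB", "geneD", "geneE"},
    "isolate_3": {"geneA", "geneB", "geneC", "geneF"},
}


def pan_genome(gene_sets):
    """All genes found in at least one genome (set union)."""
    return set().union(*gene_sets)


def core_genome(gene_sets):
    """Genes shared by every genome (set intersection)."""
    return set.intersection(*list(gene_sets))


pan = pan_genome(genomes.values())
core = core_genome(genomes.values())

print("pan-genome: ", sorted(pan))
print("core genome:", sorted(core))
```

As more genomes are added, the pan-genome can only stay the same or grow, and the core genome can only stay the same or shrink.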

A phylogeny is an evolutionary tree showing the relationships among taxa (often species)

Use genetic information to infer the phylogeny

Statistical calculations can be used for phylogenetic inference

Which tree shows the true relationships among the three taxa?

[Figure: three candidate trees for the taxa Se, Mt and Ec, with likelihoods L1, L2 and L3]

Using an evolution model, compute the likelihood of each scenario

For a given tree, the likelihood is the probability of observing genetic sequences according to a model of evolution

Select the tree with the highest likelihood. This is called the maximum likelihood tree

Finding high likelihood trees can be computationally challenging if there are many species.
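A rough sketch of why the search gets hard. For rooted, bifurcating trees with labelled tips, the number of possible topologies is (2n-3)!! for n taxa, which gives the three candidate trees for three taxa and grows very quickly after that. The likelihood values below are invented placeholders, not results from any real model; they only illustrate picking the tree with the highest likelihood.

```python
def count_rooted_trees(n_taxa):
    """Number of rooted, bifurcating, labelled tree topologies: (2n-3)!!"""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


for n in (3, 4, 5, 10, 20):
    print(n, "taxa:", count_rooted_trees(n), "possible trees")

# Hypothetical likelihoods L1, L2, L3 for the three candidate trees
# (made-up numbers, for illustration only).
likelihoods = {"tree_1": 1.2e-9, "tree_2": 4.5e-8, "tree_3": 3.1e-10}
ml_tree = max(likelihoods, key=likelihoods.get)
print("maximum likelihood tree:", ml_tree)
```

With only three taxa every tree can be scored. With many taxa the number of trees is too large to enumerate, so software has to search heuristically.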

Can we use genetic sequences to establish evolutionary relationships among all cellular lifeforms?

Many genes are not shared by all bacteria

Is there a genetic sequence shared by all cellular lifeforms that evolves slowly enough to be used to infer a tree for all cellular life?

16S ribosomal RNA is often used to reconstruct phylogenies

Phylogeny of bacteria using genome data

5591 sites in 31 proteins

This one was made using the maximum likelihood method. There are other methods.

Wu and Eisen 2008 Genome Biology

Distance-based methods

A vs B: 3 differences
A vs C: 4 differences
A vs D: 3 differences
A vs E: 4 differences
B vs C: 1 difference
B vs D: 2 differences
B vs E: 3 differences

     A   B   C   D   E
A
B    3
C    4   1
D    3   2   3
E    4   3   4   1
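A minimal sketch of how such a table of pairwise differences could be computed from aligned sequences. The five short sequences are invented; they were chosen only so that the counts match the matrix above.

```python
from itertools import combinations

# Hypothetical aligned sequences for taxa A to E (made up so that the
# pairwise counts reproduce the distance matrix on the slide).
sequences = {
    "A": "ACGTGT",
    "B": "ATGTAC",
    "C": "GTGTAC",
    "D": "ACATAC",
    "E": "ACACAC",
}


def count_differences(seq1, seq2):
    """Number of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


# One entry per unordered pair of taxa, e.g. ("A", "B") -> 3.
distances = {
    (x, y): count_differences(sequences[x], sequences[y])
    for x, y in combinations(sequences, 2)
}

for (x, y), d in distances.items():
    print(f"{x} vs {y}: {d}")
```

Distance-based tree methods start from a matrix like this and group the closest taxa first. Here B and C, and D and E, are the closest pairs.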

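Where is the root of the tree?

[Figure: trees relating taxa A, B, C, D and E, one of them including an outgroup]

3. Bioinformatics and molecular epidemiology:
Variation within species

[Figure: two aligned sites in sequences sampled from two species; labels: Substitution, Polymorphic sites]

Species 1: C A, C A, C G, C G, C G, C G
Species 2: A T, A T, A T, A T, A G, A G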

Single nucleotide polymorphisms (SNPs)

Sequencing machine produces thousands of raw reads (sequence fragments)

We want to have so many sequences that each site in the genome is sequenced many times in multiple fragments

Bioinformatics software maps reads to a reference genome

[Figure: raw reads from sequencing machine aligned to a reference genome (a closely related genome that has already been sequenced and assembled)]

Some single nucleotides differ consistently over many reads that map to the same region (likely to be SNPs)
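The idea of "differs consistently over many reads" can be sketched as a very naive SNP caller. Everything here is an assumption for illustration: the reference, the reads, their mapped positions and the thresholds `min_depth` and `min_fraction` are made up, and real tools also use base and mapping quality scores.

```python
from collections import Counter

# Hypothetical reference and reads. Each read is (start position on the
# reference, read sequence), 0-based, and is assumed to be already mapped.
reference = "ACGTACGTAC"
mapped_reads = [
    (0, "ACGTTCG"),
    (1, "CGTTCGT"),
    (2, "GTTCGTA"),
    (3, "TTCGTAC"),
    (0, "ACGTTCC"),  # last base is a one-off difference (likely an error)
]


def call_snps(reference, reads, min_depth=3, min_fraction=0.8):
    """Report sites where most reads agree on a base that differs from
    the reference. Returns (position, reference base, read base) tuples."""
    pileup = [Counter() for _ in reference]
    for start, read in reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads cover this site to decide
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


print(call_snps(reference, mapped_reads))
```

The site covered by five reads that all show T instead of A is reported. The single read with a C at another site is outvoted by the four reads that agree with the reference.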

Single nucleotide polymorphisms (SNPs)

Some polymorphisms in genomes are insertions, deletions or rearrangements. Others are SNPs

Some SNPs are in coding genes and others are in intergenic regions.

SNPs in coding sequences can be
● Nonsynonymous (the translated protein is different)
or
● Synonymous (translated protein is the same)

Nonsynonymous SNPs may be under natural selection: they may be positively or negatively selected

Perna et al 2001 Nature 409: 529-533
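Whether a SNP in a coding sequence is synonymous or nonsynonymous can be checked by translating the codon before and after the change. A minimal sketch using the standard genetic code; the function name `classify_snp` and the example codons are chosen for illustration.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position_in_codon, new_base):
    """Classify a single-base change within one codon.

    position_in_codon is 0, 1 or 2; '*' in the table marks a stop codon.
    """
    mutated = codon[:position_in_codon] + new_base + codon[position_in_codon + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "nonsynonymous"


# CTT and CTC both encode leucine (L).
print(classify_snp("CTT", 2, "C"))
# GAA encodes glutamate (E), GTA encodes valine (V).
print(classify_snp("GAA", 1, "T"))
```

Many changes at the third codon position are synonymous, which is one reason synonymous and nonsynonymous SNPs are counted separately when looking for selection.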


A challenge: can we identify what forces lead to the patterns of variation we see in nature?

Phylogenetic methods can help to identify the strains that are in a patient but they don't model natural selection explicitly

Computational models of bacterial populations evolving can explore alternative hypotheses and examine which ones can explain the data

Recent work has attempted to integrate epidemiological and genomic data in a single analysis

[Figure panels:]
● Phylogeny-like tree constructed using SNPs from Mycobacterium tuberculosis
● Network of transmission events inferred using genetic data
● Network of transmission events inferred using genetic data and epidemiological information

Didelot et al 2014 Mol. Biol. Evol. 31(7):1869–1879

Computer simulation models can check the accuracy of methods of inference using genomic data

A simulation model of an epidemic with genome evolution of pathogens within hosts

Another simulation model including contact network among hosts

Worby et al 2014 PLoS Comput Biol 10(3): e1003549
Worby CJ, Read TD (2015) PLoS ONE 10(6): e0129745

Large volumes of genome sequences present an opportunity to understand microbial evolution and epidemiology with greater resolution

A challenge for bioinformatics: building computational models for genomic epidemiology that are realistic but efficient enough to be used for inference

Simple, tractable, easy to interpret      Complex, confusing, computationally slow
Unrealistic, simplistic                   Realistic, based on known mechanisms

References

● Arai N et al (2018) Phylogenetic characterization of Salmonella enterica serovar Typhimurium and its monophasic variant isolated from food animals in Japan revealed replacement of major epidemic clones in the last four decades. J Clin Microbiol. 2018 Feb 28 In press.
● Bedouelle (1980) Mutations which alter the function of the signal sequence of the maltose binding protein of Escherichia coli. Nature 285, 78-81.
● Land M et al (2015) Insights from 20 years of bacterial genome sequencing, Funct Integr Genomics 15:141–161.
● Pallen MJ (2016) Microbial bioinformatics 2020. Microb Biotechnol Sep;9(5):681-6.
● Octavia et al (2017) Genomic heterogeneity of Salmonella enterica serovar Typhimurium bacteriuria from chronic infection, Infect Genet Evol. 51:17-20.
● Carroll et al (2017) Whole-genome sequencing of drug-resistant Salmonella enterica isolated from dairy cattle and humans in New York and Washington States reveals source and geographic associations, in press, Appl Environ Microbiol.
● Wikipedia article on Bioinformatics, retrieved 15 May 2017.
● Didelot et al (2014) Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol Biol Evol 31(7):1869–1879.
● Worby et al (2014) Within-host bacterial diversity hinders accurate reconstruction of transmission networks from genomic distance data. PLoS Comput Biol 10(3): e1003549.
● Worby et al (2015) 'SEEDY' (Simulation of Evolutionary and Epidemiological Dynamics): An R Package to follow accumulation of within-host mutation in pathogens, PLoS ONE 10(6): e0129745.
