Basic Concepts of
Bioinformatics
TOC
Introduction
Basic concepts in Molecular biology
Bioinformatics techniques
Areas in bioinformatics
Applications
Related Computer Technology
Conference in Glasgow
Acknowledgements
Reference
M.Alroy Mascrenghe 2
Introduction……
2000
A major event happened that was to change the course of human history
It was a joint British and American effort
nothing to do with IRAQ!
It was a race: who would finish first?
Race test: not whether they had taken drugs but whether they could produce them!
The human genome was sequenced
A Situ…somewhere in the
near future
A virus (not the 'I love you' virus) creates an epidemic
Geneticists and bioinformaticians roll up their sleeves
The genetic material of the virus is compared with the existing base of known genetic material of other viruses
As the characteristics of those other viruses are known, similarities point to how the new virus behaves
From the genetic material, computer programs derive the proteins necessary for the survival of the virus
When a protein (its sequence and structure) is known, medicines can be designed against it
What is Bioinformatics?
The marriage between computer
science and molecular biology
The algorithms and techniques of computer science are being used to solve the problems faced by molecular biologists
‘Information technology applied to
the management and analysis of
biological data’
Storage and Analysis are two of the
important functions – bioinformaticians
build tools for each
[Diagram: Bioinformatics at the meeting point of Biology, Chemistry, Computer Science and Statistics]
What is..
This is the age of information technology
However, storing info is nothing new
Information equal in volume to the Encyclopaedia Britannica is stored in each of our cells
‘Bioinformatics tries to determine
what info is biologically important’
Basics
of
Molecular Biology….
DNA & Genes
DNA is where the genetic information is
stored
Traits such as blond hair and blue eyes are inherited through it
Gene - The basic unit of heredity
There are genes for characteristics, e.g. a gene for blond hair
Genes contain the information as a sequence of nucleotides
Genes are abstract concepts, like lines of longitude and latitude, in the sense that you cannot see them separately
Genes are made up of nucleotides
Nucleotide (nt)
Each nt is made up of
Sugar
Phosphate group
Base
The base is the only difference between one nt and another
There are 4 different bases
G(uanine), A(denine), T(hymine), C(ytosine)
The information is in the order of the nucleotides: the order is the info
Genes can be many thousands of nt long
The complete set of genetic instructions is called the genome
Chromosomes
Double Helix
The DNA is a double helix
Each strand has complementary
information
Each particular base in one strand is bonded with a particular partner base in the other strand
G - C
A - T
For example -
AATGC one strand
TTACG other strand
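The pairing rule can be sketched in a few lines of Python. This is only an illustration: the names `PAIRS`, `complement` and `reverse_complement` are made up for this sketch, and the second function rests on the added fact that the two strands of a real double helix run in opposite directions.

```python
# Base pairing as listed on the slide: G-C and A-T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("AATGC"))          # TTACG, the other strand in the example
print(reverse_complement("AATGC"))  # GCATT
```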
Proteins
Proteins are very important biological molecules
Amino acids make up the proteins
There are 20 different amino acids
The function of a protein is dependent on the order of the amino acids
Proteins…
The information required to assemble amino acids (aa) into proteins is stored in DNA
DNA sequence determines amino acid
sequence
Amino Acid sequence determines protein
structure
Protein structure determines protein
function
A substance called RNA carries the info stored in the DNA, and that info is in turn used to make proteins
Storage - DNA
Information Transfer - RNA
RNA is the messenger boy!
Central dogma
Proteins…..
Since there are 20 amino acids to encode, one nt cannot correspond to one aa (only 4 possibilities), and neither can a pair of nt (4 x 4 = 16 possibilities)
So protein information is carried in triplet codes called codons (4 x 4 x 4 = 64 possibilities)
The codons that do not correspond to an amino acid are stop codons: UAA, UAG, UGA (RNA has U instead of T)
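A minimal sketch of the triplet code in Python follows. The 64-letter string is the standard genetic code written in U, C, A, G order, with `*` marking stop codons. The names `CODON_TABLE` and `translate` are illustrative only, and the sketch assumes the RNA is already in the correct reading frame.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' is a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(rna):
    """Translate RNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(4 ** 1, 4 ** 2, 4 ** 3)  # 4 16 64: only triplets can cover 20 amino acids
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "*"))  # the stop codons
print(translate("AUGGCCUAA"))  # MA
```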
Bioinformatics
Techniques…..
Prediction and Pattern
Recognition
The two main areas of bioinformatics
are
Pattern recognition
‘A particular sequence or structure has
been seen before’ and that a particular
characteristic can be associated with it
Prediction
From a sequence (what we know) we
can predict the structure and function
(what we don’t know)
Dot plots….
Simple way of evaluating
similarity between two
sequences
One sequence is placed along one axis of a graph and the other sequence along the other axis
Wherever the two sequences match, the graph is marked
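A dot plot is easy to sketch as text. In the snippet below, `dot_plot` and `show` are hypothetical helper names, the two sequences are just sample input, and `*` and `.` are arbitrary choices for marked and unmarked cells.

```python
def dot_plot(seq_a, seq_b):
    """One row per character of seq_b, one column per character of seq_a.

    A cell holds '*' where the two characters match and '.' otherwise.
    """
    return ["".join("*" if a == b else "." for a in seq_a) for b in seq_b]


def show(seq_a, seq_b):
    """Print the plot with seq_a along the top and seq_b down the side."""
    print("  " + seq_a)
    for b, row in zip(seq_b, dot_plot(seq_a, seq_b)):
        print(b + " " + row)


show("TTACTATA", "TAGATA")
```

Runs of `*` along a diagonal show stretches where the two sequences agree.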
Alignments
A matching of the characters of two or more sequences to show their similarity
E.g.
TTACTATA
TAGATA
There are many ways to align the above two sequences
1.
TTACTATA
TAGATA
2.
TTACTATA
 TAGATA
3.
TTACTATA
  TAGATA
So which one do we choose, and on what basis?
The solution is to assign a match score and a mismatch score, and pick the alignment with the best total
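As a rough illustration, the snippet below slides the shorter sequence along the longer one without gaps and totals the score at each position. The values +1 for a match and -1 for a mismatch are assumptions made for this sketch, not values from the slides, and `ungapped_scores` is a made-up name.

```python
MATCH, MISMATCH = 1, -1  # assumed scores, chosen only for illustration


def ungapped_scores(long_seq, short_seq):
    """Score every gap-free placement of short_seq along long_seq."""
    scores = []
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        scores.append(sum(MATCH if a == b else MISMATCH
                          for a, b in zip(window, short_seq)))
    return scores


print(ungapped_scores("TTACTATA", "TAGATA"))  # [0, -2, 0]
```

With these assumed scores, two placements tie. Gaps offer a way to break such ties.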
Gaps
Introduce gaps and a penalty
score for gaps
TTACTATA
T_A_GATA
Scoring Matrix
For DNA/protein sequence alignment we create a matrix
A aligned with A scores 1
A aligned with T scores -5
A aligned with C scores -1
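A small sketch of scoring a gapped alignment with such a matrix is shown below. Only the three listed pair scores come from the slide. The gap penalty of -2, the default of -1 for unlisted mismatches, and the rule that every identical pair scores 1 are all assumptions for illustration, as are the names used.

```python
GAP = "_"
GAP_PENALTY = -2        # assumed; the slide gives no value
DEFAULT_MISMATCH = -1   # assumed for pairs the slide does not list
LISTED = {("A", "A"): 1, ("A", "T"): -5, ("A", "C"): -1}  # from the slide


def pair_score(a, b):
    """Look up the score of one aligned pair of bases."""
    if a == b:
        return 1  # assumes every identical pair scores like A/A
    return LISTED.get((a, b), LISTED.get((b, a), DEFAULT_MISMATCH))


def alignment_score(row_a, row_b):
    """Total score of two equal-length alignment rows; '_' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(GAP_PENALTY if GAP in (a, b) else pair_score(a, b)
               for a, b in zip(row_a, row_b))


print(alignment_score("TTACTATA", "T_A_GATA"))  # 0
```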
Dynamic Programming
As the lengths of the sequences increase, and as the difference in length between the two sequences increases, more gaps have to be inserted in various places
We cannot perform an exhaustive search
Combinatorial explosion occurs: there are too many combinations to search
Dynamic programming avoids this by building the best alignment from the best alignments of shorter prefixes of the two sequences, so each sub-problem is solved only once and the result is still optimal
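One well-known dynamic programming scheme for global alignment is the Needleman-Wunsch recurrence. The slides do not name a method, so treat the choice as an assumption. The sketch below computes only the best score, and its default scores of match +1, mismatch -1 and gap -2 are illustrative values.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch style)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap   # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap   # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,                # pair a[i-1] with b[j-1]
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]


print(nw_score("TTACTATA", "TAGATA"))  # 0
```

The table has (len(a) + 1) x (len(b) + 1) cells and each is filled once, so the work grows with the product of the two lengths rather than with the number of possible alignments.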
Databases
Sequence info is stored in databases so that it can be manipulated easily
The databases (next slide) are located at different places
They exchange info on a daily basis so that they stay up to date and in sync
Primary db: sequence data
Major Primary DB
Nucleic Acid      Protein
EMBL (Europe)     PIR (Protein Information Resource)
GenBank (USA)     MIPS
DDBJ (Japan)      SWISS-PROT (University of Geneva, now with EBI)
                  TrEMBL (a supplement to SWISS-PROT)
                  NRL-3D
Composite DB
As there are many databases, which one should we search? Some are good in some aspects and weak in others
A composite db is the answer: it uses several databases as its base data
Searching a composite db is indexed and streamlined so that the same stored sequence is not searched twice in different databases
Composite DB
GenBank
NRL-3D
Secondary db
Store secondary structure info
or results of searches of the
primary db
Compo DB     Primary Source
PROSITE      SWISS-PROT
PRINTS       OWL
Database Searches
Many genes have already been sequenced and identified, so we know what they do
The sequences are stored in databases
So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases
Since the databases hold a large number of sequences, we cannot do a full sequence alignment against each and every one
So heuristics must be used.
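One common family of search heuristics indexes short words (k-mers) and fully aligns only the sequences that share words with the query. The sketch below illustrates that idea only; it is not the algorithm of any particular search tool. The database contents, the word length `k = 3` and the function names are all hypothetical.

```python
from collections import defaultdict


def build_index(database, k=3):
    """Map every k-letter word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidates(query, index, k=3):
    """Rank database sequences by how many words they share with the query."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    return sorted(hits, key=lambda name: (-hits[name], name))


# Hypothetical toy database, for illustration only.
db = {"seqA": "TTACTATA", "seqB": "GGGCCCGG", "seqC": "CTATAGG"}
print(candidates("ACTAT", build_index(db)))  # ['seqA', 'seqC']
```

Only the shortlisted sequences would then go through a full alignment, which is where the time is saved.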
Areas in
Bioinformatics…
Genomics
In a multicellular organism, each cell type expresses its genes in a different way, although every cell has the same genetic content
i.e. all the information for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them
Genomics - Finding Genes
Finding a gene in sequence data is like finding a needle in a haystack
However, whereas a needle looks different from the hay, genes look no different from the rest of the sequence data
In a whole array of nt we try to find a set of nt and mark its borders as a gene
This is one of the challenges of bioinformatics
Neural networks and dynamic programming are being employed
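The methods named above are beyond a short example, so the sketch below swaps in a much simpler technique: scanning for open reading frames (a start codon ATG followed, in the same frame, by a stop codon). It only illustrates what "marking the borders" means. It looks at one strand only, and `find_orfs` and `min_codons` are made-up names.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) positions of ATG...stop stretches on one strand.

    end is exclusive and includes the stop codon; min_codons counts the
    codons before the stop, including the ATG itself.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


print(find_orfs("CCATGGCTTAAGG"))  # [(2, 11)], i.e. ATGGCTTAA
```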
Organism | Genome Size (Mb = bp * 1,000,000) | Gene Number | Web Site
Protein Structure Prediction
Structure Prediction methods
Comparative Modeling
The target protein's structure is modelled by comparison with related proteins
Proteins with similar sequences are searched for known structures
Phylogenetics
The taxonomical system reflects
evolutionary relationships
Phylogenetic trees show evolutionary relationships as a picture/graph
Rooted trees have a single common ancestor
Unrooted trees show only the relationships
Phylogenetic tree reconstruction algorithms
are also an area of research
Applications….
Medical Implications
Pharmacogenomics
Not all drugs work on all patients; some good drugs cause death in some patients
So by doing a gene analysis before the treatment, the offending drugs can be avoided
Also, drugs which cause death to most can be used on the minority whose genes are well suited to that drug (volunteers wanted!)
Customized treatment
Gene Therapy
Replace or supply the defective or missing gene
E.g. insulin, or Factor VIII for haemophilia
BioWeapons (??)
Diagnosis of Disease
Identifying the genes that cause a disease helps detect it at an early stage, e.g. Huntington's disease
Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
Death in 10-15 years
The gene responsible for the disease has been identified
It contains excessively repeated sections of CAG
So once the analysis is done, the couple can be counseled
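Counting such repeats is a simple pattern-matching task. The sketch below finds the longest unbroken run of CAG in a sequence. The function name and the sample sequence are made up, and no clinical threshold is implied.

```python
import re


def longest_cag_run(dna):
    """Return the number of CAG units in the longest uninterrupted run."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


print(longest_cag_run("AACAGCAGCAGTTCAGCAG"))  # 3
```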
Drug Design
Can take up to 15 years and cost $700 million
One of the goals of bioinformatics is to reduce the time and cost involved
The process
Discovery
Computational methods can improve this
Testing
Discovery
Target identification
Identify the molecule on which the germ relies for its survival
Then develop another molecule, i.e. a drug, which will bind to the target
So the germ will not be able to interact with the target
Proteins are the most common targets
Discovery…
For example, HIV produces HIV protease, a protein which in turn cuts other proteins
HIV protease has an active site where it binds to other molecules
So an HIV drug will go and bind to that active site
Easier said than done!
Related Computer
Technology………….
PERL
Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings
The default CGI language
It started out as a scripting language but has become a fully fledged language
It has everything now, even web service support
http://bio.perl.org
Databases and Mining
Many of the sequence databases are publicly available
As databases are involved, various data mining techniques are used to pull the data out
As there is a lot of literature (articles etc.) in this area, data mining on the literature, rather than on the sequence data, has also become a PhD topic for many
European Molecular Biology
Network (EMBnet)
A central system for sharing, training and centralizing up-to-date bio info
Some of the EMBnet sites are:
SEQNET
http://www.seqnet.dl.ac.uk
UCL
http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
EBI – European Bioinformatics
Institute
www.ebi.ac.uk
References
Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics
Arthur M. Lesk, Introduction to Bioinformatics