
How Bioinformatics can change your life

Basic Concepts of
Bioinformatics
TOC
- Introduction
- Basic concepts in Molecular biology
- Bioinformatics techniques
- Areas in bioinformatics
- Applications
- Related Computer Technology
- Conference in Glasgow
- Acknowledgements
- Reference
M.Alroy Mascrenghe 2
Introduction……

2000
- A major event took place that was to change the course of human history
- It was a joint British and American effort (nothing to do with Iraq!)
- It was a race: who would finish first?
- A race test: not whether the runners had taken drugs, but whether they could produce them!
- The human genome was sequenced (a working draft was announced that year)

A Situ…somewhere in the near future
- A virus (not the 'I love you' virus) creates an epidemic
- Geneticists and bioinformaticians roll up their sleeves
- The genetic material of the virus is compared with the existing base of known genetic material of other viruses, whose characteristics are already known
- From the genetic material, computer programs derive the proteins the virus needs for its survival
- Once a protein (sequence and structure) is known, medicines can be designed against it

What is
- The marriage of computer science and molecular biology
- The algorithms and techniques of computer science are used to solve the problems faced by molecular biologists
- 'Information technology applied to the management and analysis of biological data'
- Storage and analysis are two of its important functions; bioinformaticians build tools for each

[Diagram: Bioinformatics at the meeting point of Biology, Chemistry, Computer Science and Statistics]

What is..
- This is the age of information technology
- However, storing info is nothing new
- Information equal in volume to the Encyclopaedia Britannica is stored in each of our cells
- 'Bioinformatics tries to determine what info is biologically important'

Basics of Molecular Biology….

DNA & Genes
- DNA is where the genetic information is stored
- Traits such as blonde hair and blue eyes are inherited through it
- Gene: the basic unit of heredity
- There are genes for characteristics, e.g. a gene for blond hair
- Genes contain their information as a sequence of nucleotides
- Genes are abstract concepts, like lines of longitude and latitude, in the sense that you cannot see them separately
- Genes are made up of nucleotides

Nucleotide (nt)
- Each nt is made up of
  - Sugar
  - Phosphate group
  - Base
- The base it contains is the only difference between one nt and another
- There are 4 different bases
  - G(uanine), A(denine), T(hymine), C(ytosine)
- The information lies in the order of the nucleotides: the order is the info
- Genes can be many thousands of nt long
- The complete set of genetic instructions is called the genome

Chromosomes
- DNA strings make up chromosomes
- Analogy
  - Letters: nt
  - Sentences: genes
  - Individual volumes of the Encyclopaedia Britannica: chromosomes
  - All volumes together: genome

Double Helix
- DNA is a double helix
- Each strand carries complementary information
- Each base in one strand is bonded to one particular base in the other strand
  G - C
  A - T
- For example
  - AATGC one strand
  - TTACG other strand

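The pairing rule can be checked mechanically. The following is a minimal Python sketch, assuming a strand is written with the four letters A, T, G and C; the names `PAIR`, `complement` and `reverse_complement` are illustrative and do not come from the slides.

```python
# Base pairing from the slide: G pairs with C, A pairs with T.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner base for every base, position by position."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("AATGC"))          # the slide's example: TTACG
print(reverse_complement("AATGC"))  # same strand read the other way
```

Because one strand fully determines the other, storing a single strand is enough to recover both.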
Proteins
- Proteins are a very important class of biological molecules
- Amino acids (aa) make up proteins
- There are 20 different amino acids
- The function of a protein depends on the order of its amino acids

Proteins…
- The information required to assemble aa into proteins is stored in DNA
- DNA sequence determines amino acid sequence
- Amino acid sequence determines protein structure
- Protein structure determines protein function
- A substance called RNA carries the info stored in the DNA, which in turn is used to make proteins
- Storage: DNA
- Information transfer: RNA
- RNA is the messenger boy!

Central dogma

DNA --transcription (RNA polymerase)--> RNA --translation (ribosomes)--> Protein

Proteins…..
- There are 20 amino acids to encode, so one nt (4 possibilities) cannot correspond to one aa; neither can a pair of nt (4 x 4 = 16 possibilities)
- So protein information is carried in triplet codes, called codons (4 x 4 x 4 = 64 possibilities)
- The codons that do not correspond to an amino acid are stop codons: UAA, UAG, UGA (RNA has U instead of T)
- AUG is used as the start codon and also codes for methionine

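The codon idea can be sketched in a few lines of Python. This sketch assumes the standard genetic code and simplifies transcription to "copy the coding strand and replace T with U"; the function names and the one-letter amino acid output are illustrative choices, not part of the slides.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
# '*' marks a stop codon; letters are one-letter amino acid codes.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna: str) -> str:
    """Simplified transcription: the RNA copy has U wherever the DNA has T."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read codons from the first AUG until a stop codon (or the end)."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))                        # 64 triplets
print(translate(transcribe("CCATGGCCTAAGG")))  # AUG GCC UAA
```

The table has 64 entries for 20 amino acids plus stop, which is why several codons map to the same amino acid.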
Protein Structure
- Shows a wide variety, as opposed to DNA, whose structure is uniform
- X-ray crystallography or Nuclear Magnetic Resonance (NMR) is used to work out the structure
- Structure is related to function; or rather, structure determines function
- Although proteins are made as a linear chain of aa, they fold into a 3D structure
- If you stretch them out and let go, they return to this structure: the native structure of the protein
- Only in its native structure does a protein function well
- Even after translation is over, a protein goes through some changes to its structure

Gene Expression
- Gene expression: the process of transcribing DNA into RNA and translating the RNA to make protein
- Where do the genes begin in a chromosome?
- How does RNA polymerase identify the beginning of a gene?
- A single nt cannot mark the beginning of a gene, because each nt occurs far too frequently
- But a particular combination of nucleotides can
- Promoter sequences: particular orders of nt that mark the beginning of a gene

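The argument about frequency can be made concrete with a small Python sketch. It assumes an exact-match search and uses TATAAT purely as an illustrative promoter-like pattern; real promoters vary between organisms and are matched less strictly. The sequence and the function name `find_motif` are made up for the example.

```python
def find_motif(sequence: str, motif: str) -> list[int]:
    """Return every 0-based position where motif occurs (overlaps allowed)."""
    hits = []
    pos = sequence.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = sequence.find(motif, pos + 1)
    return hits


dna = "GGTATAATGCTATAATCC"           # toy sequence, not real data
print(len(find_motif(dna, "T")))      # a single nt occurs all over the place
print(find_motif(dna, "TATAAT"))      # a 6-nt combination is far more specific
```

In a random sequence with equal base frequencies a given 6-nt pattern is expected about once every 4^6 = 4096 positions, whereas a single nt is expected at one position in four.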
Bioinformatics Techniques…..

Prediction and Pattern Recognition
- The two main areas of bioinformatics are
  - Pattern recognition
    - 'A particular sequence or structure has been seen before', and a particular characteristic can be associated with it
  - Prediction
    - From a sequence (what we know) we can predict the structure and function (what we don't know)

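One common way to record "a pattern that has been seen before" is a short motif written in the PROSITE pattern style. The Python sketch below converts a small subset of that notation to a regular expression and scans a protein with it. It assumes only `x`, `[...]`, `{...}` and `(n)` / `(n,m)` elements are used; the protein string is a toy example, and the function names are illustrative.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a Python regular expression.

    Handles only: x (any residue), [ABC] (one of), {ABC} (none of),
    and repeat counts such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        parts.append(element)
    return "".join(parts)


def find_pattern(pattern: str, protein: str) -> list[int]:
    """Return 0-based start positions of every (possibly overlapping) match."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(protein)]


# N-{P}-[ST]-{P}: N, then anything but P, then S or T, then anything but P.
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_pattern("N-{P}-[ST]-{P}", "MKNGSAANPTQ"))
```

A hit says only that the sequence contains the pattern; whether the associated characteristic really applies still has to be judged biologically.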
Dot plots….
- A simple way of evaluating similarity between two sequences
- One sequence is laid along one axis of a grid and the other sequence along the other axis
- Wherever the two sequences match, the grid is marked
- Stretches of similarity then show up as diagonal lines

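A dot plot is simple enough to draw as text. The Python sketch below marks every matching pair of positions with `*`; it assumes single-character matching with no window or threshold, and the function name `dot_plot` is illustrative.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """Return text rows: seq_b across the top, seq_a down the side,
    '*' where the two characters match and '.' where they do not."""
    rows = ["  " + seq_b]
    for a in seq_a:
        marks = "".join("*" if a == b else "." for b in seq_b)
        rows.append(a + " " + marks)
    return rows


for row in dot_plot("TTACTATA", "TAGATA"):
    print(row)
```

With only four letters in the DNA alphabet many marks are chance matches, which is why it is the diagonal runs, not the individual marks, that matter.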
Alignments
- Matching up the characters of two or more sequences to show their similarity
- E.g.
    TTACTATA
    TAGATA
- There are many ways to align these two sequences, for instance by sliding the shorter one along the longer one
  1.
    TTACTATA
    TAGATA
  2.
    TTACTATA
     TAGATA
  3.
    TTACTATA
      TAGATA
- So which one do we choose, and on what basis?
- The solution is to assign a match score and a mismatch score, and to prefer the alignment with the best total

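Scoring makes the choice between placements objective. The Python sketch below scores the shorter sequence at each possible offset along the longer one, without gaps. The scores of +1 for a match and -1 for a mismatch are assumed values for illustration; the slides do not fix them.

```python
def score_ungapped(long_seq: str, short_seq: str, offset: int,
                   match: int = 1, mismatch: int = -1) -> int:
    """Score short_seq placed at the given offset along long_seq (no gaps)."""
    window = long_seq[offset:offset + len(short_seq)]
    return sum(match if a == b else mismatch for a, b in zip(window, short_seq))


long_seq, short_seq = "TTACTATA", "TAGATA"
for offset in range(len(long_seq) - len(short_seq) + 1):
    print(offset, score_ungapped(long_seq, short_seq, offset))
```

Under these assumed scores, two of the three placements tie, which hints at why gaps are needed to do better.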
Gaps
- Introduce gaps, with a penalty score for each gap
    TTACTATA
    T_A_GATA
- However, not all gaps are bad
    TTGCAATCT
    CAA
- How do we align?
    TTGCAATCT
    ---CAA---
- These end gaps are not biologically significant, so they should not be penalized
- Semi-global alignments score the alignment while ignoring gaps at the ends

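The difference between penalizing every gap and ignoring end gaps can be sketched in Python as follows. The scores (match +1, mismatch -1, gap -2) are assumed for illustration, and the function takes an alignment that has already been written out with gap characters.

```python
GAP_CHARS = "-_"  # the slides use both '_' and '-' to mark a gap


def _has_gap(column) -> bool:
    return any(c in GAP_CHARS for c in column)


def score_alignment(top: str, bottom: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, free_end_gaps: bool = False) -> int:
    """Score two aligned rows column by column.

    With free_end_gaps=True, gap columns at either end are ignored,
    which is the semi-global way of scoring.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(top, bottom))
    if free_end_gaps:
        while columns and _has_gap(columns[0]):
            columns.pop(0)
        while columns and _has_gap(columns[-1]):
            columns.pop()
    total = 0
    for a, b in columns:
        if _has_gap((a, b)):
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("TTACTATA", "T_A_GATA"))
print(score_alignment("TTGCAATCT", "---CAA---"))
print(score_alignment("TTGCAATCT", "---CAA---", free_end_gaps=True))
```

Scored globally, the perfect CAA match is buried under six gap penalties; scored semi-globally, only the three matching columns count.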
Scoring Matrix
- For DNA/protein sequence alignment we create a matrix that gives a score for every pair of characters
- For example
  - A aligned with A scores 1
  - A aligned with T scores -5
  - A aligned with C scores -1

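A scoring matrix is just a lookup table. The Python sketch below holds the three values given on the slide and falls back to assumed defaults (+1 for any other match, -1 for any other mismatch) for pairs the slide does not list; those defaults and the function names are illustrative only.

```python
# The three pair scores stated on the slide.
SLIDE_SCORES = {("A", "A"): 1, ("A", "T"): -5, ("A", "C"): -1}


def pair_score(a: str, b: str, default_match: int = 1,
               default_mismatch: int = -1) -> int:
    """Look up a pair in either order; use assumed defaults if it is not listed."""
    if (a, b) in SLIDE_SCORES:
        return SLIDE_SCORES[(a, b)]
    if (b, a) in SLIDE_SCORES:
        return SLIDE_SCORES[(b, a)]
    return default_match if a == b else default_mismatch


def score_columns(top: str, bottom: str) -> int:
    """Sum the pair scores of two equal-length, ungapped rows."""
    return sum(pair_score(a, b) for a, b in zip(top, bottom))


print(pair_score("T", "A"))
print(score_columns("AAC", "ATA"))
```

Giving different mismatches different scores lets the matrix express that some substitutions are less plausible than others.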
Dynamic Programming
- As the query sequences get longer and the difference in their lengths grows, more gaps have to be inserted in more places
- We cannot perform an exhaustive search
- Combinatorial explosion occurs: there are too many combinations to search through
- Dynamic programming avoids this by building the best alignment from the best alignments of shorter prefixes, so each sub-problem is solved only once

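A classic dynamic programming method for global alignment is the Needleman-Wunsch algorithm, which the slides do not name; it is used here only as an illustration. The Python sketch below assumes a linear gap penalty and the scores match +1, mismatch -1, gap -2.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> tuple[int, str, str]:
    """Return (best score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right cell to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("TTACTATA", "TAGATA")
print(best)
print(row_a)
print(row_b)
```

The table has (len(a) + 1) x (len(b) + 1) cells and each is filled once, so the work grows with the product of the two lengths instead of with the number of possible gap placements.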
Databases
- Sequence info is stored in databases (db) so that it can be manipulated easily
- The db (next slide) are located at different places
- They exchange info daily so that they stay up to date and in sync
- Primary db: sequence data

Major Primary DB

Nucleic Acid      Protein
EMBL (Europe)     PIR (Protein Information Resource)
GenBank (USA)     MIPS
DDBJ (Japan)      SWISS-PROT (University of Geneva, now with EBI)
                  TrEMBL (a supplement to SWISS-PROT)
                  NRL-3D

Composite DB
- With so many db, which one should we search? Some are strong in some aspects and weak in others
- A composite db is the answer: it draws on several db for its base data
- Searching is indexed and streamlined so that the same sequence stored in different db is not searched twice

Composite DB
- OWL draws on these primary db
  - SWISS-PROT (top priority)
  - PIR
  - GenBank
  - NRL-3D

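The idea of a non-redundant composite can be sketched in Python. This is not OWL's actual build procedure: it assumes that two entries are "the same" only when their sequences are identical, and the entry IDs and sequences below are made up. Only the priority order follows the slide.

```python
def build_composite(sources: dict[str, dict[str, str]],
                    priority: list[str]) -> dict[str, tuple[str, str]]:
    """Merge several databases, keeping one entry per distinct sequence.

    sources maps a database name to {entry_id: sequence}.
    The result maps each sequence to the (database, entry_id) that won,
    i.e. the first database in `priority` that contains it.
    """
    composite = {}
    for name in priority:
        for entry_id, sequence in sources.get(name, {}).items():
            composite.setdefault(sequence, (name, entry_id))
    return composite


# Hypothetical toy entries, for illustration only.
sources = {
    "PIR": {"pir_1": "MKVLA", "pir_2": "MAAGT"},
    "SWISS-PROT": {"sp_1": "MKVLA"},
}
priority = ["SWISS-PROT", "PIR", "GenBank", "NRL-3D"]
for sequence, origin in build_composite(sources, priority).items():
    print(sequence, origin)
```

Each distinct sequence is then searched once, and the copy from the highest-priority source is the one that is kept.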
Secondary db
- Store secondary structure info, or the results of searches of the primary db

Compo DB      Primary Source
PROSITE       SWISS-PROT
PRINTS        OWL

Database Searches
- We have sequenced and identified many genes, so for those we know what they do
- Their sequences are stored in databases
- So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases
- Since the databases hold a large number of sequences, we cannot run a full sequence alignment against each and every one
- So heuristics must be used

Areas in Bioinformatics…

Genomics
- In a multicellular organism each cell type expresses its genes in a different way, although every cell has the same genetic content
- i.e. all the information needed for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them

Genomics - Finding Genes
- Finding a gene in sequence data is like finding a needle in a haystack
- However, whereas a needle is different from the hay, genes do not look different from the rest of the sequence data
- In a whole array of nt, we try to find a set of nt and mark its borders as a gene
- This is one of the challenges of bioinformatics
- Neural networks and dynamic programming are being employed

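Real gene finders use the statistical methods named above. As a much simpler stand-in, the Python sketch below scans for open reading frames: stretches that begin with ATG and run, codon by codon, to a stop codon. It assumes only the forward strand is searched and ignores everything else that distinguishes a real gene; the sequence is a toy example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ATG...stop stretches.

    All three forward reading frames are scanned. `end` is exclusive and
    includes the stop codon; min_codons counts codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


dna = "CCATGGCTTGATAAATGTAA"  # toy sequence, not real data
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

Raising `min_codons` filters out short chance hits, at the cost of missing genuinely short genes.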
Organism       Genome size (Mb)   Gene number   Web site
Yeast          13.5               6,241         http://genome-www.stanford.edu/Saccharomyces
Fruit flies    180                13,601        http://flybase.bio.indiana.edu
Homo sapiens   3,000              45,000        http://www.ncbi.nlm.nih.gov/genome/guide

1 Mb = 1,000,000 bp

Proteomics
- The proteome is the sum total of an organism's proteins
- More difficult than genomics
  - DNA has 4 building blocks; proteins have 20
  - DNA has a simple chemical makeup; proteins have a complex one
  - DNA can be duplicated; proteins can't
- We are entering the 'post-genome era'
- Meaning much has been done with genes, not that it's all over

Proteomics…..
- An RNA and the final protein it codes for can differ considerably
- After translation, proteins do change
- So the aa sequence tells us nothing about post-translational changes
- Proteins are not active until they are combined into a larger complex or moved to the relevant location inside or outside the cell
- So the aa sequence only hints at these things
- Also, proteins must be handled more carefully in the lab, as they tend to change when in contact with unsuitable material

Protein Structure Prediction
- One of the biggest challenges of bioinformatics, and especially of biochemistry
- There is as yet no algorithm that consistently predicts the structure of proteins

Structure Prediction methods
- Comparative modeling
  - The target protein's structure is modeled by comparison with related proteins
  - Proteins with similar sequences are searched for known structures

Phylogenetics
- The taxonomical system reflects evolutionary relationships
- Phylogenetic trees depict evolutionary relationships as a picture/graph
- Rooted trees: there is a single common ancestor
- Unrooted trees: show only the relationships
- Phylogenetic tree reconstruction algorithms are also an area of research

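One classic tree reconstruction method is UPGMA, which repeatedly joins the two closest groups and yields a rooted tree; the slides do not name a method, so it is used here only as an illustration. The Python sketch below assumes the sequences are already aligned and of equal length, and measures distance as the fraction of differing positions. The species names and sequences are invented toy data.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(members_a, members_b, seqs) -> float:
    """Average distance over all pairs of leaves, one from each cluster."""
    pairs = [(x, y) for x in members_a for y in members_b]
    return sum(p_distance(seqs[x], seqs[y]) for x, y in pairs) / len(pairs)


def upgma(seqs: dict[str, str]):
    """Return a rooted tree as nested tuples of leaf names."""
    # Each cluster is (tree so far, list of leaf names it contains).
    clusters = [(name, [name]) for name in sorted(seqs)]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]][1],
                                            clusters[ij[1]][1], seqs),
        )
        (tree_i, leaves_i), (tree_j, leaves_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), leaves_i + leaves_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


toy = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACGAACTA"}
print(upgma(toy))
```

The nesting of the tuples is the tree: the pair joined first sits deepest, and the outermost tuple is the root.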
Applications….

Medical Implications
- Pharmacogenomics
  - Not all drugs work on all patients; some good drugs cause death in some patients
  - So by doing a gene analysis before treatment, the offending drugs can be avoided
  - Also, drugs that cause death in most patients can be used on the minority whose genes suit that drug (volunteers wanted!)
  - Customized treatment
- Gene therapy
  - Replace or supply the defective or missing gene
  - E.g. insulin, and Factor VIII for haemophilia
- Bioweapons (??)

Diagnosis of Disease
- Identifying the genes that cause a disease helps detect it at an early stage, e.g. Huntington's disease
  - Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
  - Death in 10-15 years
  - The gene responsible for the disease has been identified
  - It contains excessively repeated sections of CAG
  - So once the genes have been analyzed, the couple can be counseled

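Counting repeated CAG sections is a simple string task. The Python sketch below reports the longest unbroken run of CAG in a sequence. It deliberately sets no diagnostic threshold, the sequences are toy examples, and the function name is illustrative.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Return the number of CAG units in the longest unbroken CAG...CAG run."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


print(longest_cag_run("TTCAGCAGCAGTTCAGAA"))
print(longest_cag_run("TTTT"))
```

Comparing the count against clinically established ranges is what turns the number into a diagnosis; that step is left out here.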
Drug Design
- Can take up to 15 years and $700 million
- One of the goals of bioinformatics is to reduce the time and cost involved
- The process
  - Discovery
    - Computational methods can improve this
  - Testing

Discovery

Target identification
- Identifying the molecule on which the germ relies for its survival
- Then we develop another molecule, i.e. a drug, which will bind to the target
- So the germ will not be able to interact with the target
- Proteins are the most common targets

Discovery…
- For example, HIV produces HIV protease, a protein that in turn cuts up other proteins
- HIV protease has an active site where it binds to other molecules
- So an HIV drug will go and bind to that active site
- Easier said than done!

Discovery…
- Lead compounds are the molecules that go and bind to the target protein's active site
- Traditionally, finding them has been a matter of trial and error
- Now this is being moved into the realm of computers

Related Computer Technology………….

PERL
- Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings
- The default CGI language
- It started out as a scripting language but has become a fully fledged language
- It has everything now, even web service support
- http://bio.perl.org

Data bases and Mining
- Many of the sequence databases are publicly available
- As databases are involved, various data mining techniques are used to pull the data out
- As there is a lot of literature (articles etc.) in this area, data mining on the literature, as opposed to the sequence data, has also become a PhD topic for many

European Molecular Biology Network (EMBnet)
- A central system for sharing, training and centralizing up-to-date bio info
- Some of the EMBnet sites are:
  - SEQNET
    - http://www.seqnet.dl.ac.uk
  - UCL
    - http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
  - EBI (European Bioinformatics Institute)
    - www.ebi.ac.uk

References
- Dan E. Krane and Michael L. Raymer, Basic Concepts of Bioinformatics
- Arthur M. Lesk, Intro to Bioinformatics
- T.K. Attwood & D.J. Parry-Smith, Intro to Bioinformatics
- Dr Patrick Dixon, The Genetic Revolution
- Prof David Gilbert's site, http://www.brc.dcs.gla.ac.uk/~drg/

Thank You!
