
“IN THE NAME OF ALLAH THE MOST BENEFICENT, THE MERCIFUL”.
SRI International
Bioinformatics
TOOLS OF BIOINFORMATICS
“The greatest tragedy of Science is the death of a beautiful idea by an ugly FACT!”
Huxley
House of Knowledge
Bioinformatics: putting A, T, C, G’s into a computer …

… in order to make sense of them and facilitate further experiments by inference, modeling and computer simulations.
What is bioinformatics?

“Roughly, bioinformatics describes any use of computers to handle biological information. In practice the definition used by most people is narrower; bioinformatics to them is a synonym for "computational molecular biology" -- the use of computers to characterize the molecular components of living things.”

“The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information.”
Bioinformatics vs. computational biology:

"Computational biology is not a "field", but an "approach" involving the use of computers to study biological processes and hence it is an area as diverse as biology itself."

Richard Durbin, Head of Informatics at the Wellcome Trust Sanger Institute:
"I do not think all biological computing is bioinformatics, e.g. mathematical modeling is not bioinformatics, even when connected with biology-related problems. In my opinion, bioinformatics has to do with management and the subsequent use of biological information, particular genetic information."
The overarching function of any bioinformatics tool is the comparison of one or more data entities.
“Pattern Matching”

This is the prescription for Discovery in the 21st Century
Related fields:

 Genomics (functional, structural)


 Proteomics
 Cheminformatics
 Pharmacogenomics
 Medical Informatics
Bioinformatics: genes, proteins and computers …

Bio-Polymer (alphabet): Process

DNA (A,T,G,C): replication, transcription
mRNA (U,A,C,G): splicing, translation
Proteins (20 a.a.): folding, interactions

Lipids, polysaccharides, membranes and signal transduction, environmental signals etc.
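
The DNA to mRNA to protein flow listed above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the input is treated as the coding strand, splicing is ignored, and the codon table is deliberately partial (a real tool needs all 64 codons). The function names and the toy sequence are hypothetical.

```python
# Minimal sketch of transcription and translation on toy input.
# Assumption: only a handful of codons are listed; unknown codons become "X".
CODON_TABLE = {
    "AUG": "M", "GUG": "V", "CUG": "L", "UCU": "S", "GAA": "E",
    "GGU": "G", "UGG": "W", "CAG": "Q",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGTGCTGTCTGAAGGTGAATGGCAGCTGTAA"  # made-up toy gene
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)
print(protein)
```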


The Big Picture

Bioinformatics is:
 driven by the generation of data,
 moderated by hardware and analysis methods

Computing power

Analysis methods

Data generation platforms


From genes to proteins: protein machinery of life

 Genes determine protein sequences
 Proteins are crucial agents in living organisms
 Understanding genes = understanding proteins with their structure and function
Significance of protein folding problem

[Figure: a sequence (VLSEGEWQLVLV...) folds into a 3D structure to perform a function; the figure is labeled with O2]
Deciphering protein structure and function

 Experiment (X-ray): months
 Atomistic (physical principles based) simulations: weeks
 Homology based modeling: hours
 Sequence similarity based annotations: seconds
Assigning fold and function utilizing similarity to experimentally characterized proteins:

Sequence similarity: BLAST and others

Beyond sequence similarity: matching sequences and shapes (threading)
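
As a rough illustration of what "sequence similarity" means computationally, the sketch below scores the best local alignment between two sequences with the Smith-Waterman dynamic programming recurrence. This is not how BLAST is implemented: BLAST uses fast heuristics to approximate this kind of local alignment. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Two toy sequences that share the local region "ACGT".
print(smith_waterman_score("GGGACGTCCC", "TTTACGTAAA"))
```

Keeping only two rows of the matrix is enough when just the score is needed; recovering the aligned residues would require storing the full matrix and tracing back.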
What Is a Biological Database?
A biological database is a large, organized body of persistent data
 usually associated with computerized software designed to
 update,
 query,
 and retrieve
components of the data stored within the system.
 May be a single file containing many records
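
A single file containing many records can be sketched with the widely used FASTA text format, assumed here for illustration: each record starts with a ">" header line followed by one or more sequence lines. The function name and the two toy records are hypothetical.

```python
from typing import Dict, Iterable

def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier (first word after '>') to its sequence."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            records[current] += line  # sequences may span several lines
    return records

example = """>seq1 hypothetical toy record
ACGTAC
GTTA
>seq2 another toy record
MVLSEGEWQL
"""
db = parse_fasta(example.splitlines())
print(db["seq1"])  # "query" the toy database by identifier
```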
Importance of bioinformatics databases:

 DNA and mRNA sequences, genes: GenBank (NCBI)
 Protein and nucleic acid structures: Protein Data Bank (PDB)
 Protein motifs: PROSITE
 Protein families: Pfam
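
To make the idea of a protein motif concrete, the sketch below translates a subset of the PROSITE pattern syntax into a Python regular expression and scans a toy sequence. It assumes only the basic elements (single residues, `[..]` allowed sets, `{..}` forbidden sets, `x` wildcards and `(n)` / `(n,m)` repeats); anchors and other features are not handled. The example pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern; the function name and test sequence are hypothetical.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as "x(2,4)" into its core and optional repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core.lower() == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        # "[ABC]" sets and single residues are already valid regex.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

glyco = prosite_to_regex("N-{P}-[ST]-{P}")
hits = [m.start() for m in re.finditer(glyco, "AANGSAANPTA")]
print(glyco, hits)
```

Note that `re.finditer` reports non-overlapping matches only, which is a simplification compared with a dedicated motif scanner.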
What are the Tools ?
Where are the Tools ?

• Commercial/Proprietary SW-Tools
• Public Domain Software
• Internet OnLine Resources
NAILS = DATA
HAMMER = BIOINFORMATICS
Bioinformatics Tools

 Internet, Google: a wide array of tools, mostly free and open source, now exists
 Do not reinvent the wheel!
 But do be wary: Bioinformatician != Programmer
 Most common tools are written in C, C++, Java or Perl; others in Fortran or Python
Bioinformatics Tool Using

 Analyze multiple genomes to track down a genetic disease
 Determine the structure of a protein to enable drug development
 Design ways to use proteins and signal processing to make DNA sequencing cheaper
Bioinformatics Tool Building

 Apply statistics, programming, and biochemistry to predict protein structure
 Create a genome browser to answer 125,000 questions a day
 Develop a signal processing algorithm to read DNA
Bioinformatic Tools

 Nucleotide Sequence Analysis Tools

Basic Local Alignment Search Tool (BLAST) - compares gene and protein sequences against others
Electronic PCR - searches a DNA sequence for sequence tagged sites (STSs)
-- compares the query sequence against data in NCBI's
Entrez Gene - encapsulates a wide range of information for a given gene and organism
Model Maker - builds a gene model from assembled genomic sequence
Protein Sequence Analysis Tools

 The Basic Local Alignment Search Tool (BLAST)
 BLink ("BLAST Link") - displays the results of BLAST searches
 CD Search - searches the Conserved Domain Database
 CDART - displays the functional domains of a protein query sequence
 TaxPlot - a tool for 3-way comparisons of genomes
Structural Analysis Tools

 Cn3D - a helper application to view 3-dimensional structures
 VAST Search - structure-structure similarity search service
 CD Search - Conserved Domain Database

Genome Analysis Tools

Entrez Genomes - shows whole genomes of over 1000 organisms
COGs - Clusters of Orthologous Groups; each COG consists of individual proteins or groups of paralogs
Map Viewer - integrated views of chromosome maps for many organisms
Molecular Modeling Tools

CHARMM (Chemistry at HARvard Macromolecular Mechanics)
is a molecular dynamics simulation package associated with the force field of the same name.

GROMACS (Groningen Machine for Chemical Simulations)
is a molecular dynamics simulation package developed at the University of Groningen. It is one of the fastest programs for molecular simulations to date. Its support for different force fields and its open source (GPL) license also make GROMACS very flexible.
Phylogenetic Analysis Tools

MERLIN (University of Michigan)
uses sparse trees to represent gene flow in pedigrees and is one of the fastest pedigree analysis packages.
phyml (CNRS, France)
is a phylogenetic program that uses Maximum Likelihood to build phylogenetic trees.
Multiple Sequence Alignment Tools

ClustalW(-mpi)
produces biologically meaningful multiple sequence
alignments of divergent sequences.
t-coffee (CNRS, France)
is a multiple sequence alignment package.
Sequence Analysis Tools

blast (NCBI)
fast similarity searches in biological sequence databases.

pftools (SIB, Switzerland)
search tools for sequence motifs (e.g. protein domains, protein families, promoter sites, …) in biological sequences using motif descriptors called profiles.

hmmer (Washington University School of Medicine)
search tool similar to the pftools package, but uses hidden Markov models (HMMs) as motif descriptors.
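
The general idea behind a motif descriptor can be sketched with a simple position-specific scoring matrix built from aligned example sites. This is a simplified stand-in, not the pftools profile format or the HMMER model: real profiles and HMMs also handle insertions and deletions. The uniform background, the pseudocount of 1 and the toy sites below are all assumptions made for the example.

```python
import math
from collections import Counter

def build_profile(aligned, alphabet="ACGT", pseudocount=1.0):
    """Per-column log-odds scores versus a uniform background."""
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*aligned):  # iterate over alignment columns
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            ch: math.log2(((counts[ch] + pseudocount) / total) / background)
            for ch in alphabet
        })
    return profile

def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best_score, start)."""
    width = len(profile)
    scored = [
        (sum(col[ch] for col, ch in zip(profile, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)

sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]  # made-up aligned sites
profile = build_profile(sites)
score, start = best_hit(profile, "GGCGTATAATGCC")
print(round(score, 2), start)
```

The pseudocount keeps residues that never appear in a column from receiving a score of minus infinity, so an imperfect match is penalized rather than excluded outright.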
MutDB (http://www.mutdb.org)

MutDB provides structural annotations for disease-associated mutations and single nucleotide polymorphisms (SNPs)
Structural Mutation Service

 Mutations on MutDB are mapped to protein structure
 Extension in Chimera queries MutDB

UCSF Chimera extension

PyMOL extension

Controller window identifies mapped mutation positions, which are highlighted structurally
Future work
Web services for identifying regions of structural similarity between a query protein and a database of protein structures

[Figure labels: Chimera, PyMOL, matplotlib]
Examples of Bioinformatics databases and tools
 Database interfaces
 Genbank/EMBL/DDBJ, Medline, SwissProt, PDB, …

 Sequence alignment
 BLAST, FASTA

 Multiple sequence alignment
 Clustal, MultAlin, DiAlign

 Gene finding
 Genscan, GenomeScan, GeneMark, GRAIL

 Protein Domain analysis and identification
 Pfam, BLOCKS, ProDom

 Pattern Identification/Characterization
 Gibbs Sampler, AlignACE, MEME

 Protein Folding prediction
 PredictProtein, SWISS-MODEL
Five websites that all biologists should know

 NCBI (The National Center for Biotechnology Information)
 http://www.ncbi.nlm.nih.gov/

 EBI (The European Bioinformatics Institute)
 http://www.ebi.ac.uk/

 The Canadian Bioinformatics Resource
 http://www.cbr.nrc.ca/

 SwissProt/ExPASy (Swiss Bioinformatics Resource)
 http://expasy.cbr.nrc.ca/sprot/

 PDB (The Protein Data Bank)
 http://www.rcsb.org/PDB/
NCBI (http://www.ncbi.nlm.nih.gov/)

 Entrez interface to databases


 Medline/OMIM

 Genbank/Genpept/Structures

 BLAST server(s)
 Five-plus flavors of blast

 Draft Human Genome


 Much, much more…
EBI (http://www.ebi.ac.uk/)

 SRS database interface


 EMBL, SwissProt, and many more

 Many server-based tools


 ClustalW, DALI, …
SwissProt (http://expasy.cbr.nrc.ca/sprot/)

 Curation!!!
 Error rate in the information is greatly reduced in comparison to most other databases.
 Extensive cross-linking to other data sources
 SwissProt is the ‘gold-standard’ by which other databases can be measured, and is the best place to start if you have a specific protein to investigate
A few more resources to be aware of
 Human Genome Working Draft
 http://genome.ucsc.edu/

 TIGR (The Institute for Genomic Research)
 http://www.tigr.org/

 Celera
 http://www.celera.com/

 (Model) Organism specific information:
 Yeast: http://genome-www.stanford.edu/Saccharomyces/
 Arabidopsis: http://www.tair.org/
 Mouse: http://www.jax.org/
 Fruitfly: http://www.fruitfly.org/
 Nematode: http://www.wormbase.org/

 Nucleic Acids Research Database Issue
 http://nar.oupjournals.org/ (First issue every year)
Sources for Tools
 http://sourceforge.net (51 projects)
 http://bioinformatics.org (156 projects)
 ftp://ftp.ncbi.nlm.nih.gov (NCBI)
 http://www.ebi.ac.uk/Tools/ (EMBL)
 http://www.blueprint.org (Blueprint Initiative)
 http://www.geocities.com/bioinformaticsweb/toollink.html (Suresh’s Links)
 http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm (Dutch course page)
Bioinformatics: The future

 More complete genomes
 Phylogenetics

 Functional genomics
 Annotation, experimental design, integration

 Pathways
 Current DBs incomplete
 Data model?

 Processes
 How to model?

 Systems biology; towards prediction


What are the issues?

 Market definition
 “Bioinformatics” is poorly defined/segmented

 Commodity pricing
 Customers are conditioned to use “point & click” black boxes

 Value is disguised

 Service mentality
 Technology seen as subservient to wet lab data
Bioinformatics In Pakistan

 A relatively new concept in Pakistan
 Little work is being done at the academic or research level
 Awareness is increasing among our biotechnologists in this regard
 There is a long way to go


Morals of the story…
Bioinformatics offers many challenging tasks, e.g.

•Research on novel, scalable, high-performance segmentation of high-dimensional, high-volume feature spaces

•Development and evaluation of novel high-performance techniques for data mining

•Research on novel scalable data(base) structures for efficient querying, analysis and mining of high-volume data sets
