Bio Informatics

“IN THE NAME OF
ALLAH THE MOST

BENEFICENT, THE
MERCIFUL”.
SRI International
Bioinformatics
TOOLS OF
BIOINFORMATICS
“The greatest tragedy of Science is the
death of a beautiful idea by an ugly
FACT!”
Huxley
House of Knowledge
Bioinformatics: putting A, T, C, G's into a
computer …
in order to make sense of them and
facilitate further experiments by inference,
modeling and computer simulations.
What is bioinformatics?
“Roughly, bioinformatics describes any use of
computers to handle biological information. In
practice the definition used by most people is
narrower; bioinformatics to them is a synonym
for "computational molecular biology" -- the use
of computers to characterize the molecular
components of living things.”
“The mathematical, statistical and computing
methods that aim to solve biological problems
using DNA and amino acid sequences and related
information.”
Bioinformatics vs.
computational biology:
"Computational biology is not a "field", but an "approach"
involving the use of computers to study biological
processes and hence it is an area as diverse as biology
itself."
Richard Durbin, Head of Informatics at the Wellcome Trust
Sanger Institute:
"I do not think all biological computing is bioinformatics,
e.g. mathematical modeling is not bioinformatics, even
when connected with biology-related problems. In my
opinion, bioinformatics has to do with management and the
subsequent use of biological information, in particular genetic
information."
The overarching function of any bioinformatics
tool is the comparison of one or more data entities.
“Pattern Matching”
This is the prescription for Discovery in the
21st Century
Related fields:
 Genomics (functional, structural)

 Proteomics
 Cheminformatics
 Pharmacogenomics
 Medical Informatics
Bioinformatics: genes, proteins and computers …
Bio-Polymer (alphabet)    Process
DNA (A,T,G,C)             replication, transcription
mRNA (U,A,C,G)            splicing, translation
Proteins (20 a.a.)        folding, interactions
Lipids, polysaccharides, membranes and signal transduction, environmental signals etc.
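The DNA to mRNA to protein flow listed above can be illustrated with a minimal sketch. It assumes the input is the coding strand, ignores splicing, and uses only a small subset of the standard codon table chosen for this example; the function names are hypothetical and not part of any tool mentioned in these slides.

```python
# Minimal sketch of transcription and translation.
# Assumptions: input is the coding strand, no introns, and only the
# codons needed for the demo are listed (subset of the standard code).
CODON_TABLE = {
    "AUG": "M", "GUG": "V", "CUG": "L", "UCU": "S", "GAG": "E",
    "GGU": "G", "UGG": "W", "CAG": "Q",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon until a stop codon is reached.

    Codons missing from the partial table are reported as 'X'.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGTGCTGTCTGAGGGTGAGTGGCAGTAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)
print(protein)
```

A real pipeline would use a complete codon table (for example from an established library) rather than this hand-written subset.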

The Big Picture
Bioinformatics is:
 driven by the generation of data,
 moderated by hardware and analysis methods
Computing power
Analysis methods
Data generation platforms

From genes to proteins: protein
machinery of life
 Genes determine protein sequences
 Proteins are crucial agents in living organisms
 Understanding genes = understanding

proteins with their structure and function
Significance of protein folding problem
[Figure: sequence → structure → function. The amino acid sequence (V L S E G E W Q L V L V ...) folds into a 3D structure to perform a function; figure label: O2]
Deciphering protein structure and function
 Experiment (X-ray): months
 Atomistic (physical principles based) simulations: weeks
 Homology based modeling: hours
 Sequence similarity based annotations: seconds
Assigning fold and function utilizing similarity to
experimentally characterized proteins:
Sequence similarity: BLAST and others
Beyond sequence similarity: matching sequences and shapes (threading)
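To make "sequence similarity" concrete, here is a small sketch of local alignment scoring by dynamic programming (the Smith-Waterman recurrence). This is not how BLAST is implemented: BLAST uses a faster heuristic, whereas this sketch computes the exact local score. The scoring values (match, mismatch, linear gap penalty) are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Uses the Smith-Waterman recurrence with a linear gap penalty and
    keeps only two rows of the score matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may restart anywhere.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# A query fragment fully contained in the longer sequence scores 5 matches.
print(local_alignment_score("VLSEGEWQLV", "SEGEW"))
```

Real searches score protein pairs with substitution matrices such as BLOSUM rather than a single match/mismatch value.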
What Is a Biological Database?
A biological database is a large, organized body of
persistent data
 usually associated with computerized software
designed to
 update,
 query,
 and retrieve
components of the data stored within the system.

 May be a single file containing many records
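As a sketch of the "single file containing many records" case, the example below reads FASTA-formatted text into a dictionary and runs a simple query over it. The record names and sequences are invented for illustration, and the parser handles only the basic FASTA layout (a ">" header line followed by sequence lines).

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # The identifier is the first word of the header line.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical two-record "database" held in memory for the demo.
EXAMPLE = """>seq1 hypothetical record
ACGTAC
GTTA
>seq2 another hypothetical record
TTGACA
"""

records = dict(read_fasta(StringIO(EXAMPLE)))
# Query: which records contain the subsequence GACA?
hits = [record_id for record_id, seq in records.items() if "GACA" in seq]
print(records)
print(hits)
```

Production databases add indexing, updates and cross-references on top of this basic record structure.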
Importance of bioinformatics databases:
 DNA and mRNA sequences, genes:
GenBank (NCBI)
 Protein and nucleic acid structures:
Protein Data Bank (PDB)
 Protein motifs: PROSITE
 Protein families: PFAM
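Motif databases such as PROSITE describe many motifs as short patterns. The sketch below converts a common subset of the PROSITE pattern syntax into a Python regular expression and scans a sequence with it. It is an illustrative converter written for this example, not PROSITE's own software, and the test sequence is invented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regular expression.

    Supported subset: residues, x (any residue), [ABC] (one of),
    {ABC} (none of), repeats such as x(2) or x(2,4), and the
    < / > anchors at the ends of the pattern.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Separate an optional repeat count, e.g. "x(2,4)" -> "x", "2,4".
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        # "[...]" groups and single residue letters are already valid regex.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)


# N-glycosylation site pattern, written in PROSITE syntax.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPSL"  # invented sequence for the demo
starts = [m.start() for m in re.finditer(regex, sequence)]
print(regex, starts)
```

Note that `re.finditer` reports non-overlapping matches only; a full scanner would also need to handle overlapping sites.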
What are the Tools ?
Where are the Tools ?
• Commercial/Proprietary SW-Tools
• Public Domain Software
• Internet OnLine Resources
NAILS
=
DATA
HAMMER
=
BIOINFORMATICS
Bioinformatics Tools
 Internet, Google – a wide array of tools, mostly free
and open source, now exists for use
 Do not reinvent the wheel!
 But do be wary: Bioinformatician != Programmer
 Most common tools are written in C, C++, Java or
PERL; others in Fortran, Python
Bioinformatics Tool Using
 Analyze multiple genomes to track down a genetic

disease
 Determine the structure of a protein to enable
drug development
 Design ways to use proteins and signal
processing to make DNA sequencing cheaper
Bioinformatics Tool Building
 Apply statistics, programming, and biochemistry

to predict protein structure
 Create a genome browser to answer 125,000
questions a day
 Develop a signal processing algorithm to read
DNA
Bioinformatic Tools
 Nucleotide Sequence Analysis Tools

Basic Local Alignment Search Tool (BLAST) for
comparing gene and protein sequences against
others
Electronic PCR (e-PCR) searches a DNA sequence for
sequence tagged sites (STSs)
-- compares the query sequence against data in
NCBI's
Entrez Gene encapsulates a wide range of
information for a given gene and organism
Model Maker gives assembled genomic sequence
to build a gene model
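The idea behind the Electronic PCR entry above can be sketched as follows: an STS is reported where the forward primer and the reverse complement of the reverse primer both occur, in that order, at roughly the expected product size. This is a simplified illustration with exact matching only; the function names, primers and template are invented, and it is not the NCBI implementation.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_sts(template: str, forward: str, reverse: str,
             expected_size: int, margin: int = 50):
    """Return (start, product_size) for each primer pair hit on the template.

    A hit requires the forward primer, followed downstream by the reverse
    complement of the reverse primer, with a product size within `margin`
    of `expected_size`. Matching is exact (no mismatches allowed).
    """
    template, forward = template.upper(), forward.upper()
    reverse_site = reverse_complement(reverse)
    hits = []
    start = template.find(forward)
    while start != -1:
        end = template.find(reverse_site, start + len(forward))
        while end != -1:
            size = end + len(reverse_site) - start
            if abs(size - expected_size) <= margin:
                hits.append((start, size))
            end = template.find(reverse_site, end + 1)
        start = template.find(forward, start + 1)
    return hits


# Invented template: forward primer site, 8-base spacer, reverse primer site.
template = "TT" + "ACGTAC" + "GGGGGGGG" + "TTCAGA" + "AA"
sts_hits = find_sts(template, "ACGTAC", "TCTGAA", expected_size=20, margin=2)
print(sts_hits)
```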
Protein Sequence Analysis Tools
 The Basic Local Alignment Search Tool (BLAST)
 BLink - ("BLAST Link") displays the results of BLAST

searches
 CD Search - searches the Conserved Domain Database
 CDART displays the functional domains of a protein query

sequence
 TaxPlot - a tool for 3-way comparisons of genomes

Structural Analysis Tools
 Cn3D - a helper application to view 3-dimensional

structures
 VAST Search structure-structure similarity search
service
 CD Search - Conserved Domain Database
Genome Analysis Tools

Entrez Genomes – shows whole genomes of over
1000 organisms
COGs - Clusters of Orthologous Groups - consists
of individual proteins or groups of paralogs
Map Viewer - integrated views of chromosome maps
for many organisms
Molecular Modeling Tools
CHARMM (Chemistry at HARvard Macromolecular Mechanics)
is a molecular dynamics simulation package associated
with the force field of the same name.
GROMACS (Groningen Machine for Chemical Simulations)
is a molecular dynamics simulation package developed at
the University of Groningen. It is one of the fastest
programs for molecular simulations to date. In addition, its
support for different force fields and its open-source (GPL)
license make GROMACS very flexible.
Phylogenetic Analysis Tools
MERLIN (University of Michigan)

uses sparse trees to represent gene flow in pedigrees and is
one of the fastest pedigree analysis packages.
phyml (CNRS, France)
is a phylogenetic program that uses Maximum Likelihood
to build phylogenetic trees.
Multiple Sequence Alignments Tools
ClustalW(-mpi)
produces biologically meaningful multiple sequence
alignments of divergent sequences.
t-coffee (CNRS, France)
is a multiple sequence alignment package.
Sequence Analysis Tools
blast (NCBI)
fast similarity searches in biological sequence databases.
pftools (SIB, Switzerland)

search tools for sequence motifs (i.e. protein domains,
protein families, promoter sites, …) in biological sequences
using motif descriptors called profiles.
hmmer (Washington University School of Medicine)

search tool similar to the pftools package, but uses HMM’s
as motif descriptors.
MutDB (http://www.mutdb.org)
MutDB provides structural annotations for disease-associated
mutations and single nucleotide polymorphisms (SNPs)
Structural Mutation Service
 Mutations on MutDB are mapped to protein structure
 Extension in Chimera queries MutDB
UCSF Chimera extension

PyMol Extension
Controller window
identifies mapped
mutation positions
which are highlighted
structurally
Future work
Web services for identifying regions of structural similarity
between a query protein and a database of protein structures
Chimera PyMOL
matplotlib
Examples of Bioinformatics databases
 Database interfaces
 Genbank/EMBL/DDBJ, Medline, SwissProt, PDB, …
 Sequence alignment
 BLAST, FASTA
 Multiple sequence alignment

 Clustal, MultAlin, DiAlign
 Gene finding
 Genscan, GenomeScan, GeneMark, GRAIL
 Protein Domain analysis and identification

 pfam, BLOCKS, ProDom,
 Pattern Identification/Characterization
 Gibbs Sampler, AlignACE, MEME
 Protein Folding prediction

 PredictProtein, SWISS-MODEL
Five websites that all biologists should know
 NCBI (The National Center for Biotechnology Information)

 http://www.ncbi.nlm.nih.gov/
 EBI (The European Bioinformatics Institute)

 http://www.ebi.ac.uk/
 The Canadian Bioinformatics Resource

 http://www.cbr.nrc.ca/
 SwissProt/ExPASy (Swiss Bioinformatics Resource)

 http://expasy.cbr.nrc.ca/sprot/
 PDB (The Protein Data Bank)

 http://www.rcsb.org/PDB/
NCBI (http://www.ncbi.nlm.nih.gov/)
 Entrez interface to databases

 Medline/OMIM
 Genbank/Genpept/Structures
 BLAST server(s)
 Five-plus flavors of blast
 Draft Human Genome

 Much, much more…
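The Entrez databases can also be queried programmatically. The sketch below only builds a search URL for NCBI's E-utilities `esearch` endpoint and does not contact the network; the E-utilities base URL and the example query are additions for illustration and do not come from the slides.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities base URL (not given in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an Entrez esearch URL for a database and a query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example query (illustrative): human myoglobin entries in the protein database.
url = esearch_url("protein", "myoglobin AND human[Organism]", retmax=5)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return matching record identifiers; NCBI's usage guidelines on request rates apply.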
EBI (http://www.ebi.ac.uk/)
 SRS database interface

 EMBL, SwissProt, and many more
 Many server-based tools

 ClustalW, DALI, …
SwissProt (http://expasy.cbr.nrc.ca/sprot/)
 Curation!!!
 Error rate in the information is greatly reduced in comparison
to most other databases.
 Extensive cross-linking to other data sources
 SwissProt is the ‘gold-standard’ by which other
databases can be measured, and is the best place
to start if you have a specific protein to investigate
A few more resources to be aware of
 Human Genome Working Draft
 http://genome.ucsc.edu/
 TIGR (The Institute for Genomic Research)

 http://www.tigr.org/
 Celera
 http://www.celera.com/
 (Model) Organism specific information:

 Yeast: http://genome-www.stanford.edu/Saccharomyces/
  Arabidopsis: http://www.tair.org/
 Mouse: http://www.jax.org/
 Fruitfly: http://www.fruitfly.org/
 Nematode: http://www.wormbase.org/
 Nucleic Acids Research Database Issue

 http://nar.oupjournals.org/ (First issue every year)
Sources for Tools
 http://sourceforge.net (51 projects)
 http://bioinformatics.org (156 projects)
 ftp://ftp.ncbi.nlm.nih.gov (NCBI)
 http://www.ebi.ac.uk/Tools/ (EMBL)
 http://www.blueprint.org (Blueprint Initiative)
 http://www.geocities.com/bioinformaticsweb/
toollink.html (Suresh’s Links)
 http://www.agr.kuleuven.ac.be/vakken/i287/
bioinformatica.htm (Dutch course page)
Bioinformatics: The future
 More complete genomes

 Phylogenetics
 Functional genomics
 Annotation, experimental design, integration
 Pathways
 Current DBs incomplete
 Data model?
 Processes
 How to model?
 Systems biology; towards prediction

What are the issues?
 Market definition
 “Bioinformatics” is poorly defined/segmented
 Commodity pricing
 Customers are conditioned to use “point & click” black boxes
 Value is disguised
 Service mentality
 Technology seen as subservient to wet lab data
Bioinformatics In Pakistan
 A relatively new concept in Pakistan
 Less work is being done at the academic as well as
the research level
 Awareness is increasing among our
biotechnologists in this regard
 There is a long way to go

Morals of the story…
Bioinformatics offers many challenging tasks, e.g.
• Research on novel scalable high performance
segmentation of high dimensional and high volume
feature spaces
• Development and evaluation of novel high
performance techniques for data mining
• Research on novel scalable data(base) structures
for efficient data querying, analysis and mining of
high volume data sets
