
Introduction to different resources of bioinformatics and application


Bioinformatics Resources and Tools:

A database is a structured collection of records stored in a computer system. Genomic
databases typically store DNA or protein sequences as well as annotated information
about those sequences.

Many databases also provide bioinformatics tools, such as BLAST, for finding specific
sequences or annotations. There are hundreds of genomics databases: some are
comprehensive, but are not carefully curated (GenBank), while others are carefully
curated, but are narrow (FlyBase).
Bioinformatic tools are computer programs that analyze one or more sequences.

There is a dizzying array of bioinformatic tools: some analyze sequences to find protein
domains (Pfam), some search through databases of millions of sequences to find ones that
are similar (BLAST), and some find potential protein-coding regions (ORF-Finder).

Many are freely available over the web. It can be overwhelming to find and use bioinformatic tools
because you need to know
1) what type of analysis you want to perform
2) what type of tool to use
3) where to find the tool.
The Online Bioinformatics Resources Collection (OBRC) contains annotations and links for
2428 bioinformatics databases and software tools.

DNA Sequence Databases and Analysis Tools (460)

Enzymes and Pathways (241)

Gene Mutations, Genetic Variations and Diseases (247)

Genomics Databases and Analysis Tools (632)

Immunological Databases and Tools (48)

Microarray, SAGE, and other Gene Expression (164)


Organelle Databases (23)

Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and
others) (143)

Plant Databases (146)

Protein Sequence Databases and Analysis Tools (404)

Proteomics Resources (58)

RNA Databases and Analysis Tools (216)

Structure Databases and Analysis Tools (378)


Nucleotide Sequence Databases (the principal ones)

NCBI - National Center for Biotechnology Information.


NCBI houses sets of databases of sequences for everything under the sun.

EBI - European Bioinformatics Institute

DDBJ - DNA Data Bank of Japan


DDBJ is a public database of annotated nucleotide sequences. It also hosts the Japanese
Genotype-phenotype Archive (JGA), which holds personal genotype and phenotype data from
individuals who have signed consent agreements authorizing data release only for
specific research uses.

Database Searching by Sequence Similarity


BLAST @ NCBI
PSI-BLAST @ NCBI
FASTA @ EBI
BLAT @ UCSC - Jim Kent's BLAT stands out for its speed and for the integrated view it
gives of the results
Basic Local Alignment Search Tool

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between
sequences. The program compares nucleotide or protein sequences to sequence
databases and calculates the statistical significance of matches. BLAST can be used to
infer functional and evolutionary relationships between sequences as well as help identify
members of gene families.
BLAT@UCSC

BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25
bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect
sequence matches of 20 bases. BLAT on proteins finds sequences of 80% and greater similarity
of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein
BLAT on land vertebrates.

FASTA @ EBI

This tool provides sequence similarity searching against protein databases using the FASTA
suite of programs. FASTA provides a heuristic search with a protein query.

FASTX and FASTY translate a DNA query. Optimal searches are available with SSEARCH
(local), GGSEARCH (global) and GLSEARCH (global query, local database).
Issues to consider

► Dealing with gaps


► Do we want gaps in alignment?
► What are disadvantages of
► Many small gaps?
► Some big gaps?
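
One way to think about the two questions above is to compare gap-scoring schemes on toy data. The sketch below is illustrative only: the function names, the example strings and the gap-open / gap-extend values are assumptions and do not come from these slides. With a flat per-position penalty, four single gaps and one gap of length four cost the same. A scheme that charges extra to open a gap (often called an affine gap penalty) prefers the single long gap.

```python
def gap_runs(aligned):
    """Return the lengths of consecutive '-' runs in one aligned string."""
    runs, current = [], 0
    for ch in aligned:
        if ch == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def linear_cost(runs, gap=-1):
    """Flat penalty: every gap position costs the same."""
    return gap * sum(runs)


def affine_cost(runs, gap_open=-3, gap_extend=-1):
    """Assumed affine scheme: opening a gap costs more than extending it."""
    return sum(gap_open + gap_extend * (n - 1) for n in runs)


many_small = "A-C-G-T-A"  # hypothetical aligned string: four gaps of length 1
one_big = "ACGT----A"     # hypothetical aligned string: one gap of length 4

for name, aligned in (("many small", many_small), ("one big", one_big)):
    runs = gap_runs(aligned)
    print(name, runs, linear_cost(runs), affine_cost(runs))
```

Under the flat penalty both strings cost -4. Under the assumed affine values the four small gaps cost -12 and the single big gap costs -6.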
Summary
► Why are biological sequences similar to one
another?
► Knowledge of how and why sequences change
over time can help you interpret similarities and
differences between them
Why do similarity search?
► Similarity indicates conserved function
► Human and mouse genes are more than 80% similar at sequence level
► But these genes are small fraction of genome
► Most sequences in the genome are not recognizably similar
► Comparing sequences helps us understand function
► Locate similar gene in another species to understand your new gene
► Rosetta stone
BLAST

► Basic Local Alignment Search Tool


► Algorithm for comparing a given sequence against
sequences in a database
► A match between two sequences is an alignment
► Many BLAST databases and web services available
Example BLAST questions

► Which bacterial species have a protein that is related in
lineage to a protein whose amino-acid sequence is known?
► Where does the DNA I sequenced come from?
► What other genes encode proteins that exhibit structures
similar to the one I just determined?
Background: Identifying Similarity

► Algorithms to match sequences:


► Needleman-Wunsch
► Smith Waterman
► BLAST
Needleman-Wunsch

► Global alignment algorithm


► An example: align COELACANTH and PELICAN
► Scoring scheme: +1 if letters match, -1 for mismatches, -1 for gaps

COELACANTH COELACANTH
P-ELICAN-- -PELICAN--
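
The scoring scheme above can be checked with a few lines of Python. This is a minimal sketch; the function name score_alignment is an assumption made for illustration, and '-' is taken to mark a gap as in the two alignments shown.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of two equal-length aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # a letter paired with a gap
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


print(score_alignment("COELACANTH", "P-ELICAN--"))
print(score_alignment("COELACANTH", "-PELICAN--"))
```

Each alignment has 5 matches, 2 mismatches and 3 gap positions, so both score 0. The scheme alone does not prefer one over the other.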
Needleman-Wunsch Details
► Two-dimensional matrix
► Diagonal when two letters align
► Horizontal when letters paired to gaps

[Figure: two-dimensional alignment matrix with COELACANTH across the top and PELICAN
down the side, showing the path of the alignment]
Needleman-Wunsch
► In reality, each cell of matrix contains score and
pointer
► Score is derived from scoring scheme (-1 or +1 in our
example)
► Pointer is an arrow that points up, left, or diagonal
► After initializing matrix, compute the score and arrow
for each cell
Two sequences will be aligned.

GAATTCAGTTA (sequence #1)


GGATCGA (sequence #2)

A simple scoring scheme will be used

S(i,j) = 1 if the residue at position i of sequence #1 is the same as
the residue at position j of sequence #2 (called the match score)

S(i,j) = 0 otherwise (the mismatch score)

w = gap penalty
To study the algorithm, consider the two given sequences.

CGTGAATTCAT (sequence #1), GACTTAC (sequence #2)

The lengths (counts of nucleotides or amino acids) of sequence 1 and sequence 2 are
11 and 7 respectively. The initial matrix is created with A+1 columns and B+1 rows (where
A and B correspond to the lengths of the sequences). The extra row and column are added
at the start of the matrix so that residues can be aligned with a gap, as shown in Figure 1.
After creating the initial matrix, a scoring scheme has to be introduced, which can be user defined
with specific scores. In a simple scoring scheme, if the two residues (nucleotide or amino acid)
at the ith and jth positions are the same, the match score is 1 (S(i,j) = 1); if they are not
the same, the mismatch score is -1 (S(i,j) = -1). The gap score (w), or gap penalty, is
assumed to be -1.

*Note: The match, mismatch and gap scores can be user defined, provided the gap penalty
is negative or zero.

The gap score is the penalty given to an alignment when it contains an insertion or deletion.

The dynamic programming matrix is built in three steps.

1. Initialization of the matrix with the scores possible.
2. Matrix filling with maximum scores.
3. Trace back the residues for appropriate alignment.
Initialization Step
 
This example assumes a gap penalty. Without a gap penalty, the first row and first column of
the matrix would simply be filled with 0. With a gap penalty, each cell in the first row or
column is the previous cell of that row or column plus the gap score (Figure 2).
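
The initialization step can be sketched in Python as follows. This is an illustrative sketch, not the code behind the figures: the function name init_matrix and the choice to put sequence #1 along the columns and sequence #2 along the rows are assumptions.

```python
def init_matrix(seq1, seq2, gap=-1):
    """Build a (len(seq2)+1) x (len(seq1)+1) matrix with gap-filled borders."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        # first row: previous cell in the row plus the gap score
        matrix[0][j] = matrix[0][j - 1] + gap
    for i in range(1, rows):
        # first column: previous cell in the column plus the gap score
        matrix[i][0] = matrix[i - 1][0] + gap
    return matrix


matrix = init_matrix("CGTGAATTCAT", "GACTTAC")
print(len(matrix), "rows x", len(matrix[0]), "columns")
print(matrix[0])
print([row[0] for row in matrix])
```

For the example sequences this gives 8 rows and 12 columns, with the first row running from 0 down to -11 and the first column from 0 down to -7.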
Matrix Fill Step
 
The second and crucial step of the algorithm is matrix filling, starting from the upper left-hand
corner of the matrix. To find the maximum score of each cell, it is required to know the
neighbouring scores (diagonal, left and above) of the current position. Add the match or
mismatch score to the diagonal value, and add the gap score to each of the other two
neighbouring values. This gives three candidate values; take the maximum among them and
fill position (i,j) with that score.

In terms of matrix positions, the three candidates are [M(i-1,j-1)+S(i,j), M(i,j-1)+w, M(i-1,j)+w]

Overall, the equation can be written in the following manner

M(i,j) = max[ M(i-1,j-1)+S(i,j), M(i,j-1)+w, M(i-1,j)+w ]

To score the current position (the first position, M(1,1)), the formula stated above can be
used. The first residues (nucleotides or amino acids) of the two sequences are ‘C’ and ‘G’.
Since they do not match, S(1,1) = -1, and the three candidates are M(0,0)+S(1,1) = -1,
M(0,1)+w = -2 and M(1,0)+w = -2, so the maximum is -1. This score is placed in position
(1,1) of the scoring matrix. Using the same equation and method, fill all the remaining rows
and columns. Place back pointers from each cell to the cell from which its maximum score
was obtained, that is, to its predecessor (Figure 3).
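
The rule for a single cell can be written as a small helper. This is a sketch under stated assumptions: the name cell_score and the pointer labels "diag", "up" and "left" are chosen here for illustration, and the gap score w defaults to the -1 used in this example.

```python
def cell_score(diag, up, left, s, w=-1):
    """Return the best score for one cell and the pointer(s) that achieve it.

    diag, up and left are the already-filled neighbouring cells,
    s is the match or mismatch score S(i,j), and w is the gap score.
    """
    candidates = {
        "diag": diag + s,  # M(i-1,j-1) + S(i,j)
        "up": up + w,      # M(i-1,j) + w
        "left": left + w,  # M(i,j-1) + w
    }
    best = max(candidates.values())
    pointers = [name for name, value in candidates.items() if value == best]
    return best, pointers


# First cell M(1,1): M(0,0) = 0, M(0,1) = -1, M(1,0) = -1, mismatch S(1,1) = -1
print(cell_score(0, -1, -1, s=-1))
```

More than one pointer is returned when candidates tie, which is how several equally good alignments arise during trace back.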
Trace back Step
 
The final step in the algorithm is the trace back for the best alignment. In the above
mentioned example, one can see the bottom right hand corner score as -1. The
important point to be noted here is that there may be two or more alignments possible
between the two example sequences.
The current cell with value -1 has its immediate predecessor on the diagonal, where the
maximum score was obtained; its value is 0. If two or more pointers lead back from a
cell, there are two or more possible alignments.

Continuing the trace back in this way, one eventually reaches the 0th row, 0th column.
Following these steps yields the alignment of the two sample sequences. The best of the
candidate alignments can be identified as the one with the maximum alignment score under
a scoring scheme that may be user defined (for example match = 5, mismatch = -1, gap = -2)
(Figure 4).
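
The three steps can be put together in one short function. This is a minimal sketch, not the implementation behind the figures: the function name, the matrix orientation (sequence #1 along the columns) and the tie-breaking order in the trace back (diagonal first, then left, then up) are assumptions. It returns only one of the possibly several best alignments.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Globally align seq1 and seq2; return (score, aligned1, aligned2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1

    # Step 1: initialization with cumulative gap scores on the borders
    M = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        M[0][j] = j * gap
    for i in range(1, rows):
        M[i][0] = i * gap

    # Step 2: matrix fill, taking the maximum of the three candidates
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)

    # Step 3: trace back from the bottom-right corner to M[0][0]
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            top.append(seq1[j - 1])
            bottom.append(seq2[i - 1])
            i -= 1
            j -= 1
        elif j > 0 and M[i][j] == M[i][j - 1] + gap:
            top.append(seq1[j - 1])   # residue of seq1 paired with a gap
            bottom.append("-")
            j -= 1
        else:
            top.append("-")           # residue of seq2 paired with a gap
            bottom.append(seq2[i - 1])
            i -= 1
    return M[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, aligned1, aligned2 = needleman_wunsch("CGTGAATTCAT", "GACTTAC")
print(score)
print(aligned1)
print(aligned2)
```

With match = 1, mismatch = -1 and gap = -1 the score is -1, the bottom-right value described above. Passing other values (for example match=5, mismatch=-1, gap=-2) rescores the alignment under a different user-defined scheme.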
Sequence Alignment

USC Sequence Alignment Server - align 2 sequences with all
possible varieties of dynamic programming

T-COFFEE - multiple sequence alignment

ClustalW @ EBI - multiple sequence alignment

MSA 2.1 - optimal multiple sequence alignment using the
Carrillo-Lipman method

BOXSHADE - pretty printing and shading of multiple alignments


Splign - Splign is a utility for computing cDNA-to-Genomic, or spliced sequence
alignments. At the heart of the program is a global alignment algorithm that specifically
accounts for introns and splice signals.

Spidey - an mRNA-to-genomic alignment program

Wise2 - align a protein or profile HMM against genomic sequence to predict a gene
structure, and related tools

PipMaker - computes alignments of similar regions in two (long) DNA sequences

VISTA - align + detect conserved regions in long genomic sequences

myGodzilla - align a sequence to its ortholog in the human genome


Sequence Alignment

Clustal Omega @ EBI


Clustal Omega is a multiple sequence alignment program for proteins. It produces biologically meaningful
multiple sequence alignments of divergent sequences. Evolutionary relationships can be seen via viewing
Cladograms or Phylograms.

PipMaker

PipMaker computes alignments of similar regions in two DNA sequences. The resulting alignments are
summarized with a "percent identity plot", or "pip" for short. MultiPipMaker allows the user to see
relationships among more than two sequences. All pairwise alignments with the first sequence are
computed and then returned as interleaved pips. Moreover, MultiPipMaker can be requested to compute a
true multiple alignment of the input sequences and return a nucleotide-level view of the results.

T-COFFEE@EBI

T-Coffee is a multiple sequence alignment program. Its main characteristic is that it will allow you to
combine results obtained with several alignment methods.
Motif and Pattern Search in Sequences

Gibbs Motif Sampler - identification of conserved motifs in DNA or protein sequences

AlignACE Homepage - gene regulatory motif finding

MEME  - motif discovery and search in protein and DNA sequences

SAM - tools for creating and using Hidden Markov Models

Pratt - discover patterns in unaligned protein sequences

Motivated Proteins - a web facility for exploring small hydrogen-bonded motifs


Gibbs Motif Sampler

The Gibbs Motif Sampler will allow you to identify motifs, conserved regions, in DNA or
protein sequences.

MEME

Multiple EM for Motif Elicitation (MEME) performs motif discovery on DNA, RNA or protein
datasets and discovers novel, ungapped motifs (recurring, fixed-length patterns) in your
sequences. MEME splits variable-length patterns into two or more separate motifs.

Pratt
The Pratt program is able to discover patterns conserved in sets of unaligned protein
sequences.
Protein Sequence Databases
SWISS-PROT & TrEMBL - Protein sequence database and computer annotated supplement

UniProt - UniProt (Universal Protein Resource) is the world's most comprehensive catalog
of information on proteins. It is a central repository of protein sequence and function
created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

PIR - Protein Information Resource

MIPS - Munich Information centre for Protein Sequences

HUPO - HUman Proteome Organization


Human Genome Databases

•Draft Human Genome @ NCBI

•Draft Human Genome @ UCSC

•Ensembl - automatically annotated human genome.

•GDB - Genome Database

•Mammalian Gene Collection - full-length (open reading frame) sequences
for human and mouse

•STACK - Sequence Tag Alignment and Consensus Knowledgebase

•GeneCards - human genes, proteins and diseases


Databases of other Organisms

GOLD - Genomes OnLine Database, information on complete and ongoing genome projects

TIGR Comprehensive Microbial Resource

TIGR Microbial Database

The Proteome Databases - yeast, worm, & human, good annotation

Saccharomyces Genome Database


WormBase - C. elegans

FlyBase

Berkeley Drosophila Genome Project

Mouse Genome Informatics

The Arabidopsis Information Resource

ZFIN - Zebrafish Information Network

DictyBase - Dictyostelium discoideum

EcoGene - E. coli

HIV sequence database


Genome-wide Analysis

MBGD - comparative analysis of completely sequenced microbial genomes

COGs - phylogenetic classification of orthologous proteins from complete genomes

STRING - detect whether a given query gene occurs repeatedly with certain other genes in potential
operons

Pedant - automatic whole genome annotation

GeneCensus - various whole genome comparisons


Protein Domains: Databases and Search Tools
InterPro - integration of Pfam, PRINTS, PROSITE, SWISS-PROT + TrEMBL

PROSITE - database of protein families and domains

Pfam - alignments and hidden Markov models covering many common protein domains

SMART - analysis of domains in proteins

ProDom - protein domain database

PRINTS Database - groups of conserved motifs used to characterise protein families

Blocks - multiply aligned ungapped segments corresponding to the most highly conserved
regions of proteins

Protein Domain Profile Analysis @ BMERC - search a library of profiles with a protein
sequence

TIGRFAMs - yet more protein families based on Hidden Markov Models


Protein 3D Structure

PDB - protein 3D structure database

RasMol / Protein Explorer - molecule 3D structure viewers

SCOP - Structural Classification Of Proteins

UCL BSM CATH classification

The DALI Domain Database

FSSP - fold classification based on structure-structure alignment of proteins

SWISS-MODEL - homology modeling server


Structure Prediction Meta-server

K2 - protein structure alignment

DALI - 3D structure alignment server

DSSP - defines secondary structure and solvent exposure from 3D coordinates

HSSP Database - Homology-derived Secondary Structure of Proteins

PredictProtein & PHD - predict secondary structure, solvent accessibility, transmembrane helices,
and other features

Jpred2 - protein secondary structure prediction

PSIpred (& MEMSAT & GenTHREADER) - protein secondary structure prediction (& transmembrane
helix prediction & tertiary structure prediction by threading)
Metabolic, Gene Regulatory & Signal Transduction Network Databases

KEGG - Kyoto Encyclopedia of Genes and Genomes

BioCarta

DAVID - Database for Annotation, Visualization and Integrated Discovery - a useful server
for annotating microarray and other genetic data.

STKE - Signal Transduction Knowledge Environment

BIND - Biomolecular Interaction Network Database

EcoCyc

WIT

PathGuide - a very useful collection of resources dealing primarily with pathways

SPAD - Signaling Pathway Database

CSNDB - Cell Signalling Networks Database

PathDB

Transpath

DIP - Database of Interacting Proteins

PFBP - Protein Function and Biochemical Networks

Alliance for Cellular Signalling


DDB (Diseases Database)

Web site
http://www.diseasesdatabase.com/

Purpose
Diseases Database serves as a way of classifying medical concepts along clinical axes, such as
cause/effect, risk factors, interactions, etc., rather than in hierarchies of anatomical, physiological,
or pathological systems.

Description
Diseases Database is a cross-referenced index of human disease, medications, symptoms, signs, and
abnormal investigation findings. The content focuses on internal medicine, inherited disease, clinical
biochemistry, and pharmacology. Its terms are referred to as "items."

Update Frequency
Diseases Database is updated regularly.
The Gene Disease Associations Database DisGeNET

DisGeNET is a comprehensive gene-disease association database that integrates associations from
several sources that cover different biomedical aspects of diseases.[25] In particular, it is focused on
the current knowledge of human genetic diseases including Mendelian, complex and environmental
diseases.

To assess the concept of modularity of human diseases, this database performs a systematic study
of the emergent properties of human gene-disease networks by means of network topology and
functional annotation analysis.[1]

The results indicate a highly shared genetic origin of human diseases and show that for most
diseases, including Mendelian, complex and environmental diseases, functional modules exist.

Moreover, a core set of biological pathways is found to be associated with most human diseases.
Obtaining similar results when studying clusters of diseases, the findings in this database suggest
that related diseases might arise due to dysfunction of common biological processes in the cell.

The network analysis of this integrated database points out that data integration is needed to obtain
a comprehensive view of the genetic landscape of human diseases and that the genetic origin of
complex diseases is much more common than expected.
DisGeNET gene-disease association ontology
The description of each association type in this ontology is:

#Therapeutic Association: The gene/protein has a therapeutic role in the amelioration of the disease.

#Biomarker Association: The gene/protein either plays a role in the etiology of the disease (e.g. participates in the
molecular mechanism that leads to disease) or is a biomarker for a disease.

#Genetic Variation Association: Used when a sequence variation (a mutation, a SNP) is associated with the disease phenotype,
but there is still no evidence to say that the variation causes the disease. In some cases the presence of the variants
increases the susceptibility to the disease. In general, the NCBI SNP identifiers are provided.

#Altered Expression Association: Alterations in the function of the protein by means of altered expression of the gene are
associated with the disease phenotype.

#Post-translational Modification Association: Alterations in the function of the protein by means of post-translational
modifications (methylation or phosphorylation of the protein) are associated with the disease phenotype. [1]
Fluorescent Protein

Fluorescent Protein Visualization


This web site provides an interactive graph of fluorescent protein properties

Photoswitchable Fluorescent Proteins


This web site provides an interactive graph of photoswitchable fluorescent protein properties

BD Fluorescence Spectrum Viewer


A Fluorescence Spectrum Viewer for Excitation and Emission Curves

ThermoFisher Fluorescence SpectraViewer


A Fluorescence SpectraViewer with options for different Fluorophores, Light Sources,
Excitation Filters and Emission Filters

Fluorescent Protein Plasmids & Resources


Addgene's plasmid repository contains a variety of fluorescent protein plasmids. Use this
guide to learn more about the many applications of fluorescent proteins (FPs) and to find the
plasmids that are available from Addgene's depositing scientists.
Gene Prediction

Genscan - eukaryotes

GeneMark

Genie - eukaryotes

GLIMMER - prokaryotes

tRNAscan-SE 1.1 - search for tRNA genes in genomic sequence

GFF (General Feature Format) Specification - a standard format for genomic sequence
annotation
Other Databases (Annotations, Ontologies, Consortia, etc.)

Entrez Gene - Gene provides a unified query environment for genes defined by sequence and/or
in NCBI's Map Viewer. You can query on names, symbols, accessions, publications, GO terms,
chromosome numbers, E.C. numbers, and many other attributes associated with genes and the
products they encode. Replaces LocusLink.

Cancer Genome Anatomy Project

HUGO's Human Gene Nomenclature

Gene Ontology Consortium -  a controlled vocabulary of eukaryotic gene roles

Open Biological Ontologies - an umbrella web address for well-structured controlled vocabularies
for shared use across different biological domains.

ACUTS - compilation of Ancient Conserved UnTranslated Sequences

UTR database

[...] database of databases, listing all the biological databases currently available on the internet.
ENZYME - enzyme nomenclature database

BRENDA - enzyme database

TC-DB - comprehensive classification of membrane transport proteins

The SNP Consortium


HGBASE - database of sequence variations in the human genome

MethDB - DNA methylation database

SpliceDB - canonical and non-canonical splice site sequences in mammalian genes

SpliceOme - database of intron-exon boundaries

InBase - intein database


The I.M.A.G.E. Consortium

The Kabat Database of Sequences of Proteins of Immunological Interest


Nelson Lab: Cytochrome C

REBASE - restriction enzyme database

Chemfinder.com - molecule database

Genomics Institute of the Novartis Research Foundation


Applications:

Drug Discovery
The idea of using X-ray crystallography in drug discovery arose more than 30 years ago, when the first
three-dimensional protein structures were determined. Within a decade, drug design began to change
radically by incorporating knowledge of 3D target protein structures into the design process.

Protein structure can influence drug discovery at every stage of the design process. It is traditionally
used in lead optimization, a process that uses a structure to guide the chemical modification of a lead
molecule in order to optimize shape, hydrogen bonds, and other non-covalent interactions with the
target.

Gene Therapy
Gene therapy is a new form of drug delivery in which the therapeutic agent is produced by the patient’s
own synthetic machinery. It involves the efficient introduction of functional genes into the appropriate
cells of the patient, so that enough of the protein encoded by the gene (the transgene) is produced to
correct the disorder accurately and permanently.

Personalized Medicine
Personalized medicine is a medical model that customizes health care, using genetic or other
information to tailor all decisions and practices to the individual patient.
Practical application beyond long-established considerations, such as a patient's family history,
social conditions, environment, and behavior, is so far very limited, and almost no progress has
been made in the last decade.

Personalized medicine research seeks to identify solutions on the basis of each individual’s
susceptibility profile. These are the areas where new approaches to diagnostics, drug development
and individualized therapy can be found.

Preventive Medicine
Preventive medicine, or preventive care, involves measures taken to prevent diseases (or injuries)
rather than to cure them or treat their symptoms. It contrasts in method with curative and palliative
medicine, and in scope with public health methods (which work at the level of population health
rather than individual health). Simple examples of preventive medicine are hand washing, nursing,
and immunization.

Microbial Genome Applications
In the field of microbial genomes, bioinformatics is applied in the following areas:

Waste Cleanup
Bioinformatics is used to identify bacteria and other microbes that are helpful in cleaning up waste.
The bacterium Deinococcus radiodurans can repair damaged DNA and small chromosome fragments
by isolating the damaged segments within a concentrated area. Deinococcus radiodurans is listed
in the Guinness Book of World Records.

Climate Change
Climate change is driven by factors such as oceanic processes (for example ocean circulation), variations in
the solar radiation received by the Earth, plate tectonics, volcanic eruptions, and human changes to the natural world.

By studying the genomes of microorganisms, researchers can understand these microbes at a very basic level and
isolate the genes that enable them to survive in extreme conditions. Rhodopseudomonas palustris is a phototrophic
purple non-sulfur bacterium commonly found in soil and water. It converts sunlight into cellular energy,
absorbing carbon dioxide from the atmosphere and converting it into biomass.

Biotechnology
The broad concept of “biotech” or “biotechnology” covers a range of procedures for modifying living organisms,
going back to animal domestication, plant cultivation, and their improvement through breeding programs that
use artificial selection and hybridization.

Modern uses also include genetic engineering and cell and tissue culture technologies. In biotechnology,
bioinformatics is used to identify organisms and microorganisms that may be useful to the dairy sector and
to food processors.

Lactococcus lactis, a non-pathogenic bacterium that is critical for the production of dairy foodstuffs
like buttermilk, yoghurt and cheese, is one of the most important microorganisms involved in the dairy sector.
This bacterium is also used in the preparation of pickled vegetables, beer, wine, some breads, and
sausages.

Researchers expect that understanding the physiology and genetic make-up of this bacterium will be
invaluable to food producers, as well as to the pharmaceutical industry, which is exploring the capacity
of L. lactis to serve as a vehicle for drug delivery.

Agriculture
In the field of Agriculture, applications of bioinformatics are in following areas:

Crop Improvement
Comparative genetics of plant genomes has shown that the organization of their genes has remained more
conserved over evolutionary time than was previously expected. These findings show that information from
model crop systems can be used to suggest improvements to other food crops.

Examples of the available complete plant genomes are Arabidopsis thaliana (thale cress) and
Oryza sativa (rice). Arabidopsis thaliana, the first plant to be sequenced, is considered a
model for the investigation of plant genetics and biology.

Many genes are similar across all plants, so the analysis of genes in a model organism such as
A. thaliana helps us to understand gene expression and function in all plants. In addition, because
both animals and plants are eukaryotes, many of the genes found in A. thaliana have animal homologs.

The main reason Arabidopsis was chosen as a model organism for genome sequencing is its small genome:
its DNA consists of approximately 140 million bases divided among five chromosomes, one of the
smallest genomes known among flowering plants.
