Many databases also provide bioinformatics tools, such as BLAST, for finding specific
sequences or annotations. There are hundreds of genomics databases: some are
comprehensive, but are not carefully curated (GenBank), while others are carefully
curated, but are narrow (FlyBase).
Bioinformatic tools are computer programs that analyze one or more sequences.
There is a dizzying array of bioinformatic tools: some analyze sequences to find protein
domains (Pfam), some search through databases of millions of sequences to find ones that
are similar (BLAST), and some find potential protein-coding regions (ORF-Finder).
Many are freely available over the web. It can be overwhelming to find and use bioinformatic tools
because you need to know
1) what type of analysis you want to perform
2) what type of tool to use
3) where to find the tool.
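To make the idea of a tool that finds potential protein-coding regions concrete, here is a minimal sketch of an open reading frame scan. It is not the ORF-Finder algorithm: it only looks at the three forward reading frames, assumes ATG as the only start codon, and uses a hypothetical function name and `min_codons` parameter chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop stretches.

    Only the forward strand is scanned. 'end' is exclusive and includes
    the stop codon. 'min_codons' counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    # Made-up toy sequence, not real data.
    print(find_orfs("CCATGAAATAGCC"))
```

A real tool would also scan the reverse complement and support alternative start codons and genetic codes.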
The Online Bioinformatics Resources Collection (OBRC) contains annotations and links for
2428 bioinformatics databases and software tools.
Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and
others) (143)
Plant Databases (146)
Proteomics Resources (58)
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between
sequences. The program compares nucleotide or protein sequences to sequence
databases and calculates the statistical significance of matches. BLAST can be used to
infer functional and evolutionary relationships between sequences as well as help identify
members of gene families.
BLAT@UCSC
BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25
bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect
sequence matches of 20 bases. BLAT on proteins finds sequences of 80% and greater similarity
of length 20 amino acids or more. In practice, DNA BLAT works well on primates, and protein
BLAT on land vertebrates.
FASTA @ EBI
This tool provides sequence similarity searching against protein databases using the FASTA
suite of programs. FASTA provides a heuristic search with a protein query.
FASTX and FASTY translate a DNA query. Optimal searches are available with SSEARCH
(local), GGSEARCH (global) and GLSEARCH (global query, local database)
Issues to consider
COELACANTH COELACANTH
P-ELICAN-- -PELICAN--
Needleman-Wunsch Details
► Two-dimensional matrix
► Diagonal when two letters align
► Horizontal when letters paired to gaps
[Figure: two-dimensional alignment matrix with COELACANTH along the top and PELICAN down the side]
Needleman-Wunsch
► In reality, each cell of matrix contains score and
pointer
► Score is derived from scoring scheme (-1 or +1 in our
example)
► Pointer is an arrow that points up, left, or diagonal
► After initializing matrix, compute the score and arrow
for each cell
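As an illustration of the scoring scheme, the two COELACANTH/PELICAN alignments shown earlier can be scored column by column. This is a minimal sketch: it assumes +1 for a match and -1 for a mismatch as in the example, and it additionally assumes that a gap is scored -1 like a mismatch. The function name is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-1):
    """Sum the column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # letter paired with a gap
        elif a == b:
            total += match        # identical letters
        else:
            total += mismatch     # different letters
    return total


if __name__ == "__main__":
    print(score_alignment("COELACANTH", "P-ELICAN--"))
    print(score_alignment("COELACANTH", "-PELICAN--"))
```

Under these assumptions both alignments contain five matches, two mismatches and three gaps, so they receive the same score. This is one reason why more than one best alignment can exist.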
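Two sequences will be aligned. Each cell of the matrix is scored as
M(i,j) = max[ M(i-1,j-1) + S(i,j), M(i,j-1) + w, M(i-1,j) + w ]
where S(i,j) is the match or mismatch score and w = gap penalty.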
To study the algorithm, consider the two given sequences.
CGTGAATTCAT (sequence #1) , GACTTAC (sequence #2)
The lengths (number of nucleotides or amino acids) of sequence 1 and sequence 2 are
11 and 7 respectively. The initial matrix is created with A+1 columns and B+1 rows (where
A and B are the lengths of the sequences). The extra row and column are added at the
start of the matrix so that residues can be aligned with a gap, as shown in Figure 1.
After creating the initial matrix, a scoring scheme has to be introduced; it can be user defined
with specific scores. A simple scoring scheme is: if the two residues (nucleotides or amino acids)
at the ith and jth positions are the same, the match score is 1 (S(i,j) = 1); if the two residues
at the ith and jth positions are not the same, the mismatch score is -1 (S(i,j) = -1).
The gap score (w), or gap penalty, is assumed to be -1.
*Note: The scores of match, mismatch and gap can be user defined, provided the gap penalty
should be negative or zero.
Gap score is defined as penalty given to alignment, when we have insertion or deletion.
To score the current position of the matrix (the first position, M(1,1)), the formula stated
above can be used. The first residues (nucleotides or amino acids) of the two sequences
are 'G' and 'C'. Since they are mismatching residues, S(1,1) = -1 and the score is -1.
The obtained score of -1 is placed in position i,j (1,1) of the scoring matrix. Using the
same equation and method, fill all the remaining rows and columns. For each cell, place a
back pointer to the predecessor cell from which the maximum score was obtained
(Figure 3).
Trace back Step
The final step in the algorithm is the trace back for the best alignment. In the above
example, the score in the bottom right hand corner is -1. The important point to note
here is that there may be two or more possible alignments between the two example
sequences.
The current cell, with value -1, has an immediate predecessor from which the maximum
score was obtained; it is located diagonally and its value is 0. If two or more cells
point back to a cell, there can be two or more possible alignments.
By continuing the trace back in this way, one reaches the 0th row, 0th column. Following
the steps described above, the alignments of the two sample sequences can be found. The
best alignment among them can be identified by using the maximum alignment score under
a scoring scheme (for example match = 5, mismatch = -1, gap = -2), which may be user
defined (Figure 4).
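The matrix fill and trace back steps described above can be sketched as follows. This is an illustrative implementation, not the exact procedure behind the figures: it uses the standard Needleman-Wunsch recurrence, assumes the first row and column are initialized with cumulative gap penalties (the usual convention for global alignment), and reports only one of the possibly several best alignments by preferring diagonal, then up, then left moves. The function name is hypothetical.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Globally align seq1 (columns) with seq2 (rows).

    Returns (score, aligned_seq1, aligned_seq2).
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    m = [[0] * cols for _ in range(rows)]

    # Assumed initialization: leading gaps accumulate the gap penalty.
    for j in range(1, cols):
        m[0][j] = j * gap
    for i in range(1, rows):
        m[i][0] = i * gap

    def s(i, j):
        # Match or mismatch score for the residues at cell (i, j).
        return match if seq2[i - 1] == seq1[j - 1] else mismatch

    # Matrix fill: best of diagonal, up and left predecessors.
    for i in range(1, rows):
        for j in range(1, cols):
            m[i][j] = max(m[i - 1][j - 1] + s(i, j),
                          m[i - 1][j] + gap,
                          m[i][j - 1] + gap)

    # Trace back from the bottom right corner to cell (0, 0).
    a1, a2 = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and m[i][j] == m[i - 1][j - 1] + s(i, j):
            a1.append(seq1[j - 1])
            a2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and m[i][j] == m[i - 1][j] + gap:
            a1.append("-")
            a2.append(seq2[i - 1])
            i -= 1
        else:
            a1.append(seq1[j - 1])
            a2.append("-")
            j -= 1
    return m[-1][-1], "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("CGTGAATTCAT", "GACTTAC")
    print(score)
    print(top)
    print(bottom)
```

With match = 1, mismatch = -1 and gap = -1, the bottom right score for the two example sequences is -1, in agreement with the text. Passing other values for `match`, `mismatch` and `gap` applies a user-defined scheme.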
Sequence Alignment
Wise2 - align a protein or profile HMM against genomic sequence to predict a gene
structure, and related tools
PipMaker
PipMaker computes alignments of similar regions in two DNA sequences. The resulting alignments are
summarized with a "percent identity plot", or "pip" for short. MultiPipMaker allows the user to see
relationships among more than two sequences. All pairwise alignments with the first sequence are
computed and then returned as interleaved pips. Moreover, MultiPipMaker can be requested to compute a
true multiple alignment of the input sequences and return a nucleotide-level view of the results.
T-COFFEE@EBI
T-Coffee is a multiple sequence alignment program. Its main characteristic is that it will allow you to
combine results obtained with several alignment methods.
Motif and Pattern Search in Sequences
The Gibbs Motif Sampler will allow you to identify motifs, conserved regions, in DNA or
protein sequences.
MEME
Multiple EM for Motif Elicitation (MEME) performs motif discovery on DNA, RNA or protein
datasets and discovers novel, ungapped motifs (recurring, fixed-length patterns) in your
sequences. MEME splits variable-length patterns into two or more separate motifs.
Pratt
The Pratt program is able to discover patterns conserved in sets of unaligned protein
sequences.
Protein Sequence Databases
SWISS-PROT & TrEMBL - Protein sequence database and computer annotated supplement
UniProt - UniProt (Universal Protein Resource) is the world's most comprehensive catalog
of information on proteins. It is a central repository of protein sequence and function
created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
GOLD - Genomes OnLine Database, information on complete and ongoing genome projects
FlyBase
EcoGene - E. coli
STRING - detect whether a given query gene occurs repeatedly with certain other genes in potential
operons
Pfam - alignments and hidden Markov models covering many common protein domains
Blocks - multiply aligned ungapped segments corresponding to the most highly conserved
regions of proteins
Protein Domain Profile Analysis @ BMERC - search a library of profiles with a protein
sequence
PredictProtein & PHD - predict secondary structure, solvent accessibility, transmembrane helices,
and other structural features
PSIpred (& MEMSAT & GenTHREADER) - protein secondary structure prediction (& transmembrane
helix prediction & tertiary structure prediction by threading)
Metabolic, Gene Regulatory & Signal Transduction Network Databases
BioCarta
EcoCyc
WIT
PathGuide A very useful collection of resources dealing primarily with pathways
PathDB
Transpath
Web site
http://www.diseasesdatabase.com/
Purpose
Diseases Database serves as way of classifying medical concepts along clinical axes, such as
cause/effect, risk factors, interactions, etc., rather than in hierarchies of anatomical, physiological,
or pathological systems.
Description
Diseases Database is a cross-referenced index of human disease, medications, symptoms, signs, and
abnormal investigation findings. The content focuses on internal medicine, inherited disease, clinical
biochemistry, and pharmacology. Its terms are referred to as "items."
Update Frequency
Diseases Database is updated regularly.
The Gene Disease Associations Database DisGeNET
To assess the concept of modularity of human diseases, this database performs a systematic study
of the emergent properties of human gene-disease networks by means of network topology and
functional annotation analysis.[1]
The results indicate a highly shared genetic origin of human diseases and show that for most
diseases, including Mendelian, complex and environmental diseases, functional modules exist.
Moreover, a core set of biological pathways is found to be associated with most human diseases.
Obtaining similar results when studying clusters of diseases, the findings in this database suggest
that related diseases might arise due to dysfunction of common biological processes in the cell.
The network analysis of this integrated database points out that data integration is needed to obtain
a comprehensive view of the genetic landscape of human diseases and that the genetic origin of
complex diseases is much more common than expected.
DisGeNET gene-disease association ontology
The description of each association type in this ontology is:
#Therapeutic Association: The gene/protein has a therapeutic role in the amelioration of the disease.
#Biomarker Association: The gene/protein either plays a role in the etiology of the disease (e.g. participates in the
molecular mechanism that leads to disease) or is a biomarker for a disease.
#Genetic Variation Association: Used when a sequence variation (a mutation, a SNP) is associated with the disease phenotype,
but there is still no evidence to say that the variation causes the disease. In some cases the presence of the variants
increases the susceptibility to the disease. In general, the NCBI SNP identifiers are provided.
#Altered Expression Association: Alterations in the function of the protein by means of altered expression of the gene are
associated with the disease phenotype.
#Post-translational Modification Association: Alterations in the function of the protein by means of post-translational
modifications (methylation or phosphorylation of the protein) are associated with the disease phenotype. [1]
Fluorescent Protein
Genscan - eukaryotes
GeneMark
Genie - eukaryotes
GLIMMER - prokaryotes
GFF (General Feature Format) Specification - a standard format for genomic sequence
annotation
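As an illustration of what a GFF annotation record looks like to a program, the sketch below parses a single feature line. It assumes the nine tab-separated columns used by GFF (seqid, source, type, start, end, score, strand, phase, attributes) and GFF3-style `key=value` attributes separated by semicolons; older GFF versions write attributes differently. The `GffFeature` type, the function name and the example line are hypothetical.

```python
from collections import namedtuple

GffFeature = namedtuple(
    "GffFeature",
    "seqid source type start end score strand phase attributes",
)


def parse_gff_line(line):
    """Parse one tab-separated GFF feature line into a GffFeature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields

    # Assumed GFF3-style attributes: key=value pairs joined by ';'.
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()

    return GffFeature(
        seqid, source, ftype,
        int(start), int(end),
        None if score == "." else float(score),  # '.' means no value
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


if __name__ == "__main__":
    # Made-up example record, not taken from a real annotation file.
    example = "chr1\texample\tgene\t100\t200\t.\t+\t.\tID=gene1;Name=demo"
    print(parse_gff_line(example))
```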
Other Databases (Annotations, Ontologies, Consortia, etc.)
Entrez Gene - Gene provides a unified query environment for genes defined by sequence and/or
in NCBI's Map Viewer. You can query on names, symbols, accessions, publications, GO terms,
chromosome numbers, E.C. numbers, and many other attributes associated with genes and the
products they encode. Replaces LocusLink.
Open Biological Ontologies an umbrella web address for well-structured controlled vocabularies
for shared use across different biological domains.
UTR database
ase of databases, listing all the biological databases currently available on the internet.
ENZYME - enzyme nomenclature database
Drug Discovery
The idea of using X-ray crystallography in drug discovery emerged more than 30 years ago, when the
first three-dimensional protein structures were determined. In the space of a decade, a radical
change has started in drug design, incorporating knowledge of the 3D structures of target proteins
into the design process.
Protein structure can influence drug discovery at every stage of the design process. It is traditionally
used in lead optimization, a process that uses a structure to guide the chemical modification of a lead
molecule in order to optimize shape, hydrogen bonds, and other non-covalent interactions with the
target.
Gene Therapy
Gene therapy is a new form of drug delivery in which the therapeutic agent is produced by the patient's
own synthetic machinery. It involves the efficient introduction of functional genes into the patient's
appropriate cells, so that enough of the protein encoded by the gene (the transgene) is produced to
accurately and permanently correct the disorder.
Personalized Medicine
Personalized medicine is a medical model that customizes health care, using genetic or other
information to tailor all decisions and practices to the individual patient.
Its application beyond long-established considerations, such as a patient's family history, social
circumstances, environment, and behavior, has so far been very limited, and almost no progress
has been made in the last decade.
Research in personalized medicine seeks to identify solutions on the basis of each individual's
susceptibility profile. These are the areas in which new approaches to diagnostics, drug
development, and individualized therapy can be found.
Preventive Medicine
Preventive medicine, or preventive care, consists of measures taken to prevent diseases (or
injuries) rather than to cure them or treat their symptoms. It contrasts in method with curative
and palliative medicine, and in scope with public health methods (which work at the population
level rather than the individual level). Simple examples of preventive medicine are hand
washing, breastfeeding, and immunization.
Microbial Genome Applications
In the field of Microbial Genome Applications, applications of bioinformatics are used for following
areas:
Waste Cleanup
Bioinformatics helps to identify bacteria and other microbes that are useful for cleaning up waste.
The bacterium Deinococcus radiodurans has the ability to repair damaged DNA and small chromosome
fragments by isolating the damaged segments within a concentrated area. Deinococcus radiodurans is
listed in the Guinness Book of World Records.
Climate Change
Climate change is caused by factors that include oceanic processes (such as oceanic circulation), variations in the
solar radiation received by the Earth, plate tectonics, volcanic eruptions, and human changes to the natural world.
By studying microorganisms, genome researchers can understand these microbes at a very basic level, isolating
genes that enable them to survive in extreme conditions. Rhodopseudomonas palustris is a phototrophic purple
non-sulfur bacterium commonly found in soil and water. It converts sunlight into cellular energy, and it absorbs
carbon dioxide from the atmosphere and converts it into biomass.
Biotechnology
The wide concept of "biotech" or "biotechnology" covers a range of procedures for modifying living organisms
for human purposes, going back to the domestication of animals, the cultivation of plants, and their
improvement through breeding programs that use artificial selection and hybridization.
Modern usage also includes genetic engineering as well as cell and tissue culture technologies. In
biotechnology, bioinformatics is used to identify organisms and microorganisms that may be useful to the
dairy and food processing industries.
Lactococcus lactis, a non-pathogenic rod-shaped bacterium that is critical for the production of dairy foodstuffs
like buttermilk, yoghurt and cheese, is one of the most important microorganisms involved in the dairy sector.
This bacterium is also used in the preparation of pickled vegetables, beer, wine, some breads, and
sausages.
Researchers expect the understanding of the physiology and genetic make-up of this bacterium to be
invaluable for food producers as well as for the pharmaceutical industry, which is researching the capability
of L. lactis to be used as a drug delivery vehicle.
Agriculture
In the field of Agriculture, applications of bioinformatics are in following areas:
Crop Improvement
Comparative genetics of plant genomes has shown that the organization of their genes has remained more
conserved over evolutionary time than had previously been expected. These findings show that information
from model crop systems can be used to suggest improvements to other food crops.
Examples of the available complete plant genomes are Arabidopsis thaliana (thale cress) and Oryza sativa
(rice). Arabidopsis thaliana, the first plant to be sequenced, is considered a model for the investigation
of plant genetics and biology.
Many genes are similar across all plants, so the analysis of genes in a model organism such as A. thaliana
helps us to understand gene expression and function in all plants. In addition, because both animals and
plants are eukaryotes, many of the genes found in A. thaliana have animal homologs.
The most common reason Arabidopsis was chosen as a model organism for genome sequencing is its small
genome: its DNA consists of approximately 140 million bases divided among five chromosomes, one of the
smallest genomes of any flowering plant.