A starter
K. Mani.
Reader in Botany,
PSG College of Arts and Science,
Coimbatore, Tamilnadu, India.
kmani52@gmail.com
Summary
• What is Bioinformatics?
• How did it start?
• Who are the Bioinformaticians? The tool makers and the tool users
• The information: Nucleotides, proteins, and structures.
• The Database: Primary, Secondary and special.
• The elements:
• Sequence analysis: Pairwise and multiple sequence analysis
• Phylogenetic analysis:
• Genomics: Structural, functional and comparative
• Proteomics: Structural, functional, comparative and interactive
• Metabolomics: Reconstruction of metabolic pathways
• Systems Biology: Cell function Simulation
• Applications of Bioinformatics:
• Conclusion
What is Bioinformatics?
Bioinformatics is the collection, storage, maintenance and distribution of
biological data, and the extraction of meaningful information from those data
with the help of computers, so that the data can be converted into knowledge.
The first phase in bioinformatics is the conversion of biological data into digital
format. The second phase is cleaning and arranging the data into an easily retrievable
form. This is called the database.
The third phase is the extraction of information hidden in the data by comparison and
analysis using computer programs, which converts the data into knowledge.
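The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the FASTA text format, the function names and the two example records are not from the text above and are used here purely for demonstration.

```python
def parse_fasta(text):
    """Phase 1 and 2: read digital sequence text into a {identifier: sequence} store."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Fall back to a generated name if the header line is empty.
            current_id = header[0] if header else f"unnamed_{len(records) + 1}"
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line.upper())
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def gc_content(sequence):
    """Phase 3: extract one simple piece of information, the G+C fraction."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Hypothetical example records, invented for this sketch.
EXAMPLE = """>seq1 hypothetical example
ATGCGC
GGTA
>seq2 hypothetical example
ATATAT
"""

database = parse_fasta(EXAMPLE)
for seq_id, sequence in database.items():
    print(seq_id, len(sequence), gc_content(sequence))
```

A real database adds indexing, curation and cross-references, but the flow from raw text to stored records to derived information is the same.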
How did it start?
Two major events in biology gave bioinformatics its start. The attempts of
Margaret Dayhoff, from the 1960s onwards, to analyze protein sequences with the help of
computer programs laid the initial groundwork for bioinformatics. However, the real
jump-start came only after the release of the human genome draft in 2001.
All the information necessary for the growth and development of an organism is
already present in the nucleotide sequence of its genome. Similarly, the structure and
function of a protein are already inscribed in its primary sequence. Therefore, all one
needs to know is the sequence; the rest can be discerned from it by computational
analysis.
The attempt to cull every bit of information from a given sequence or
structure is bioinformatics.
Specialized databases
Special data such as expressed sequence tags (ESTs), sequence tagged sites (STSs) and
single nucleotide polymorphisms (SNPs) are created for special purposes, and they make up
the content of specialized databases. These data can also be obtained from NCBI.
STSs are short nucleic acid sequences that help to locate a gene by their
unique sequence. ESTs are DNA sequences reverse transcribed from mRNA. The set of
ESTs from a particular cell type gives information about the genes expressed in that
cell type. Assembling EST sequences can give us the complete transcribed sequence of a
gene without the intervening introns.
SNP data are helpful as genetic markers for locating specific disease-related
genes on the chromosomes. They are also useful as disease diagnostic data. OMIM is another
specialized database, which catalogs the known inherited diseases of man. Apart
from giving the literature and sequences, it also offers links to databases that contain
related data.
Other specialized databases are species specific. The biological data concerned
with a particular model organism are maintained separately as a special database. WormBase
is for Caenorhabditis elegans, CyanoBase is for cyanobacteria, ATH is for Arabidopsis
thaliana, and FlyBase is for Drosophila.
PubMed and Agricola are specialized databases for biological literature.
Secondary Database
Secondary, derived or value-added databases are highly information-enriched data
sources. Primary data that have been meaningfully classified and curated are called
secondary data. Secondary data are free from redundancy, easy to retrieve and ready for
use in analysis.
The overwhelming size of the primary data makes it very difficult for the user to
retrieve the desired data successfully. In secondary databases, for example, millions of
protein sequences are reduced to a limited number of protein classes and domains.
Similarly, billions of nucleic acid sequences are classified into unigenes, gene families
and functionally related genes. Apart from being easy to search, these databases help one
to predict the structure and function of an unknown gene or protein.
Elements of Bioinformatics
The study of bioinformatics has the following aspects.
• Sequence analysis
• Genomics
• Proteomics
• Metabolomics
• Systems biology
• Applications of bioinformatics
Sequence analysis
Currently there are 52 billion nucleotide bases and 50 million DNA sequences
in GenBank. These DNA sequences belong to many species of organisms, including
plants, animals, microbes and viruses. Equally amazing is the protein sequence data.
UniProt, on the ExPASy server, holds several thousand well-curated unique protein
sequences. The first and most basic approach in bioinformatics is the analysis of sequences.
There are two types of sequence comparison, namely pairwise sequence comparison
and multiple sequence comparison.
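As an illustration of pairwise comparison, the following is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch approach, which the text above does not name). The scoring values of +1 for a match and -1 for a mismatch or gap are assumptions chosen for simplicity, not recommended parameters.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of sequences a and b.

    Only the score is computed, one row of the dynamic programming
    table at a time; the alignment itself is not reconstructed.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs only gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap        # gap inserted in b
            left = current[j - 1] + gap   # gap inserted in a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

Multiple sequence comparison extends the same idea to three or more sequences, usually by combining pairwise alignments progressively rather than by exact dynamic programming.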
Phylogenetic analysis
Organisms acquire new genes when they meet a new environment. New genes help
the individual to survive better in the new environment. New genes are created from
already existing genes by duplication and variation. Some organisms acquire new
genes laterally from other species.
The genes of a family that has arisen by duplication are related to each other as
siblings. These are homologous sequences, but they are specifically named paralogous genes
because they exist in the same genome.
Families of homologous genes that are distributed across several species and yet
have a common origin are called orthologous genes. Molecular phylogeny relies upon
counting the number of mutational steps that have occurred between sequences to
measure their phylogenetic distance. The number of mutations also serves as a kind of
molecular clock for calculating the time that has passed between the common ancestral
sequence and the present sequence.
There are two major approaches to deciding the phylogenetic distance between
sequences: the phenetic and the cladistic. The phenetic approach is
straightforward and fast but artificial. The cladistic approach is computationally intensive
but natural. The cladistic approach gives different weights to amino acid substitutions based
upon the number of codon changes needed to pass from the ancestral state to the current state.
Maximum parsimony and maximum likelihood are the two cladistic algorithms used in
phylogenetic analysis.
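Counting mutational steps between aligned sequences, as described above, can be sketched as follows. The sketch uses nucleotide sequences and the Jukes-Cantor correction for multiple substitutions at one site; both choices, and the example sequences, are assumptions made for illustration and are not prescribed by the text.

```python
import math


def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences, ignoring gap columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor(p):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences, invented for this sketch.
p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jukes_cantor(p))
```

Under a molecular clock with a known substitution rate r per site per year, a corrected distance d between two present-day sequences would correspond to a divergence time of roughly d / (2r), since both lineages accumulate changes.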
Genomics
At present (2007) there are nearly 1000 completed genomes deposited in
GenBank. Genomes of model organisms such as mouse, rat, Drosophila, fish, chimpanzee,
sea urchin, worm, yeast, Neurospora, Arabidopsis, rice and poplar, and of several hundred
bacteria, archaea and viruses, are now available. Gene expression data, better known as
microarray data, are also accumulating in the public databases.
Genomic data analysis has three perspectives: structural genomics, comparative
genomics and functional genomics.
The genome is the complete set of information from which a zygote develops and
grows into an adult organism. If bioinformatics is what it is believed to be, it should
decipher the encoded genetic information that turns a single diploid cell into a complete
organism.
The objective of genome sequencing is not only to unravel the mysteries that
shroud developmental biology but also to improve crops and to mitigate human
suffering from disease. "If only I had the opportunity to have entire human genome in
my computer I wouldn't have spent 7 hard years to locate a single gene that was
responsible for Cystic fibrosis disorder. It would have taken only a few hours to do so,"
said Collins, the Director of the Human Genome Project, before the project was completed.
Structural genomics
Structural genomics tries to locate the precise position of genes and their
regulatory elements on the genome. Before genome sequences were available, scientists
relied upon mapping to locate a gene. Neither genetic mapping (linkage
mapping) nor physical mapping (RFLP, SINE or LINE repetitive sequence position
mapping) is as precise as the genome sequence itself. The genome sequence is the
ultimate gene map because it places each gene at its exact nucleotide position.
Structural genomics aims at mapping genes, their regulatory elements, exons,
introns, poly-A tail start points, pseudogenes, paralogs, non-coding conserved regions, etc.
on the genome.
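One of the simplest ways to place candidate genes at exact nucleotide positions is to scan a sequence for open reading frames (ORFs). The sketch below is a deliberately naive illustration: it looks only at the forward strand, ignores introns and regulatory elements, and uses the standard start and stop codons. The function name, the minimum length and the example sequence are assumptions made for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) positions of forward-strand ORFs.

    Positions are 0-based and half-open, so dna[start:end] begins with
    ATG and ends with a stop codon. min_codons counts the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Hypothetical sequence, invented for this sketch.
sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

Real gene finders for eukaryotic genomes must also model exon and intron boundaries, which is why structural annotation is considerably harder than this scan suggests.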
Functional genomics
Assigning functions to genes and studying the regulation of their expression under
different conditions is called functional genomics. The recently introduced microarray
analysis of the entire genome under varied experimental conditions has revolutionized our
understanding of gene expression. Studying the expression levels of 10,000
genes under varied physiological conditions in a single experiment is now possible.
Microarray data for yeast gene expression under 600 different sets of
physiological conditions are available from the Stanford Microarray Database. This is only a
fraction of the data available there.
These data allow biologists to compare gene expression levels in the diseased
state and the normal state. Comparison of cancer cells with normal cells would reveal the
genes that are involved in the cancerous state. Genes that are under-expressed,
over-expressed, co-expressed or contra-expressed can be clustered and studied in detail in
the laboratory. Computer-aided drug designers rely upon these analyses to find
suitable drug target proteins.
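The comparison of diseased and normal expression levels can be sketched with a log2 fold change per gene. The threshold of one log2 unit (a two-fold change), the pseudocount, the gene names and the intensity values below are all assumptions invented for illustration; real microarray analysis also needs normalization and statistical testing across replicates.

```python
import math


def classify_expression(normal, diseased, threshold=1.0, pseudocount=1.0):
    """Label each gene as over-expressed, under-expressed or unchanged.

    normal and diseased map gene names to expression intensities.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    result = {}
    for gene in sorted(normal.keys() & diseased.keys()):
        ratio = (diseased[gene] + pseudocount) / (normal[gene] + pseudocount)
        log2_fold_change = math.log2(ratio)
        if log2_fold_change >= threshold:
            label = "over-expressed"
        elif log2_fold_change <= -threshold:
            label = "under-expressed"
        else:
            label = "unchanged"
        result[gene] = (log2_fold_change, label)
    return result


# Hypothetical intensities, invented for this sketch.
normal_cells = {"geneA": 99, "geneB": 399, "geneC": 100}
cancer_cells = {"geneA": 399, "geneB": 99, "geneC": 110}

for gene, (change, label) in classify_expression(normal_cells, cancer_cells).items():
    print(gene, round(change, 2), label)
```

Genes with the same label across many conditions are candidates for the co-expressed clusters mentioned above.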
Comparative genomics
Comparison of two or more genomes with each other is a major study in
bioinformatics. Genome-level comparison of a couple of species supplies several kinds of
information, such as:
• Genome evolution
• Differences in the metabolic pathways
• Distinguishing parasitic biology from saprophytic biology
• Distinguishing toxigenic genes from non-toxigenic genes
• Distinguishing drug-resistant from susceptible organisms
• Distinguishing tolerant from intolerant organisms
Comparing two closely related genomes with a distantly related genome
has revealed several highly conserved but non-coding regions of the genome. Genomes are
compared at several levels. One has to start with comparison at the nucleotide level.
Dinucleotide frequency, codon preference, CpG islands, etc. will reveal interesting
differences between organisms. Comparison at the gene level would reveal the evolution of
genes and gene organization, gene synteny, chromosomal rearrangements, etc. Comparing
gene regulatory elements will reveal the operon systems. Finally, one may go up to the
level of comparing the genes related to various metabolic functions and gene ontology.
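Nucleotide-level comparison can start from very simple statistics. The sketch below computes dinucleotide frequencies and the commonly used observed/expected CpG ratio, (number of CG dinucleotides x sequence length) / (number of C x number of G). The function names and test sequences are assumptions for illustration; a real CpG island search applies such measures in sliding windows together with a G+C content criterion.

```python
from collections import Counter


def dinucleotide_frequencies(dna):
    """Return the relative frequency of each overlapping dinucleotide."""
    dna = dna.upper()
    pairs = [dna[i:i + 2] for i in range(len(dna) - 1)]
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {pair: count / len(pairs) for pair, count in counts.items()}


def cpg_observed_expected(dna):
    """Observed/expected CpG ratio; values well below 1 indicate CpG depletion."""
    dna = dna.upper()
    c_count = dna.count("C")
    g_count = dna.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    cpg_count = sum(1 for i in range(len(dna) - 1) if dna[i:i + 2] == "CG")
    return cpg_count * len(dna) / (c_count * g_count)


# Hypothetical sequences, invented for this sketch.
print(dinucleotide_frequencies("ACGT"))
print(cpg_observed_expected("CGCGCG"))
```

Computing the same statistics for two genomes and comparing the resulting profiles is one way to make the differences mentioned above concrete.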
Proteomics
'One gene, one enzyme' has become a statement of merely historic importance, because
'one gene, several proteins' is the reality in eukaryotic systems. The finding that a single
gene can code for more than one protein through alternative splicing of exons has solved
the riddle of how a cell can possess three times as many proteins as genes. We were
unable to match phenotypes with genotypes because we were fixated on
genes alone. Indeed, proteins are the real workers that manifest the phenotype. With this
view, the epigenetic mysteries become explainable.
Proteomics is the documentation of every individual protein of a cell. It includes
understanding the structure and function of the proteins. Just like genomics, proteomics
has three perspectives. Structural proteomics deals with constructing protein
structures. Functional proteomics is about the expression of proteins in a cell under varied
conditions. Comparative proteomics deals with comparing the entire protein complement
of two different cells, or of one cell under different conditions. In addition to these
perspectives, the study of protein-protein interaction has become a major interest in biology.
Protein function prediction
The volume of protein sequence data and protein structure data is very impressive.
Although UniProt is regarded as a well-curated protein database, about 80 percent of the
proteins in a given genome are still only hypothetical or putative. Assigning functions to
these hypothetical proteins is a major task of bioinformatics.
Structure prediction
Protein structures are elucidated by X-ray crystallography and nuclear magnetic
resonance (NMR) spectroscopy. This is possible only after the protein has been
highly purified and, for crystallography, made into a crystal. As transmembrane proteins
are not readily amenable to purification and crystallization, the number of entries in the
Protein Data Bank for transmembrane proteins is limited. Unfortunately, most of the proteins
implicated in human diseases are membrane proteins, so there is an urgent need to construct
protein structures from sequence knowledge alone.
Without the structure of the target protein, structure-based computer-aided drug design
is not possible. A multi-billion opportunity awaits anyone capable of modeling membrane
proteins theoretically.
Protein interaction
Proteins function in teams. Proteins are capable of self-assembly. Many proteins
assemble into a single entity and bring about a particular cell function
cooperatively. How proteins recognize their partners and interact precisely is a
fascinating study. Though yeast two-hybrid experiments are now a widely used technique for
studying protein-protein interaction, bioinformatics methods have greater scope in the near
future.
Metabolomics
Nowadays increasing interest is shown in sequencing the genomes of
extremophilic microorganisms. Many archaea are extremophiles capable of living
at temperatures near the boiling point of water and under extremely high
pressures. The genes and proteins of such organisms are extremely useful in
biotechnology. It may be recalled how Taq polymerase, an enzyme
derived from the thermophilic bacterium Thermus aquaticus, revolutionized genetic
engineering and biotechnology through its use in PCR.
Bioinformatics skills can help a person to reconstruct the entire metabolic pathway
of an organism from its genomic sequence alone. Predicting the entire set of metabolic
pathways from genomic sequences is now possible, thanks to the metabolic pathway database
of KEGG.
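The idea of reconstructing pathways from a genome can be sketched as a lookup: each annotated gene is assigned an enzyme (here by EC number), and each reference pathway is checked for how many of its required enzymes the genome supplies. The reference table below is a small hand-written toy (the first four enzymes of glycolysis), not data downloaded from KEGG, and the gene identifiers are hypothetical.

```python
def pathway_coverage(annotations, pathways):
    """For each pathway, report the fraction of required enzymes found and those missing.

    annotations maps gene identifiers to EC numbers;
    pathways maps pathway names to lists of required EC numbers.
    """
    found = set(annotations.values())
    report = {}
    for name, required in pathways.items():
        missing = [ec for ec in required if ec not in found]
        if required:
            coverage = (len(required) - len(missing)) / len(required)
        else:
            coverage = 0.0
        report[name] = {"coverage": coverage, "missing": missing}
    return report


# Hand-written toy reference table, not retrieved from any database.
TOY_PATHWAYS = {
    "glycolysis (first four steps)": ["2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"],
}

# Hypothetical genome annotation, invented for this sketch.
TOY_ANNOTATIONS = {
    "gene_001": "2.7.1.1",   # hexokinase
    "gene_002": "5.3.1.9",   # glucose-6-phosphate isomerase
    "gene_003": "4.1.2.13",  # fructose-bisphosphate aldolase
    "gene_004": "1.1.1.1",   # alcohol dehydrogenase, not part of the toy pathway
}

report = pathway_coverage(TOY_ANNOTATIONS, TOY_PATHWAYS)
print(report)
```

A missing enzyme in an otherwise complete pathway is a useful lead: either the organism uses an alternative route, or the gene is present but has not yet been recognized in the annotation.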
Comparative genomics reveals only about one fourth of the information. Proteomics
supplies another fifty percent, and only metabolomics completes the picture.
While the former two are static, the latter captures the dynamics of the cell. In other
words, metabolomics fulfills the demands of bioinformatics.
Systems Biology
The pinnacle of bioinformatics is the simulation of cellular processes inside a
computer. Simulating the cellular system in its entirety on a computer is
systems biology.
A cell, with all its metabolic complement, regulation, energetics and dynamics, can
be simulated only if we know its entire cell biology. Though simulating a human cell may be
the ultimate goal, our current knowledge permits only the simulation of a minimal
prokaryotic cell.
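At its smallest scale, cell simulation means stepping a set of reaction rates forward in time. The sketch below simulates a single enzyme-catalysed reaction (substrate to product) with Michaelis-Menten kinetics and simple Euler integration. The kinetic model, the parameter values and the function name are assumptions chosen for illustration; a whole-cell model couples a very large number of such equations.

```python
def simulate_reaction(substrate, vmax, km, dt=0.01, steps=1000):
    """Simulate substrate -> product with Michaelis-Menten kinetics.

    Returns a list of (time, substrate, product) tuples, one per step.
    km must be positive so that the rate is always defined.
    """
    product = 0.0
    trajectory = [(0.0, substrate, product)]
    for step in range(1, steps + 1):
        rate = vmax * substrate / (km + substrate)
        # Never convert more substrate than is left in this time step.
        converted = min(rate * dt, substrate)
        substrate -= converted
        product += converted
        trajectory.append((step * dt, substrate, product))
    return trajectory


# Hypothetical parameters, invented for this sketch (arbitrary units).
trajectory = simulate_reaction(substrate=10.0, vmax=1.0, km=0.5)
time, remaining, formed = trajectory[-1]
print(time, remaining, formed)
```

Substrate and product always sum to the starting amount, which is a quick sanity check that the simulation conserves mass.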
A minimal cell is one that is capable of living independently and carrying out all
its biological activities, such as growth, reproduction and adjustment to the environment,
with the minimum number of genes. The smallest genomes, comprising only about 500 genes,
belong to a couple of parasitic bacteria. A truly independent cell might need at least 1000
genes. Once the minimal gene set is known, a true E-Cell can be developed
in full.
Cell simulation has enormous potential. It is a boon for biotechnologists. They
can study gene expression and metabolic regulation, and perform gene manipulation
and expression experiments virtually. The virtual cell would predict the toxicology,
pharmacodynamics and pharmacokinetics of drugs without even touching a mouse or rat in
the laboratory. Animal-rights activists could rest easy.
Conclusion
Like any other branch of science, bioinformatics started with humble
objectives and a cold reception. But unlike other scientific disciplines, it has soared to
unimaginable heights. It differs from other sciences in viewing a
problem from the whole to the part, rather than from the part to the whole. Its Gestalt view
of problems and its parentage in several branches of science are unique. Molecular biologists,
pathologists, biologists, computer scientists, statisticians, mathematicians, biophysicists
and biochemists have joined hands in the development of bioinformatics.
Biology is no longer the same. A biologist who leaves out bioinformatics works
at a serious disadvantage. The future battle against human suffering and for the
conservation of nature is going to be fought with the weapons of bioinformatics.