Bioinformatics A Starter

Bioinformatics
A starter
K. Mani.
Reader in Botany,
PSG College of Arts and Science,
Coimbatore, Tamilnadu, India.
kmani52@gmail.com
Summary
• What is bioinformatics?
• How did it start?
• Who are the bioinformaticians? The tool makers and the tool users
• The information: nucleotides, proteins and structures
• The databases: primary, secondary and specialized
• The elements:
• Sequence analysis: pairwise and multiple sequence analysis
• Phylogenetic analysis
• Genomics: structural, functional and comparative
• Proteomics: structural, functional, comparative and interactive
• Metabolomics: reconstruction of metabolic pathways
• Systems biology: cell function simulation
• Applications of bioinformatics
• Conclusion
What is Bioinformatics?
Bioinformatics is the collection, storage, maintenance and distribution of
biological data, and the use of computers to extract meaningful information from
those data and convert it into knowledge.
The first phase in bioinformatics is the conversion of biological data into digital
format. The second phase is cleaning and arranging the data into an easily retrievable
form; the result is called a database.
The third phase is the extraction of the information hidden in the data, by comparison
and analysis using computer programs, so that the data become knowledge.
How did it start?
Two major developments in biology gave bioinformatics its start. The attempts of
Margaret Dayhoff, from the 1960s onward, to analyze protein sequences with the help
of computer programs laid the initial groundwork. The real jump-start, however, came
only after the release of the human genome draft in 2001.
All the information necessary for the growth and development of an organism is
already present in the nucleotide sequence of its genome. Similarly, the structure and
function of a protein are largely inscribed in its primary sequence. In principle,
therefore, all one needs to know is the sequence; much of the rest can be discerned
from it by computational analysis.
The attempt to cull every bit of information from a given sequence or structure is
bioinformatics.
The two pillars of bioinformatics

Since bioinformatics is a happy marriage of computing and biology, these two remain
its pillars. Computer-savvy biologists and biology-loving computational scientists
enjoy bioinformatics equally, and each group has its own share in its development
and growth.
The computational domain involves creating and maintaining biological databases,
developing powerful programs that can crunch huge volumes of data, and building smart
analytical tools that extract the secrets buried in the genetic code.
The second domain belongs to the biologists, who use these tools to unearth the
secrets hidden in nucleotide and amino acid sequences.
For instance, the following information can be dug out of a given nucleic acid
sequence:
• The positions of genes on the genome sequence
• The positions of introns and exons
• Deletion, insertion and substitution mutations
• Paralogous genes, orthologous genes and pseudogenes
• Non-coding gene regulatory elements
• The protein for which the gene codes
• The gene-flanking sequences that would serve as primers for PCR
Similarly, from protein sequences the following, and much more, can be
traced:
• Protein function
• Protein secondary structure
• Tertiary structure
• Super secondary structures
• Domains
• Patterns
• Motifs
• Antigenic regions
• Post-translational modifications
• Grand average of hydropathy (GRAVY)
• Molecular weight
• Electrical properties
• Half life
• Optical extinction coefficient
• Interactions with other molecules, including other proteins
• Cellular location
• Signal peptides
• Active site
• Probable position on a 2D electrophoresis gel
• Transmembrane properties
• Phylogeny
• Iso-electric point
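Several of the properties listed above are simple functions of amino acid composition. As a minimal sketch, the grand average of hydropathy can be computed by averaging a per-residue hydropathy value over the sequence. The Kyte-Doolittle scale is assumed here because the text does not name a scale, and the function name gravy is only illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
# (assumed scale; the article does not specify one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein: str) -> float:
    """Return the mean hydropathy of the standard residues in a protein."""
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("no standard amino acids found")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)
```

A positive value suggests a hydrophobic protein overall and a negative value a hydrophilic one; non-standard letters are simply skipped in this sketch.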
The nature of Biological data
There are four kinds of biological data:

• Gene related
• Protein related
• Gene and protein function related
• Structure related
The gene-related data consist of gene sequences, the locations of genes and their
regulatory elements, together with genome sequences, chromosomal architecture,
non-coding regions, gene markers, chromosome map elements and single nucleotide
polymorphism (SNP) markers. Sequence data are stored as text files, commonly in a
format known as FASTA.
The protein-related data include sequences, domains, patterns, motifs and so on; the
sequences are likewise stored in FASTA format in the databanks.
The structure-related data include three-dimensional atomic coordinate files of
proteins, small molecules, ligands, tRNA and the like.
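A FASTA file, as mentioned above, is plain text: each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. The sketch below reads such text into a dictionary. The function name parse_fasta and the example headers are hypothetical, and real files may carry richer header conventions than this handles.

```python
def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 hypothetical gene
ATGGCC
TTTAAA
>seq2 hypothetical protein
MKTAY
"""
sequences = parse_fasta(example)
```

Here sequences["seq1 hypothetical gene"] holds the two sequence lines joined into one string.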
The Data Banks
Biological databanks collect, maintain and distribute data. Some databanks are
large storehouses, known as data warehouses; others are well classified, like
supermarkets.
Sequence, structure and gene expression data are maintained digitally on robust
computer servers and made available over the Internet through the World Wide Web.
With a few exceptions, almost all biological data are free and open access.
The Big three

Across the world there are three major biological databases. They can be accessed
through an Internet browser, and their site maps take the visitor on a tour of all
their components.
National Center for Biotechnology Information (NCBI)
European Molecular Biology Laboratory (EMBL)
DNA Data Bank of Japan (DDBJ)
Nucleic acid sequences, either as genes or as genomes, are available with complete
annotations in these databases. Entrez (NCBI) and SRS (EMBL) are the user-friendly
data retrieval systems offered by the databases themselves.
Classification of databases
The first issue of Nucleic Acids Research each year lists newly added databases.
The total number of databases in the world is now nearing 1000. Databases can be
classified by the nature of their data source. Primary databases consist of data
derived directly from the source laboratories. Secondary databases hold
information-enriched data derived from the primary databases. Specialized databases
contain data of special interest.
Primary databases hold nucleic acid sequence data or protein sequence data.
Structure files from crystallography research centers are also primary data. Genes,
genomes, microarray gene expression data, protein sequence data and protein structure
data are the major kinds of primary data.
Specialized databases
Special data such as expressed sequence tags (ESTs), sequence tagged sites (STSs) and
single nucleotide polymorphisms (SNPs) are created for special purposes and make up
the content of specialized databases. These data can also be obtained from NCBI.
STSs are short nucleic acid sequences that help to locate a gene by their unique
sequence. ESTs are DNA sequences reverse transcribed from mRNA. The set of ESTs from
a particular cell type gives information about which genes are expressed in it.
Assembling EST sequences yields the transcribed sequence of a gene without the
intervening introns.
SNP data are helpful as genetic markers for locating specific disease-related genes
on the chromosomes, and are also useful in disease diagnosis. OMIM is another
specialized database, which lists known inherited diseases of man. Apart from giving
the literature and sequences, it also offers links to databases that contain related
data.
Other specialized databases are species specific. The biological data concerned
with a particular model organism are maintained separately as a special database:
WormBase is for Caenorhabditis elegans, CyanoBase for cyanobacteria, ATH for
Arabidopsis thaliana and FlyBase for Drosophila.
PubMed and Agricola are specialized databases for biological literature.
Secondary databases
Secondary, derived or value-added databases are highly information-enriched data
sources. Primary data that have been meaningfully classified and curated are called
secondary data. Secondary data are free from redundancy, easy to retrieve and ready
for analysis.
The overwhelming size of the primary data makes it very difficult for the user to
retrieve the desired data. In secondary databases, for example, millions of protein
sequences are reduced to a limited number of protein classes and domains. Similarly,
millions of nucleic acid sequences are classified into unigenes, gene families and
functionally related genes. Apart from being easy to search, secondary databases help
one to predict the structure and function of an unknown gene or protein.
Elements of Bioinformatics
The study of bioinformatics has the following aspects:
• Sequence analysis
• Genomics
• Proteomics
• Metabolomics
• Systems biology
• Applied bioinformatics
Sequence analysis
Currently there are 52 billion nucleotide bases in 50 million DNA sequences in
GenBank. These sequences belong to many species of organisms, including plants,
animals, microbes and viruses. Equally amazing is the protein sequence data: UniProt,
on the ExPASy server, has many thousands of well-curated, unique protein sequences.
The first and most basic approach in bioinformatics is the analysis of sequences.
There are two types of sequence comparison, namely pairwise sequence comparison and
multiple sequence comparison.
Pair-wise sequence comparison

Two gene or two protein sequences can be compared to find the relationship between
them. How closely do they resemble each other? Do they come from a common ancestral
gene? How far have they diverged since their origin from that common ancestor? All
this can be learned from sequence comparison.
Before their similarity can be measured, the sequences have to be aligned with each
other. Sequence alignment is a daunting job. For instance, two sequences of 100
residues each give a comparison matrix of ten thousand residue pairs, and the number
of possible gapped alignments through that matrix is far larger. Only one of them is
the real alignment, and identifying it needs the help of a computer. The most
probable alignment, the one with the highest similarity score, is called the optimal
alignment. There are two competing kinds of sequence alignment algorithm, namely
local alignment and global alignment.
The local alignment algorithm created by Smith and Waterman tries to find the
highest-scoring aligned region, even if the sequences align only in fragments.
Because many proteins are built from domains, they are often related to each other
only in local domains rather than over their entire length. Attempting to align two
such sequences globally would miss the significant local similarity.
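The local alignment idea described above can be sketched with a small dynamic programming routine. This is a simplified illustration, not the published implementation: it returns only the best local score (no traceback), uses a linear gap penalty, and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (score only)."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the score matrix
    for i in range(1, len(a) + 1):
        curr = [0]                     # first column is always zero
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: the alignment may restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

For two sequences that share only a short common segment, the score reflects that segment alone, which is the behaviour that suits domain-level similarity.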
The global alignment algorithm of Needleman and Wunsch tries to align two
sequences from one end to the other. Both algorithms introduce gaps wherever
necessary to optimize the alignment.
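For comparison, end-to-end alignment can be sketched in the same dynamic programming style. Again this is a minimal illustration under assumed scoring values (match +1, mismatch -1, linear gap -2) that computes only the optimal global score, not the alignment itself.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score between a and b (score only)."""
    # Aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Unlike the local case, scores may go negative here, because every residue of both sequences must be accounted for.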
Both these algorithms were breakthroughs in the early days of computational
biology. The creation of suitable substitution matrices further advanced the field.
A substitution matrix is a look-up table for scoring the amino acid or nucleotide
substitutions that are seen between related sequences.
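To make the look-up table idea concrete, here is a toy nucleotide substitution matrix. The scores are invented for illustration only (identity +1, transition -1, transversion -2) and are not taken from any published matrix; protein matrices work the same way with a 20 by 20 table.

```python
PURINES = {"A", "G"}  # C and T are the pyrimidines


def build_matrix(match: int = 1, transition: int = -1,
                 transversion: int = -2) -> dict:
    """Build a toy score table keyed by (base, base) pairs."""
    matrix = {}
    for x in "ACGT":
        for y in "ACGT":
            if x == y:
                matrix[(x, y)] = match
            elif (x in PURINES) == (y in PURINES):
                matrix[(x, y)] = transition      # purine<->purine or pyrimidine<->pyrimidine
            else:
                matrix[(x, y)] = transversion    # purine<->pyrimidine
    return matrix


def score_ungapped(a: str, b: str, matrix: dict) -> int:
    """Sum the table scores over an ungapped alignment of equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(x, y)] for x, y in zip(a, b))
```

An alignment algorithm would consult such a table at every cell instead of using a single match or mismatch value.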
As the number of sequences deposited in GenBank grew, eventually reaching several
million by the late 1990s, a fast sequence comparison method was needed in place of
the original reliable but slow algorithms. This need led to BLAST (Basic Local
Alignment Search Tool), an ultra-fast heuristic alignment and scoring tool first
published by Altschul and others in 1990. With it, bioinformatics took rapid strides
towards gene annotation and rapid sequence retrieval. BLAST now has its own family
of programs that perform various tasks.
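The speed of a heuristic search of this kind comes largely from looking for short exact word matches first and aligning only around them. The sketch below shows that seeding step alone, in a much simplified form; it is not BLAST itself, which also uses neighbourhood words, hit extension and statistical scoring. The word length k = 3 and the function names are assumptions for illustration.

```python
from collections import defaultdict


def index_words(seq: str, k: int) -> dict:
    """Map every k-letter word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds
```

Each seed marks a place where a slower, more careful local alignment is worth attempting, so most of the database is never aligned at all.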
Multiple Sequence Alignment

The complexity of pairwise alignment is tiny compared with the intricacy of
aligning several sequences together. Aligning several related sequences has set a new
trend in the study of phylogeny: the branch of molecular biology envisaged by Linus
Pauling has become a reality, known as molecular phylogeny. The following information
can be extracted from a multiple sequence alignment:
• The conserved (unchanged) portions of the sequences become evident.
• Conserved regions may correspond to secondary structures such as alpha helices or
beta strands.
• The residues that constitute the active site of an enzyme, binding sites for
ligands and hetero atoms, and transmembrane regions can be
identified.
• Residues that have mutated together although they lie apart in the sequence may be
identified.
• Contigs can be assembled into a single sequence (genome assembly).
• Multiple sequence alignment is the initial step in the phylogenetic analysis
of sequences.
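Once sequences have been aligned, the first item above, finding conserved positions, reduces to inspecting the alignment column by column. The following sketch assumes the sequences are already aligned to equal length with "-" marking gaps; the example sequences are made up.

```python
def conserved_columns(aligned: list) -> list:
    """Return the indices of columns where every sequence has the same residue."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    conserved = []
    for i, column in enumerate(zip(*aligned)):
        # A column is conserved if it holds a single residue type and no gap.
        if len(set(column)) == 1 and column[0] != "-":
            conserved.append(i)
    return conserved


alignment = ["MKT-A", "MRTGA", "MKTGA"]   # hypothetical aligned proteins
positions = conserved_columns(alignment)
```

Producing the alignment in the first place is the hard part; this only reads information off a finished one.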
Phylogenetic analysis
Organisms acquire new genes when they meet new environments, and the new genes help
the individual to survive better there. New genes are created from existing genes by
duplication followed by variation. Some organisms also acquire new genes laterally
from other species.
Genes in a family that has arisen by duplication are related to each other as
siblings. These are homologous sequences, but they are specifically called paralogous
genes because they exist in the same genome.
Homologous genes that are distributed across several species and yet have a common
origin are called orthologous genes. Molecular phylogeny relies on counting the
mutational steps that have occurred between sequences to measure their phylogenetic
distance. The number of mutations also serves as a kind of molecular clock for
estimating the time that has passed between the common ancestral sequence and the
present sequences.
There are two major approaches to deciding the phylogenetic distance between
sequences: the phenetic and the cladistic. The phenetic approach is straightforward
and fast, but artificial. The cladistic approach is computationally intensive, but
natural. It gives different weights to amino acid substitutions based on the number
of codon changes needed to pass from the ancestral state to the current state.
Maximum parsimony and maximum likelihood are the two cladistic algorithms used in
phylogenetic analysis.
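The simplest way to turn counted differences into a distance can be sketched as follows. The proportion of differing sites (the p-distance) follows directly from the counting described above; the Jukes-Cantor correction for multiple substitutions at the same site is an addition not mentioned in the text and is shown only as one common, simple model for DNA.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for DNA: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the correction")
    return -0.75 * math.log(1 - 4 * p / 3)
```

The corrected distance is always at least as large as the raw proportion, reflecting substitutions that have been overwritten by later ones.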
Genomics
At present (2007) there are nearly 1000 completed genomes deposited in GenBank.
Genomes of model organisms such as mouse, rat, Drosophila, fish, chimpanzee, sea
urchin, worm, yeast, Neurospora, Arabidopsis, rice and poplar, and of several hundred
bacteria, archaea and viruses, are now available. Gene expression data, better known
as microarray data, are also accumulating in the public databases.
Genomic data analysis has three perspectives: structural genomics, comparative
genomics and functional genomics.
The genome is the complete set of information from which a zygote develops and
grows into an adult organism. If bioinformatics is what it is believed to be, it
should decipher the encoded genetic information that turns a single diploid cell into
a complete organism.
The objective of genome sequencing is not only to unravel the mysteries that shroud
developmental biology but also to improve crops and to mitigate human suffering from
disease. "If only I had the opportunity to have entire human genome in my computer I
wouldn't have spent 7 hard years to locate a single gene that was responsible for
Cystic fibrosis disorder. It would have taken only a few hours to do so," said
Collins, the Director of the Human Genome Project, before the project was completed.
Structural genomics
Structural genomics tries to locate the precise positions of genes and their
regulatory elements on the genome. Before genome sequences were available, scientists
relied on mapping to locate a gene. Neither genetic mapping (linkage mapping) nor
physical mapping (RFLP, SINE or LINE repetitive sequence position mapping) was as
precise as the genome sequence itself. The genome sequence is the ultimate gene map,
because it places each gene at its exact nucleotide.
Structural genomics aims at mapping genes, their regulatory elements, exons,
introns, poly-A tail start points, pseudogenes, paralogs, conserved non-coding
regions and so on, on the genome.
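A first, very rough approximation to locating genes on a sequence is to scan for open reading frames. The sketch below is a deliberately naive illustration: it looks only at the forward strand, assumes the standard start codon ATG and stop codons TAA, TAG and TGA, and ignores introns, so it is closer to a prokaryotic case than to real gene prediction. The min_codons threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list:
    """Return (start, end) index pairs of forward-strand ORFs, end exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):                       # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                        # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return orfs
```

Each returned pair can be used to slice the original sequence, for example dna[start:end], to obtain the candidate coding region.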
Functional genomics
Assigning functions to genes and studying the regulation of their expression under
different conditions is called functional genomics. The recently introduced
microarray analysis of an entire genome under varied experimental conditions has
revolutionized our understanding of gene expression. It is now possible to study the
expression levels of 10,000 genes under varied physiological conditions in a single
experiment.
Microarray data for yeast gene expression under 600 different sets of
physiological conditions are available from the Stanford Microarray Database, and
they are only a fraction of the data available there.
These data allow biologists to compare gene expression levels in the diseased
state and the normal state. Comparing cancer cells with normal cells would reveal the
genes that are involved in the cancerous state. Genes that are under-expressed,
over-expressed, co-expressed or contra-expressed can be clustered and studied in
detail in the laboratory. Computer-aided drug designers rely on these analyses to
find suitable drug target proteins.
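The comparison of diseased and normal expression levels described above is often summarised as a log2 ratio per gene. The sketch below labels genes as over-expressed or under-expressed on that basis. The two-fold threshold, the gene names and the intensity values are all hypothetical, and real microarray analysis also involves normalisation and statistical testing that are left out here.

```python
import math


def classify_expression(normal: dict, diseased: dict,
                        threshold: float = 1.0) -> dict:
    """Label each shared gene by its log2(diseased / normal) ratio."""
    labels = {}
    for gene in normal.keys() & diseased.keys():
        n, d = normal[gene], diseased[gene]
        if n <= 0 or d <= 0:
            continue                      # a ratio is undefined for these values
        log_ratio = math.log2(d / n)
        if log_ratio >= threshold:
            labels[gene] = "over-expressed"
        elif log_ratio <= -threshold:
            labels[gene] = "under-expressed"
        else:
            labels[gene] = "unchanged"
    return labels


normal_cells = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0}   # made-up intensities
cancer_cells = {"geneA": 400.0, "geneB": 25.0, "geneC": 110.0}
labels = classify_expression(normal_cells, cancer_cells)
```

With the default threshold of 1.0, a gene must change at least two-fold in either direction to be flagged.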
Comparative genomics
Comparison of two or more genomes with each other is a major study in
bioinformatics. Genome-level comparison of a couple of species supplies information
such as:
• Genome evolution
• Differences in metabolic pathways
• Differences between parasitic and saprophytic biology
• Toxigenic versus non-toxigenic genes
• Drug-resistant versus drug-susceptible strains
• Tolerant versus intolerant strains
Comparing two closely related genomes with a distantly related genome has revealed
several highly conserved but non-coding regions of the genome. Genomes are compared
at several levels. One starts with comparison at the nucleotide level: dinucleotide
frequencies, codon preference, CpG islands and so on reveal interesting differences
between organisms. Comparison at the gene level reveals the evolution of genes and
gene organization, gene synteny, chromosomal rearrangements and so on. Comparing
gene regulatory elements reveals the operon systems. Finally, one may go up to the
level of comparing the genes related to various metabolic functions and gene ontology.
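The nucleotide-level comparison mentioned above starts with simple counting. The sketch below tallies overlapping dinucleotides and computes the observed-to-expected CpG ratio, a quantity commonly used when looking for CpG islands. The function names are illustrative, and no island-calling thresholds are applied.

```python
from collections import Counter


def dinucleotide_counts(dna: str) -> Counter:
    """Count every overlapping two-base word in the sequence."""
    dna = dna.upper()
    return Counter(dna[i:i + 2] for i in range(len(dna) - 1))


def cpg_observed_expected(dna: str) -> float:
    """Observed CG count divided by the count expected from C and G content."""
    dna = dna.upper()
    c, g = dna.count("C"), dna.count("G")
    if c == 0 or g == 0:
        return 0.0
    return dinucleotide_counts(dna)["CG"] * len(dna) / (c * g)
```

Running the same functions over two genomes, or over windows along one genome, gives directly comparable profiles.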
Proteomics
'One gene, one enzyme' has become a sentence of historic importance only, because
one gene, several proteins is now known to be the reality in eukaryotic systems. That
a single gene can code for more than one protein through alternative splicing of
exons has resolved the riddle of how a cell can possess three times as many proteins
as genes. We were unable to match phenotypes with genotypes because we were stuck
with genes alone. Indeed, proteins are the real stalwarts that manifest the
phenotype, and through them the epigenetic mysteries become explainable.
Proteomics is the documentation of every individual protein of a cell. It includes
understanding the structure and function of the proteins. Just like genomics,
proteomics has three perspectives. Structural proteomics deals with constructing
protein structures. Functional proteomics is about the expression of proteins in a
cell under varied conditions. Comparative proteomics deals with comparing the entire
protein complement of two different cells, or of one cell under different
conditions. In addition to these perspectives, the study of protein-protein
interaction has become a major interest in biology.
Protein function prediction
The volume of protein sequence and structure data is impressive. Yet, although
UniProt is a well-curated protein database, 80 percent of the proteins in a given
genome are only hypothetical or putative. Assigning functions to these hypothetical
proteins is a major task of bioinformatics.
Structure prediction
Protein structures are elucidated by X-ray crystallography and nuclear magnetic
resonance (NMR) spectroscopy. Crystallography is possible only after the protein has
been purified to a very high degree and crystallized. As transmembrane proteins are
not readily amenable to purification and crystallization, the number of Protein Data
Bank entries for them is limited. Unfortunately, many of the proteins implicated in
human diseases are membrane proteins, so there is an urgent need to construct protein
structures from sequence knowledge alone.
Without the structure of the target protein, structure-based computer-aided drug
design is impossible. A multi-billion opportunity awaits the person capable of
modeling membrane proteins theoretically.
Protein interaction
Proteins function in teams. They are capable of self-assembly: many proteins
assemble into a single entity and bring about a particular cell function
cooperatively. How proteins recognize their partners and interact precisely is a
fascinating study. Though yeast two-hybrid experiments are now a widely used
technique for studying protein-protein interaction, bioinformatics methods have
great scope in the near future.
Metabolomics
Nowadays increasing interest is shown in sequencing the genomes of extremophilic
microorganisms. Many archaea are extremophiles, capable of living at temperatures
near the boiling point of water and at pressures of hundreds of atmospheres. The
genes and proteins of such organisms are extremely useful in biotechnology. One may
recall how Taq polymerase, an enzyme derived from the heat-loving bacterium Thermus
aquaticus, revolutionized genetic engineering and biotechnology through PCR.
Bioinformatics skills can help a person to reconstruct the entire metabolic pathway
network of an organism from its genomic sequence alone. Predicting the entire set of
metabolic pathways from genomic sequences is now possible thanks to the KEGG
metabolic pathway database.
Comparative genomics reveals only one fourth of the information. Proteomics
supplies another fifty percent, and only metabolomics completes the picture. While
the former two are static, the latter captures the dynamics of the cell. In other
words, metabolomics fulfills the demands of bioinformatics.
Systems Biology
The pinnacle of bioinformatics is the simulation of cellular processes inside a
computer. Simulating the cellular system as a whole in a computer is systems
biology.
A cell, with all its metabolic complements, regulation, energetics and dynamics, can
be simulated only if we know its entire biology. Though simulating a human cell may
be the ultimate goal, our current knowledge permits only the simulation of a minimal
prokaryotic cell.
A minimal cell is one that is capable of living independently and carrying out all
its biological activities, such as growth, reproduction and adjustment to the
environment, with the minimum number of genes. The smallest genomes, comprising only
500-odd genes, belong to a couple of parasitic bacteria. A truly independent cell
might need at least 1000 genes. Once the minimal gene set is known, a true E-Cell can
be realized in full.
Cell simulation has enormous potential. It is a boon for biotechnologists, who can
study gene expression and metabolic regulation and perform gene manipulations
virtually. A virtual cell would predict the toxicology, pharmacodynamics and
pharmacokinetics of drugs without even touching a mouse or rat in the laboratory.
Animal activists could rest in peace.
Bioinformatics spin off

Though the main objectives of bioinformatics are the storage, maintenance, updating
and supply of biological data on the one hand, and the extraction of information and
its conversion into biological knowledge on the other, it has also generated several
useful spin-off techniques for biologists. Several bioinformatics tools have relieved
biologists from the drudgery of repetitive operations and trial-and-error methodologies:
• Identifying disease target proteins from microarray data or from
comparative genomics studies
• Primer design: the short flanking sequences of a gene that are used as
primers for the PCR reaction can be predicted with precision
• Using disease markers for disease diagnosis
• Selecting suitable plasmid vectors and restriction enzymes for gene
cloning
• Predicting the antigenic sites of a protein for developing vaccines
• Distinguishing strains and species through phylogenetic analysis
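The primer design item in the list above can be illustrated with a small sketch: take the stretch immediately upstream of a target region as the forward primer and the reverse complement of the stretch immediately downstream as the reverse primer, then estimate a melting temperature. The Wallace rule used here (2 degrees per A or T, 4 degrees per G or C) is a rough approximation suited only to short oligonucleotides. The function names and the default primer length of 18 are assumptions, and real primer design also checks for hairpins, dimers and specificity.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def wallace_tm(primer: str) -> int:
    """Approximate melting temperature in degrees C by the Wallace rule."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def flanking_primers(template: str, start: int, end: int, length: int = 18) -> tuple:
    """Primers flanking template[start:end], both written 5' to 3'."""
    template = template.upper()
    if start - length < 0 or end + length > len(template):
        raise ValueError("not enough flanking sequence for primers of this length")
    forward = template[start - length:start]
    reverse = reverse_complement(template[end:end + length])
    return forward, reverse
```

A primer pair with similar estimated melting temperatures is usually preferred, which is why the two helper functions are kept separate.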
Computer Aided drug design

The convergent point of bioinformatics is knowledge-based drug discovery. Every
year 10,000 new chemicals are sent to the FDA for approval, and only 5 percent of
them are rejected. There are 75,000 drug molecules on the market, and more than a
million natural substances are known to exist. Why, then, are there so few promising
drugs that are specific and devoid of side effects?
The National Cancer Institute (US) has screened 10,000 natural substances every year
for the past decades against 60 different types of cancer cell line, and has come up
with just a handful of substances that really worked on them.
These approaches are blind and irrational. Bioinformatics offers rational,
knowledge-based drug design. Computer-aided drug design has the following phases:
1. Identification of the disease target gene (protein): microarray-based
identification, comparative genomics, consulting the OMIM database and
extensive literature mining.
2. Validation of the target: knock-out experiments, SNP analysis.
3. Constructing the 3D structure of the target protein, by homology modeling or
the classical crystallography approach.
4. Locating the ligand-interacting site on the target molecule with an
appropriate computational tool.
5. De novo lead construction: based on the receptor site, a lead is constructed
by either the link or the grow method.
6. QSAR analysis, using the homologation or bioisosterism approach, to
convert an already known lead into a more potent and safer one.
7. Screening the lead candidate for ADME/Tox properties and compliance with
Lipinski's rules.
8. Docking the selected lead onto the target and fine-tuning the structure further
if needed.
9. Asking the chemist to synthesize the drug or prodrug.
10. Going for clinical trials.
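Step 7 above mentions Lipinski's rules, which are simple enough to check in a few lines. The rule of five suggests that an orally active drug usually has a molecular weight of at most 500 daltons, a logP of at most 5, no more than 5 hydrogen bond donors and no more than 10 hydrogen bond acceptors. The sketch assumes these four descriptors have already been computed elsewhere; the example values are hypothetical, and allowing one violation is a common convention adopted here as an assumption.

```python
def lipinski_violations(mol_weight: float, log_p: float,
                        h_donors: int, h_acceptors: int) -> list:
    """List which of the four rule-of-five criteria a compound breaks."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500")
    if log_p > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("H-bond donors > 5")
    if h_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def passes_rule_of_five(mol_weight: float, log_p: float, h_donors: int,
                        h_acceptors: int, max_violations: int = 1) -> bool:
    """True if the compound breaks no more than max_violations criteria."""
    found = lipinski_violations(mol_weight, log_p, h_donors, h_acceptors)
    return len(found) <= max_violations


# Hypothetical descriptor values for two candidate leads.
good_lead = passes_rule_of_five(350.0, 2.5, 2, 5)
poor_lead = passes_rule_of_five(720.0, 6.2, 6, 12)
```

Such a filter is only a quick screen for likely oral bioavailability; it says nothing about potency or toxicity.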
Conclusion
Like any other branch of science, bioinformatics started with humble objectives and
a cold reception. Unlike other scientific disciplines, however, it has soared to
unimaginable heights. It differs from other sciences in viewing a problem from the
whole to the part, rather than from the part to the whole. Its gestalt view of
problems, and its parentage in several branches of science, are unique. Molecular
biologists, pathologists, biologists, computer scientists, statisticians,
mathematicians, biophysicists and biochemists have joined hands in the development
of bioinformatics.
Biology is no longer the same. A biologist who leaves out bioinformatics is at a
serious disadvantage. The future battles against human suffering and for the
conservation of nature are going to be fought with the weapons of bioinformatics.
