

Bioinformatics Practical Manual

Book · January 2014


2 authors:

Kailash Chandra Samal, Orissa University of Agriculture & Technology
Gyana Ranjan Rout, Siksha O Anusandhan University

All content following this page was uploaded by Kailash Chandra Samal on 31 August 2016.



Bioinformatics Practical Manual

Kailash Chandra Samal


Amrita Priyadarsini
Gyana Ranjan Rout
Anath Bandhu Das
Iswar Ch. Mohanty

Department of Agricultural Biotechnology


College of Agriculture
Orissa University of Agriculture & Technology
Bhubaneswar
2014
Bioinformatics Practical Manual

First Edition: 2014

©Copyright 2014 Orissa University of Agriculture and Technology,


Bhubaneswar, Odisha

All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system or transmitted by any means, electronic, mechanical,
photocopying, recording or otherwise, without the prior written permission of
the publisher.

Printed at:
Bhubaneswari Traders, Nayapalli, Bhubaneswar
Preface
The idea of writing a Bioinformatics Practical Manual originated from
our experience of teaching biotechnology and bioinformatics at Orissa University
of Agriculture and Technology, Bhubaneswar, Odisha. The students needed a
write-up that was comprehensive enough to cover all major aspects of the field,
technical enough and sufficiently up to date to include the most current developments,
while at the same time being logical and easy to understand. The students' interest
motivated us to write this bioinformatics manual to meet that need. It is
written specifically for biotechnology and bioinformatics students, and it
explains the basics of bioinformatics. All key areas of bioinformatics are covered,
including biological databases, sequence alignment, gene and promoter prediction,
molecular phylogenetics, structural bioinformatics, genomics, and proteomics. The
manual emphasizes the practical aspects of bioinformatics. Efforts have
been made to include all essential aspects of bioinformatics. It is hoped that this
publication will help teachers, students and technicians to keep up their
practical knowledge of the various dimensions of bioinformatics.
We are grateful to Prof. Manoranjan Kar, Vice Chancellor, for his
encouragement and valuable guidance in bringing out this publication. The
constant encouragement and guidance of Prof. B.K. Mishra, Dean, College of
Agriculture, in the preparation of this publication is duly acknowledged. The help of
the ICAR in granting financial assistance for bringing out this publication is
gratefully acknowledged.

Gyana Ranjan Rout


Kailash Chandra Samal
Message-1
Visit us at: www.cabbsouat.org
Tel & Fax : (0674) 2397375 (O), 2561458 (R)
Fax: 0674-2397780
Email: deanca@rediffmail.com
College of Agriculture
ORISSA UNIVERSITY OF AGRICULTURE & TECHNOLOGY
BHUBANESWAR 751003, ODISHA

Prof. B.K. Mishra


DEAN Dated the 31st March, 2014

Message
Economic growth and development in India continue to be propelled by
growth in agriculture and allied sectors. This can only be achieved through
technological advancements and competent human resources to serve the needs of
farmers. Today, agricultural production through most conventional science and
technology innovations has reached a plateau. Therefore, there is a need to break
the plateau. To put the country’s agricultural growth on a fast track, the
development of cutting-edge technologies such as biotechnology and
bioinformatics is the need of the hour. Biotechnology is based on techniques
involving genes, genomes, nucleic acids and other related macro- and micro-biomolecules.
Bioinformatics applies computer-based information technology to the
storage, retrieval and analysis of the vast databases being generated on genes,
genomes and nucleic acids.
I am delighted that the Department of Agricultural Biotechnology, College
of Agriculture, OUAT is going to publish the “Bioinformatics Practical Manual” for
UG, PG and Ph.D. students of agriculture. This publication will strengthen the
knowledge of students, researchers and faculty members on various techniques in
the areas of bioinformatics. I am confident that this manual will be very helpful for
the students, researchers and faculty members.
Content
Chapter / Exercise No.    Particulars    Page
Chapter 1 Biological background 1
Chapter 2 Scope and application of bioinformatics 16
Chapter 3 Databases and its structure 28
Chapter 4 Biological database 33
Chapter 5 Database retrieval system 47
Chapter 6 Cataloging biological database 49
Chapter 7 Pairwise Sequence Alignment 54
Chapter 8 Multiple sequence alignment 57
Chapter 9 Practical exercises 58
Exercise 1. Making search for the scientific literature and sequences 58
Exercise 2. Characterization of a known Gene 61
Exercise 3. Finding out open reading frames (ORF) 65
Exercise 4. Translating an unknown DNA Sequence 68
Exercise 5. Identifying a gene using BLAST program 71
Exercise 6. Finding Domains in Protein Sequences 74
Exercise 7. Nucleotide BLAST (BLASTn) 76
Exercise 8. Protein BLAST (BLASTp) 79
Exercise 9. Translated BLAST (BLASTx) 81
Exercise 10. tBLASTx 83
Exercise 11. Position-Specific Iterated BLAST (PSI-BLAST) 85
Exercise 12. FASTA 88
Exercise 13. Editing and analyzing multiple sequence alignment using Jalview 91
Exercise 14. Making multiple alignment with T-coffee 93
Exercise 15. Online Mendelian Inheritance in Man (OMIM) 96
Exercise 16. Protein Structure Database 98
Exercise 17. Depositing sequences in database 99
Exercise 18. Submitting sequences to GenBank through BankIt 100
Exercise 19. Submitting sequences to GenBank through ‘Sequin’ 103
Exercise 20. Primer Designing 107
Bioinformatics Practical Manual K. C. Samal et al.

Chapter 1
Biological background
Bioinformatics is a tool for providing insight into the structures and
functions of biomolecules: DNA, RNA and proteins. In particular, bioinformatics
deals with the task of understanding the chemically encoded information that
controls the processes ongoing in all living organisms. Bioinformatics is
usually concerned with applying statistical and computational methods to analyze
biological data obtained from wet lab experiments, sequencing projects or the
simulation of protein-protein interactions, and with how these analyses can help us understand
the evolution of organisms and biological processes. It also provides insight into
the Central Dogma of Molecular Biology, which characterizes the mappings
between different types of biopolymer (DNA, RNA and protein). Strictly speaking,
the Central Dogma is a list of the usual transitions between different biomolecules
within an organism. The theory classifies three styles of 'maps' between
biopolymers as follows:
(i) General transfers:
(a) DNA to DNA (DNA replication)
(b) DNA to RNA (transcription)
(c) RNA to protein (translation)
(ii) Special transfers:
(a) RNA to RNA (RNA replication)
(b) RNA to DNA (reverse transcription)
(c) DNA to protein (direct translation)
(iii) Unknown transfers:
(a) Protein to RNA
(b) Protein to DNA
(c) Protein to protein
General transfers are those that happen continuously in organisms, whereas
special transfers happen rarely and often only in special situations. No unknown
transfers are recorded to have happened, although prions, which can manipulate
proteins, may be considered by some to effect protein to protein transfers. This
introduction concentrates on the general pathway from DNA to RNA to protein, as the
study of this particular process yields the most practical
applications in areas such as gene therapy. It is important to understand the nature
and function of the biopolymers themselves and also the mechanisms connecting
them, and that will be the aim of this introduction.

Biopolymer: DNA
DNA (deoxyribonucleic acid) is a linear biopolymer: a helix-shaped molecule whose
constituents are two antiparallel strands of nucleotides. There
are four types of nucleotides in DNA and they correspond to the letters A (for
adenine), T (thymine), C (cytosine) and G (guanine). DNA is usually represented
by sequences of these four letters. An A on one strand always pairs with a T on the
opposite strand through two hydrogen bonds, while a C always pairs with a G through
three hydrogen bonds, as these nitrogenous bases are complementary to each other. The two
strands are therefore complementary to each other: one strand runs in the 5’ to 3’
direction while the other runs in the 3’ to 5’ direction. The sequential arrangement of the
individual nucleotides is responsible for the uniqueness of any individual living form, be it
human, animal, plant, or microbe.
This assumes that only one strand is considered; the second strand is always
derivable from the first by pairing A’s with T’s and C’s with G’s and vice versa.
That derivation is called finding the reverse complementary pair of a strand.
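The derivation of the second strand from the first can be sketched in a few lines of Python. This is a minimal illustration, assuming the sequence contains only the letters A, C, G and T; the function name `reverse_complement` is chosen here for illustration and does not come from the manual.

```python
# Watson-Crick pairing rules stated in the text: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written in its own 5' to 3' direction."""
    # Complement every base, then reverse because the strands are antiparallel.
    return strand.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGCCATC"))  # GATGGCAT
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.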
DNA is the chemical basis of life; it complexes with proteins to form the
chromosomes. The double helix structure of DNA (B form) was discovered by James
Watson and Francis Crick in 1953, and they received the Nobel Prize in 1962.
The DNA is split into disjoint intervals made from sequences of
nucleotides. Within a gene, these intervals are categorized as either introns or exons: exons
are those parts that are retained in the mature RNA, and introns
(sometimes loosely described as junk DNA) are those intervals that are transcribed but
removed before translation. DNA can be considered to be a
mostly constant trait in a given multi-cellular organism, as it varies negligibly
between the cells under normal conditions.

Biopolymer: RNA
Conversely, the RNA transcribed in an organism, though it is
derived from DNA and is structurally similar (however, not a double helix), varies with
regard to several factors, including time and environmental factors such as
intracellular chemical gradients. The theory of how these extraneous factors affect
the derivation of RNA from DNA is of specific importance to bioinformatics
projects. RNA preserves the information stored in DNA, as the nucleotides present
in RNA 'complement' the nucleotides of DNA, except that adenine is now
complemented by uracil.
Biopolymer: Protein
Proteins are the active agents that govern the metabolic, structural and
signaling processes at work in an individual organism. The translational map
creating protein from mRNA (messenger RNA, the specific type involved in RNA
to protein translation) is mostly determined by the underlying mRNA and, in fact,
each 'codon' (a sequence of three RNA nucleotides) corresponds exactly to an
amino acid, the constituent building block of proteins (or to a start codon,
a stop codon or an untranslated triplet).
Chromosomes and Genes
Each chromosome is a long piece of DNA. Humans have 46 chromosomes (2
sets of 23, one set from each parent); each set contains about 3.12 billion nucleotides (bases).
Genes are just regions on that DNA: contiguous subparts of single-stranded
DNA that are templates for producing proteins. Genes can appear on either
of the DNA strands. The complete DNA content of a given organism, including all of its
genes, is called the genome of that organism. The function of the DNA material between genes is largely
unknown. Certain intergenic regions of DNA (called noncoding) are known to play
a major role in cell regulation, the process that controls the production of proteins
and their possible interactions with DNA. Proteins are produced from DNA using
three operations or transformations called transcription, splicing, and translation.
DNA is capable of replicating itself. The cell machinery that performs that task is
called DNA polymerase. Biologists call the capability of DNA for replication and
for undergoing the above three (or two) transformations the central dogma. Genes are
transcribed into pre-RNA by a complex ensemble of molecules called RNA
polymerase. During transcription the nucleotide T (thymine) is substituted by
another one designated by the letter U (for uracil). Pre-RNA can be represented by
alternations of sequence segments called exons and introns. The exons represent
the parts of pre-RNA that will be expressed, that is, translated into proteins. Next
comes the operation called splicing; an ensemble of proteins and small RNAs called the spliceosome
performs it. Splicing consists of concatenating the exons and excising the introns to
form what is known as mRNA, or simply RNA. The final phase, called translation,
is essentially a “table look-up” performed by complex molecules called ribosomes
(an ensemble of RNA and proteins). Translation repeatedly considers a triplet of
consecutive nucleotides in RNA and produces one corresponding amino acid. The
triplet is called a codon. In RNA, there is one special codon called the start codon
and a few others called stop codons. An open reading frame (ORF) is a
sequence of codons starting with a start codon and ending with a stop codon. The
ORF is thus the sequence of nucleotides that is used by the ribosome to produce the
sequence of amino acids that makes up a protein. There are 20 standard amino
acids but, in certain rare situations, others can be added to that list. Since there are
64 different codons and 20 amino acids, the “table look-up” for translating each
codon into an amino acid is redundant in the sense that multiple codons can
produce the same amino acid. The “table” used by nature to perform translation is
called the genetic code. Due to the redundancy of the genetic code, certain
nucleotide changes in DNA may not alter the resulting protein. Once a protein is
produced, it folds (most of the time) into a unique structure in 3D space. In the 3D
representation of a protein, one can distinguish three different types of components:
α-helices, β-sheets and coils. The secondary structure of a protein is its sequence of
amino acids, annotated to distinguish the boundaries of each component: helices,
sheets, and coils. The tertiary structure of a protein is its 3D representation. The
function of a protein is the way it participates with other proteins and molecules in
keeping the cell alive and interacting with its environment. Function is closely
related to tertiary structure. In functional genomics, one studies the function of all
the proteins of a genome. One of the important goals of bioinformatics is to help
biologists decipher the function of proteins.
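The first two operations described above, transcription and splicing, can be sketched as simple string operations. This is only an illustration under stated assumptions: the toy gene and its exon coordinates are hypothetical, the input is taken to be the coding strand, and exon boundaries are supplied by hand rather than predicted.

```python
def transcribe(coding_dna: str) -> str:
    """Pre-RNA carries the coding-strand sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def splice(pre_rna: str, exons: list[tuple[int, int]]) -> str:
    """Concatenate the exon intervals (0-based, end-exclusive); introns are dropped."""
    return "".join(pre_rna[start:end] for start, end in exons)


if __name__ == "__main__":
    # Hypothetical toy gene: exon (6 bases) + intron (12 bases) + exon (6 bases).
    gene = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"
    exons = [(0, 6), (18, 24)]  # assumed coordinates, for illustration only
    pre_rna = transcribe(gene)
    mrna = splice(pre_rna, exons)
    print(pre_rna)  # AUGGCCGUAAGUUUUCAGAAAUGA
    print(mrna)     # AUGGCCAAAUGA
```

In practice exon boundaries come from annotation or gene-prediction software; the sketch only shows that splicing is a concatenation of intervals.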
Genes and Proteins
Most genes code for proteins: each such gene contains the information necessary
to make one protein. Proteins are the most important type of macromolecule, and
they are of different types.
Structural proteins: collagen in skin, keratin in hair, crystallin in the eye.
Enzymes: all metabolic transformations (building up, rearranging, and breaking
down of organic compounds) are done by enzymes, which are proteins.
Transport proteins: oxygen in the blood is carried by hemoglobin, and everything
that goes in or out of a cell (except water and a few gases) is carried by proteins.
Others: nutrient proteins (egg yolk), hormones, defense and movement proteins.
Open Reading Frames
Since codons consist of 3 bases, there are 3 “reading frames” possible on an
RNA (or DNA) strand, depending on whether you start reading from the first base, the
second base, or the third base. The different reading frames give entirely different
proteins. Consider ATGCCATC, and refer to the genetic code (X marks leftover
bases that do not form a complete codon).
Reading frame 1 divides this into ATG-CCA-TC, which translates to Met-Pro-X.
Reading frame 2 divides this into A-TGC-CAT-C, which translates to X-Cys-His-X.
Reading frame 3 divides this into AT-GCC-ATC, which translates to X-Ala-Ile.
Each gene uses a single reading frame, so once the ribosome gets started, it just has
to count off groups of 3 bases to produce the proper protein. Another example of
reading frames is shown below.
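The three-way split of ATGCCATC worked through above can be reproduced with a short Python sketch. The helper name `split_frame` and its return layout are assumptions made for this illustration; it only cuts the sequence into codons and reports the leftover bases (the X positions), without translating.

```python
def split_frame(seq: str, frame: int) -> tuple[str, list[str], str]:
    """Split seq according to reading frame 1, 2 or 3.

    Returns (leading leftover bases, complete codons, trailing leftover bases).
    """
    offset = frame - 1                    # frame 1 starts at the first base
    body = seq[offset:]
    usable = len(body) - len(body) % 3    # length covered by complete codons
    codons = [body[i:i + 3] for i in range(0, usable, 3)]
    return seq[:offset], codons, body[usable:]


if __name__ == "__main__":
    for frame in (1, 2, 3):
        print(frame, split_frame("ATGCCATC", frame))
    # 1 ('', ['ATG', 'CCA'], 'TC')
    # 2 ('A', ['TGC', 'CAT'], 'C')
    # 3 ('AT', ['GCC', 'ATC'], '')
```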


To find a gene, the first job is to find long ORFs, examining the longest
ORFs first and putting together a set with minimal overlaps. It is also necessary to
identify potential start codons, with the furthest upstream start codon as the easiest
choice. Then, how do we know that the ORF contains a real gene? The most
definitive way is to match it with a gene known from another species: conservation of
a sequence between species strongly suggests that the sequence has a function that
is being conserved by natural selection. We compare protein sequences, not DNA,
because protein is more conserved in evolution than DNA. The organism’s survival
depends on the protein being functional, which means having the proper amino
acid sequence. Since the genetic code is degenerate, many different DNA
sequences will give identical proteins. The protein's 3-dimensional structure is even
more conserved, because it is more closely related to enzyme activity than the
amino acid sequence is. However, we don’t have good ways of determining 3-D
structure from a DNA sequence.
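The first step described above (scan for ORFs, take the furthest upstream start codon, examine the longest ORFs first) might be sketched as follows. This is a simplified illustration, not the method used by any particular ORF-finding tool: it scans only the given strand, treats ATG as the only start codon, and the function name and `min_codons` parameter are assumptions introduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for ORFs on the given strand, longest first.

    start and end are 0-based and end-exclusive; the span includes the stop
    codon, and min_codons counts the stop codon too. For each stop codon only
    the furthest upstream ATG in the same frame is reported.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # furthest upstream start
            elif codon in STOP_CODONS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None                   # look for the next ORF
    return sorted(orfs, key=lambda orf: orf[2] - orf[1], reverse=True)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATGCCCTAAGGATGTGA"))
    # [(2, 7, 16), (1, 18, 24)]
```

Because genes can lie on either strand, a fuller search would also scan the reverse complement of the sequence, giving six reading frames in total.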


Genetic Code
Proteins are long chains of amino acids. There are 20 different amino acids coded
in DNA. There are only 4 DNA bases, so 1 or 2 bases per codon (4 or 4 x 4 = 16
combinations) would be too few, and you need 3 DNA bases to code for the 20 amino
acids: 4 x 4 x 4 = 64 possible 3-base combinations (codons). Each codon codes for one
amino acid, except for the stop codons. Most amino acids have more than one possible
codon. Genes start at a start codon and end at a stop codon. Three codons are stop
codons. All genes end at a stop codon. Start codons are a bit trickier, since they
are used in the middle of genes as well as at the beginning. In eukaryotes, ATG is
almost always the start codon, making methionine (Met) the first amino acid in
proteins (but in many proteins it is immediately removed). In prokaryotes, ATG, GTG,
or TTG can be used as a start codon.
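The counting argument above (64 codons, 20 amino acids, 3 stop codons, most amino acids with several codons) can be checked with a small Python sketch. The table below is the standard genetic code written in the compact one-letter form used by NCBI translation table 1, with `*` marking stop codons; the variable names are chosen for this illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last base fastest, which matches the order of AMINO_ACIDS.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_residue = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    print(len(CODON_TABLE))                # 64 codons = 4 x 4 x 4
    print(len(codons_per_residue) - 1)     # 20 amino acids (excluding '*')
    print(codons_per_residue["*"])         # 3 stop codons
    print(codons_per_residue.most_common())
```

The counts show the degeneracy directly: leucine, serine and arginine each have six codons, while methionine and tryptophan have only one.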
Gene Expression
How do you get a protein from a gene? It is a two-step process (called the
Central Dogma of Molecular Biology). First, the gene has to be copied
(transcribed) into an RNA form. The RNA copy (messenger RNA) is exactly like
the gene itself, except that RNA replaces T with U. Second, the RNA is translated into protein
by ribosomes, which are complex RNA/protein hybrid machines, with the help of
transfer RNA molecules, which have one end that matches the 3-base codon and the
other end attached to the proper amino acid. The ribosome starts at the start
codon and moves down the messenger RNA, adding one amino acid at a time to the
growing chain. When the ribosome reaches a stop codon, it falls off, releasing the
new protein.


Transcription (Nucleus):
- In the nucleus, helicase activity associated with the transcription machinery
  causes the twisted DNA molecule to unwind locally.
- One strand of the DNA is used as the template strand for RNA synthesis.
- RNA polymerase begins synthesizing RNA from the DNA template at the
  promoter sequence (a sequence that lets the RNA polymerase know where
  to begin).
- The synthesized RNA is called mRNA (messenger RNA); it leaves
  the nucleus and goes to the cytoplasm.
Translation (Cytoplasm):
- In the cytoplasm, the ribosome, which is built from rRNA (ribosomal RNA) and
  proteins and consists of a small and a large subunit, comes together to
  provide a site for translation to occur.
- tRNA (transfer RNA) is the RNA responsible for bringing the amino
  acid that should be added to the chain next.
- mRNA, rRNA, and tRNA all come together to perform translation.
- Each mRNA codon codes for a specific amino acid, tRNA retrieves that amino acid,
  and rRNA provides a surface for this to occur.
- When tRNA brings the first correct amino acid, a polypeptide chain is
  started.
- One amino acid is added at a time, and they are connected by peptide
  bonds.
- When the chain is finished, a protein is formed.
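The translation steps listed above (start at the start codon, add one amino acid per codon, stop at a stop codon) can be sketched as a table look-up. This is a simplified illustration: it starts at the first AUG it finds, uses the standard genetic code in one-letter form, and the function name `translate` is an assumption of this sketch rather than part of the manual.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    mrna = mrna.upper().replace("T", "U")   # accept DNA letters as well
    start = mrna.find("AUG")
    if start == -1:
        return ""                           # no start codon, no protein
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":                  # stop codon: release the chain
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGCCAUCUUAAGG"))  # MPS
```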

Genetic marker
A genetic marker is a DNA polymorphism that can be easily detected by molecular or
biochemical analysis. The marker can be within a gene or in DNA with no known
function. Because DNA segments that lie near each other on a chromosome tend to
be inherited together, markers are often used as indirect ways of tracking the
inheritance pattern of a gene that has not yet been identified, but whose
approximate location is known.
Primer
A primer is a short, single-stranded oligonucleotide sequence of 10-15 nucleotides used in a
polymerase chain reaction (PCR).
PCR
The development of the polymerase chain reaction (PCR) by Kary Mullis in 1985 was a
technological breakthrough. The principle of PCR is very simple. It is based on the function of a copying
enzyme, Taq DNA polymerase (obtained from the bacterium Thermus aquaticus, an
inhabitant of hot springs), which is able to synthesize a duplicate molecule of
DNA from a DNA template that is bracketed by the primers. The product of
duplication of the original template DNA becomes a second template for another
round of duplication. Repeated duplications thus lead to an exponential increase in
DNA product accumulation. Even when starting from a single DNA molecule,
detectable amounts of target DNA are generated by PCR in a few hours. DNA
polymerase was first isolated from Thermus aquaticus in 1976. In 1989 Science
magazine named Taq polymerase as its first "Molecule of the Year". In 1993, Dr.
Mullis was awarded the Nobel Prize for his work with PCR.
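The exponential increase described above can be made concrete with a short calculation. The sketch below assumes ideal doubling in every cycle by default; the `efficiency` parameter (the fraction of templates copied per cycle) is an assumption added here for illustration and is not part of the manual's description.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected number of copies after a given number of PCR cycles.

    efficiency = 1.0 means every template is duplicated in every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # Starting from a single molecule with perfect doubling:
    for n in (10, 20, 30):
        print(n, int(pcr_copies(1, n)))
    # 10 1024
    # 20 1048576
    # 30 1073741824
```

Thirty ideal cycles turn one template into about a billion copies, which is why a single molecule can yield a detectable product.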
DNA fingerprinting
DNA fingerprinting is a technique used by scientists to distinguish between individuals of the
same species using only samples of their DNA. It is a technique by which an
individual can be identified at the molecular level. With the advancement of science
and technology, VNTR (Variable Number of Tandem Repeats) and STR (Short
Tandem Repeats) analyses have become very popular in forensic laboratories. The
process of DNA fingerprinting was invented by Alec Jeffreys at the University of
Leicester, England, in 1985; he was knighted in 1994.
Scientists have chosen repeating sequences in the DNA, which are present
in all individuals on different chromosomes, and are known to vary from individual
to individual. These are used as genetic markers to identify the individual. The DNA
fingerprinting technique has been successfully used for identification of plant
species or cultivars, testing of seed purity, and detection of adulteration in food, seed
and other planting material. The technique also helps in resolving maternity and
paternity disputes, identifying cultivars or breeding material, wildlife forensics, and the
protection of farmers’ rights and biodiversity. This remarkable technology provides
positive identification with virtually 100% precision.
The DNA profile of an individual is unique. It can never be identical even in
biologically related individuals, except for identical (monozygotic) twins. The
chance of two people having exactly the same DNA profile is about 1 in 30,000 million
(except for identical twins).
Any biological material can serve as a source of DNA: leaves, seeds and other plant parts
in the case of plants, and a drop of blood, saliva, semen, or any body part such as bone,
tissue, skull, teeth, or hair with root in the case of animals and human beings.
Molecular markers
A molecular marker in the life sciences is defined as a DNA sequence, a
cytogenetic segment, a chromosome fragment, a protein or an enzyme used as
an experimental probe to keep track of an individual, a tissue, a cell, a nucleus, a
chromosome or a gene. Different types of markers are used in the life
sciences, and the uses of molecular markers are given below.
- Assessment of genetic variability and fingerprinting of genotypes
- Mapping of monogenic traits and quantitative trait loci (QTL) of economically
  important traits
- Estimation of genetic distance or degree of relatedness between populations,
  inbreds and breeding materials or among groups of accessions in
  germplasm
- Identification of sequences for candidate genes and economic breeding
  traits
- Marker-assisted selection for crop improvement in tissue-cultured plant
  species
- Genetic purity testing of seeds and micro-propagated plantlets
- Characterization and evaluation of plant genetic resources and their
  conservation
- Screening of transgenic plants for resistance genes using linked molecular
  markers
The different molecular markers are:
(a) Restriction fragment length polymorphisms (RFLPs)
Restriction fragment length polymorphisms (RFLPs) are identified using
restriction enzymes that cleave the DNA only at precise “restriction sites” (e.g.
EcoRI cleaves at the site defined by the palindromic sequence GAATTC). At
present, the most frequent use of RFLPs is downstream of PCR (PCR–RFLP), to
detect alleles that differ in sequence at a given restriction site. A gene fragment is
first amplified using PCR, and then exposed to a specific restriction enzyme that
cleaves only one of the allelic forms. The digested amplicons are generally resolved
by electrophoresis.
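The PCR-RFLP idea described above can be illustrated with an in-silico digest. This is a minimal sketch: the two allele sequences are hypothetical, only one strand is considered, and the cut is placed after the G of GAATTC (EcoRI cuts between G and AATTC), ignoring the sticky ends. The function name `digest` is an assumption of this example.

```python
ECORI_SITE = "GAATTC"
ECORI_CUT_OFFSET = 1  # EcoRI cuts between G and AATTC on the strand shown


def digest(amplicon: str, site: str = ECORI_SITE,
           cut_offset: int = ECORI_CUT_OFFSET) -> list[int]:
    """Return the fragment lengths after cutting a linear amplicon at every site."""
    amplicon = amplicon.upper()
    cuts = []
    pos = amplicon.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = amplicon.find(site, pos + 1)
    edges = [0] + cuts + [len(amplicon)]
    return [right - left for left, right in zip(edges, edges[1:])]


if __name__ == "__main__":
    # Hypothetical alleles that differ by one base inside the restriction site.
    allele_cut = "ACGTACGTAC" + "GAATTC" + "TTGGCCAAGG"
    allele_uncut = "ACGTACGTAC" + "GAGTTC" + "TTGGCCAAGG"
    print(digest(allele_cut))    # [11, 15]
    print(digest(allele_uncut))  # [26]
```

On a gel, the first allele would give two bands and the second a single band, and a heterozygote would show all three, which is what makes the marker co-dominant.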
Advantages
- RFLPs are co-dominant and can differentiate heterozygotes from
  homozygotes.
- It is a sensitive and highly reliable marker technique.
- It can identify a unique locus.
Disadvantages
- The technique is laborious and costly, and involves several time-consuming,
  tedious steps.
- The detection system uses radioisotopes or complex biochemistry.
- It requires a large amount of high-quality DNA.
- It requires species-specific primers/probes.
- It is not suitable for large-scale analysis of varieties/genomes.
- Automation is not possible.
(b) RAPD marker
RAPD (random amplified polymorphic DNA) is a PCR-based
method which employs short synthetic oligonucleotides (10-12 bases long) of
random sequence as primers to amplify DNA fragments from genomic template
DNA at low annealing temperatures. Amplification products are generally
separated on agarose gels and stained with ethidium bromide. The amplified DNA
fragments are scored visually and used for different analyses.
Advantages
- The RAPD technique is simple and cost-effective.
- The procedure requires very small amounts of DNA and does not require
  cloning or prior knowledge of the genome sequence.
- The same primer can be used across the genome.
- It is suitable for large-scale analysis of genotypes.
- Automation is possible and no radioactivity is required.
Disadvantages
- RAPDs are commonly dominant markers, so heterozygotes can’t be
  differentiated from homozygotes.
- RAPD is less reliable.
(c) Microsatellites or SSR


Microsatellites, also called SSRs (Simple Sequence Repeats) or STRs (Short Tandem
Repeats), consist of a stretch of DNA a few nucleotides long (2 to 6 base pairs, bp)
repeated several times in tandem (e.g. CACACACACACACACA). They are
spread over the eukaryotic genome. Microsatellites are of relatively small size, and
can, therefore, be easily amplified using PCR from DNA extracted from a variety
of sources including blood, hair, skin or even faeces. Polymorphisms can be
visualized on a sequencing gel, and the availability of automatic DNA sequencers
allows high-throughput analysis of a large number of samples (Goldstein and
Schlötterer, 1999; Jarne and Lagoda, 1996). Microsatellites are hypervariable; they
often show tens of alleles at a locus that differ from each other in the number of
repeats. They are still the markers of choice for diversity studies as well as for
parentage analysis and Quantitative Trait Loci (QTL) mapping, although this might
be challenged in the near future with the development of cheap methods for the
assay of SNPs. FAO has published recommendations for sets of microsatellite loci
to be used for diversity studies for major livestock species, which were developed
by the ISAG–FAO Advisory Group on Animal Genetic Diversity (see DAD-IS
library http://www.fao.org/dad-is/).
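A tandem repeat such as the CACACACACACACACA example above can be located with a regular expression. The sketch below is one simple way to do it, not the method of any specific SSR-mining tool: the threshold `min_repeats` and the function name are assumptions, and runs of a single base are skipped because they are not 2 to 6 bp motifs.

```python
import re


def find_microsatellites(dna: str, min_repeats: int = 4) -> list[tuple[int, str, int]]:
    """Return (start, motif, repeat count) for tandem repeats of 2-6 bp motifs."""
    # A motif of 2-6 bases (shortest first) followed by further copies of itself.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(dna.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # a run of one base (e.g. AAAA...) is not a 2-6 bp motif
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


if __name__ == "__main__":
    print(find_microsatellites("TTGG" + "CA" * 8 + "GGTT"))
    # [(4, 'CA', 8)]
```

Alleles at a microsatellite locus differ in the repeat count, so comparing the third element of each hit between individuals is the in-silico analogue of comparing band sizes on a gel.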
Advantages
- The technique is simple.
- It requires little DNA, and is fast and cost-effective.
- Microsatellite markers are co-dominant.
- These markers are abundant, distributed evenly throughout the genome, and
  show a high level of polymorphism compared to other markers.
- It is especially useful for analyzing closely related genotypes.
- It is suitable for large-scale analysis of genotypes.
Disadvantages
- It requires species-specific primers.
- The technique requires development of markers.
- The initial cost of developing microsatellite markers is high.
(d) Amplified fragment length polymorphisms (AFLPs)


Amplified Fragment Length Polymorphism is a molecular marker generated by a
combination of restriction digestion and PCR amplification.
Advantages
- AFLPs are highly polymorphic and evenly distributed throughout the plant
  genome, and hence serve as a useful tool for various genetic studies.
- It is suitable for large-scale analysis of genotypes.
- The technique can be used for DNA of any origin or complexity.
- It combines the advantages of both RFLP and RAPD.
Disadvantages
- AFLPs are mostly dominant in nature, and hence heterozygotes can’t be
  differentiated from homozygotes.
- It requires high-quality DNA.
- The procedure is somewhat complex and requires careful handling.
(e) Sequence Tagged Site (STS)
STS (Sequence Tagged Site) are DNA sequences that occur only once in a genome,
in a known position. They needn’t be polymorphic and are used to build physical maps.
(f) Single Nucleotide Polymorphism (SNP)
SNPs are variations at single nucleotides which do not change the overall length of
the DNA sequence in the region. SNPs occur throughout the genome. They are highly
abundant. Most SNPs are located in non-coding regions, and have no direct impact on the
phenotype of an individual. However, some introduce mutations in expressed sequences or
regions influencing gene expression (promoters, enhancers), and may induce changes in
protein structure or regulation. These SNPs have the potential to detect functional genetic
variation.
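Since SNPs do not change the length of the sequence, two copies of the same region can be compared position by position. The following sketch assumes the two sequences are already aligned and of equal length; the example sequences and the function name `find_snps` are illustrative only.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("SNP comparison assumes equal-length, pre-aligned sequences")
    pairs = zip(reference.upper(), sample.upper())
    return [(pos, ref, alt) for pos, (ref, alt) in enumerate(pairs, start=1)
            if ref != alt]


if __name__ == "__main__":
    print(find_snps("ATGCCATC", "ATGCTATC"))  # [(5, 'C', 'T')]
```

Real SNP calling works on many sequencing reads and has to separate true variants from sequencing errors; this sketch only shows the underlying comparison.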


Chapter 2
Scope and application of bioinformatics
Bioinformatics is the field of science in which biology, computer science
and information technology merge to form a single discipline. It is the collection,
organization, analysis, presentation and sharing of biological data to solve
biological problems at the molecular level. It is an interdisciplinary scientific field
that develops methods for storing, retrieving, organizing and analyzing biological
data. A major activity in bioinformatics is to develop software tools to generate
useful biological knowledge. Bioinformatics uses many areas of computer science,
statistics, mathematics and engineering to process biological data. Databases and
information systems are used to store and organize biological data. Analyzing
biological data may involve algorithms in artificial intelligence, soft computing,
data mining, image processing, and simulation. The algorithms in turn depend on
theoretical foundations such as discrete mathematics, control theory, system theory,
information theory, and statistics. Commonly used software tools and technologies
in the field include Java, C#, XML, Perl, C, C++, Python, R, SQL, CUDA,
MATLAB, and spreadsheet applications.
The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in 1970
for the study of information processes in biotic systems. The National Center for
Biotechnology Information (NCBI, 2001) defines bioinformatics as follows:
“Bioinformatics is the field of science in which biology, computer science and
information technology merge to form a single discipline. There are three important
sub-disciplines within bioinformatics: the development of new algorithms and
statistics with which to assess relationships among members of large data sets; the
analysis and interpretation of various types of data, including nucleotide and amino
acid sequences, protein domains and protein structures; and the development and
implementation of tools that enable efficient access and management of different
types of information.”
Bioinformatics is a scientific discipline that has emerged in response to the
accelerating demand for flexible and intelligent means of storing, managing and
querying large and complex biological data sets. The ultimate aim of
bioinformatics is to enable the discovery of new biological insights as well as to
create a global perspective from which unifying principles in biology can be
discerned. Over the past decade, rapid developments in bioinformatics technologies
have produced a tremendous amount of information related to molecular biology.
At the beginning of the genomic revolution, the main concern of bioinformatics was
the creation and maintenance of databases to store biological information such as
nucleotide and amino acid sequences. Development of this type of database involved
not only design issues but also the development of interfaces whereby researchers
could both access existing data and submit new or revised data (e.g., to the NCBI,
http://www.ncbi.nlm.nih.gov/). More recently, emphasis has shifted towards the
analysis of large data sets, particularly those stored in different formats in
different databases. Ultimately, all of this information must be combined to form a
comprehensive picture of normal cellular activities so that researchers may study
how these activities are altered in different disease states. Therefore, the field
of bioinformatics has evolved such that the most pressing task now involves the
analysis and interpretation of various types of data, including nucleotide and
amino acid sequences, protein domains and protein structures.
Origin and history of Bioinformatics
Gregor Mendel, the ‘father of genetics’, showed that the inheritance of
traits is controlled by factors passed down from generation to generation. Since
Mendel's discovery, bioinformatics and genetic record keeping have come a long
way. The understanding of genetics has advanced remarkably in the last thirty
years. In 1972, Paul Berg made the first recombinant DNA molecules using ligase.
In 1973, two important things happened in the field of genomics. Stanley Cohen,
Annie Chang, Robert Helling and Herbert Boyer introduced DNA cloning by showing
that extra-chromosomal bits of DNA called plasmids can act as vectors for
maintaining cloned genes in bacteria. The discovery was a major breakthrough for
genetic engineering, allowing for such advances as gene cloning and the
modification of genes. In the same year, Joseph Sambrook led a team that refined
DNA electrophoresis using agarose gel. In 1976, the first genetic engineering
company, Genentech, was founded, and in 1977 a method for sequencing DNA was
developed. By 1981, 579 human genes had been mapped, and mapping by in situ
hybridization had become a standard method. In 1988, the Human Genome
Organisation (HUGO) was founded. This is an international organization of
scientists involved in the Human Genome Project. The Human Genome Project itself
was started in 1990, and by 1991 a total of 1,879 human genes had been mapped. In
1993, Genethon, a human genome research center in France, produced a physical map
of the human genome. In 1995, the first complete genome sequence of a free-living
organism, the bacterium Haemophilus influenzae, was published. In 1996, Genethon
published the final version of the human genetic map, which included data from
patients, preclinical and clinical trials and metabolic pathways of numerous
species.
Challenges:
The greatest challenge facing the molecular biology community today is to
make sense of the wealth of data that has been produced by the genome sequencing
projects. Cells have a central core called the nucleus, which is the storehouse of
the DNA that makes up the genome. Genes are specific regions of the genome
(about 1%) spread throughout it, sometimes contiguous, often non-contiguous.
RNA similarly contains information; its major purpose is to copy information
selectively from DNA and to carry it out of the nucleus for use. Proteins are made
of amino acids, of which there are twenty. The central dogma of molecular biology
is that a gene, a region of the DNA in the nucleus of the cell, is copied into
RNA, and the RNA travels to the protein production sites, where it is translated
into protein.
Difference between bioinformatics and computational biology
Both bioinformatics and computational biology combine computers and biology.
Biologists who specialize in the use of computational tools and systems to answer
problems of biology are bioinformaticians. Computer scientists, mathematicians,
statisticians and engineers who specialize in developing theories, algorithms and
techniques for such tools and systems are computational biologists. The actual
process of analyzing and interpreting data is referred to as computational biology.
Important sub-disciplines within bioinformatics and computational biology include:
Bioinformatics has become a mainstay of genomics, proteomics and all
other omics (such as phenomics), and many information technology companies
have entered the business, which creates a convergence of IT and BT. A
bioinformatician is an expert who not only knows how to use bioinformatics tools
but also knows how to write interfaces for effective use of the tools. A
computational biologist, on the other hand, is a trained individual who only knows
how to use bioinformatics tools without a deeper understanding.
Application of bioinformatics
Bioinformatics, an emerging area offering a fundamental tool to the
scientific community, aims to speed up the research, application and
commercialization of biotechnology. It is the marriage between biotechnology and
information technology that has led to the growth and development of this field.
The genomic revolution has accelerated the central role of bioinformatics in
understanding the very basis of life processes. Over the last decade, biologists
have handled a number of genome research projects that include DNA sequencing,
proteomics, expression studies and metabolomics. The completely sequenced genomes
include human (Homo sapiens), mouse (Mus musculus), insect (Drosophila
melanogaster), plant (Arabidopsis thaliana), yeast (Saccharomyces cerevisiae),
bacteria (Escherichia coli, Vibrio cholerae) and worm (Caenorhabditis elegans),
and their complete sequence information has been stored in various public
databases. Results from genomic studies have now become immensely important in
biological and medical research. Therefore, vast amounts of biological information
need to be stored, organized and indexed so that the information can be retrieved
and used. The term bioinformatics emphasizes the multidisciplinary nature of the
field and also conveys the nature of bioinformatics applications. Bioinformatics is
becoming increasingly important due to the interest of the pharmaceutical industry
in genome sequencing projects. There is a vital need to harness this information
for medical diagnostic and therapeutic uses, and there are opportunities for other
industrial applications. This field is evolving rapidly, which makes it challenging
for biotechnology professionals to keep up with recent advancements. The area has
evolved to deal with four distinct problems, viz.:
(a) Handling and management of biological data, including its organization,
control, linkages, analysis and so forth.
(b) Communication among people, projects and institutions engaged in biological
research and applications.
(c) Organization, access, search and retrieval of biological information, documents
and literature.
(d) Analysis and interpretation of the biological data through the computational
approaches.

Bioinformatics has wide application in the following areas


1. Storage and organization of data
Bioinformatics is used to organize biological data so that researchers can
access information, add new information arising out of experiments and modify
existing information. For example, the Protein Data Bank (PDB)
(http://www.rcsb.org/pdb/) is the single worldwide repository for the processing
and distribution of 3-D biological macromolecular structure data.
2. Information Search and Retrieval
Information search and retrieval is one of the most powerful applications of
bioinformatics. For example, PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) is a
service of the National Library of Medicine. It includes links to many sites
providing full-text articles and other related resources.
3. Sequence Comparison
One of the most useful and popular applications for biologists is
sequence comparison, or sequence alignment. BLAST and FASTA are two online
tools that can perform pair-wise comparison of sequences.
Multiple sequence alignment methods assemble pair-wise sequence
alignments for many related sequences into a picture of sequence homology among
all members of a gene family. Multiple sequence alignments aid in the visual
identification of sites on a DNA or protein sequence that may be functionally
important. Those sites are usually conserved.
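
The idea of pair-wise comparison can be illustrated with a short sketch. The code
below computes the score of the best global alignment of two sequences by dynamic
programming (the Needleman-Wunsch method). It is not how BLAST or FASTA work
internally, since both rely on faster heuristics; the function name and the scoring
values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not recommended settings.
    """
    # prev[j] = best score for aligning the prefix of a processed so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b: all gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned with a gap
            left = curr[j - 1] + gap  # b[j-1] aligned with a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "ACT"))
```

Only two rows of the scoring table are kept in memory, which is enough when the
score alone is needed; recovering the alignment itself would require keeping the
full table and tracing back through it.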
4. Linkage Analysis
Genealogical research and linkage analysis involve the analysis of a large
amount of data. The chromosomal location of genes, which has important
implications for disease identification, can be identified using linkage analysis.
Various online tools (http://linkage.rockefeller.edu/) are used for linkage analysis.
5. Comparative Genomics
Comparative genomics rests on the assumption that similarity between two
sequences, whether DNA, RNA or protein, implies functional correlation. One of the
most successful bioinformatics applications is sequence alignment against large
databases of known sequences using online BLAST tools.
6. Functional Genomics
To investigate genes in their cellular context, expression analysis is
carried out using microarrays and DNA chips. The comparison of expression patterns
of well-defined metabolic states allows the identification of pathological
phenotypes at the molecular level.
7. Proteomics
Proteomics refers to the identification and analysis of all proteins of a
cell. It involves the determination of protein interactions and biological pathways.
The publication of entire genome sequences led to a shift of interest from DNA
sequencing to protein localization and characterization within their cellular
context.
8. Structural Genomics
Structural genomics covers the calculation of three-dimensional structures
based on the sequence of a macromolecule. The theoretical basis of the relationship
between sequence and structure is the most fundamental problem of in silico
biology. Only knowledge of the structure of a protein can provide a deeper
understanding of its function.
9. Pharmacogenomics
The development of drugs aims to maximize effect and minimize side
effects. The genetic variation among humans amounts to only about 0.1% of the
total DNA. Much of this variation consists of point mutations, some of which have
a phenotypic impact. These so-called SNPs are good candidates for drug development
and diagnosis.
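
As a minimal sketch of what a point mutation looks like in sequence data, the code
below lists the positions at which a sample sequence differs from a reference of
the same length. The function name and the example sequences are hypothetical;
real SNP discovery works on many individuals and on sequencing reads, which this
sketch does not attempt.

```python
def point_differences(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Both sequences are assumed to be already aligned
    and of equal length (an assumption of this sketch).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


if __name__ == "__main__":
    # Hypothetical sequences, for illustration only
    print(point_differences("ACGTACGT", "ACGTTCGT"))
```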
10. Cellomics or Systems Biology:
If sufficient data are available and all relevant components for life are
identified, more complex interactions can be investigated. For a holistic biological
understanding of the cell, simulations of cells, entire organisms and populations
provide new insights. The simulation of life in silico is a future direction for
bioinformatics that has already begun.
11. Phylogenetic Analysis:
Phylogenetic analysis attempts to describe the evolutionary relationships of
a group of sequences. The information in a molecular sequence alignment can be
used to compute a phylogenetic tree for a particular family of gene sequences. The
branching in a phylogenetic tree represents evolutionary relationships inferred
from sequence similarity. Phylogenetic analysis of protein sequence families also
tells us about the evolution of entire organisms.
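
Many tree-building methods start from pair-wise distances computed on an alignment.
The sketch below computes the simplest such measure, the proportion of differing
sites (p-distance), for every pair of aligned sequences. The function names and the
example sequences are hypothetical, and gaps are treated like any other character,
which is a simplifying assumption.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_table(aligned):
    """Map each pair of sequence names to its p-distance."""
    return {
        (name1, name2): p_distance(seq1, seq2)
        for (name1, seq1), (name2, seq2) in combinations(aligned.items(), 2)
    }


if __name__ == "__main__":
    # Hypothetical aligned sequences, for illustration only
    alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACGA"}
    for pair, distance in distance_table(alignment).items():
        print(pair, distance)
```

A tree-building method would then group the sequences with the smallest distances
first; that clustering step is not shown here.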
12. Primer Design
Many molecular biology protocols require the design of oligonucleotide
primers. Proper primer design is critical for the success of polymerase chain
reaction (PCR), oligo-hybridization, DNA sequencing and microarray experiments.
Primers must hybridize with the target DNA and, in addition, must have the
following qualities: appropriate physico-chemical properties; they must not
self-hybridize or dimerize; and they should not have multiple targets within the
sequence under investigation.
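
The qualities listed above can be checked with a few simple functions. The sketch
below estimates GC content and melting temperature (using the Wallace rule,
Tm = 2(A+T) + 4(G+C), which is only a rough guide for short oligonucleotides),
looks for self-complementary stretches, and counts exact binding sites on a
template. All function names, the window length of 4 and the exact-match criterion
are assumptions made for illustration; dedicated primer design programs use more
detailed thermodynamic models.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(primer):
    """Fraction of bases that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def has_self_complementary_run(primer, length=4):
    """True if a window of `length` bases can pair with another part of the primer."""
    primer = primer.upper()
    rc = reverse_complement(primer)
    return any(
        primer[i:i + length] in rc for i in range(len(primer) - length + 1)
    )


def count_occurrences(pattern, text):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def count_binding_sites(primer, template):
    """Exact matches of the primer on either strand of the template."""
    primer, template = primer.upper(), template.upper()
    return (count_occurrences(primer, template)
            + count_occurrences(primer, reverse_complement(template)))
```

A primer with more than one binding site on the template, or with a
self-complementary run, would be a candidate for rejection under the criteria
above.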


13. Constructing Evolutionary (Phylogenetic) Trees:


Biodiversity databases are used to collect species names, descriptions,
distributions, genetic information, status and size of populations, habitat needs
and how each organism interacts with other species. Computer simulation models
are useful for studying population dynamics or calculating the cumulative genetic
health of a breeding pool (in agriculture) or an endangered population (in
conservation). Entire DNA sequences or genomes of endangered species can be
preserved, allowing the results of nature's experiments to be remembered in silico.
There are two areas in biology where enormous amounts of information are
generated. One is molecular biology, which deals with base sequences in DNA
and amino acid sequences in proteins, and the other is the biodiversity information
crisis. Mathematics and computers are being used to tackle these problems with
procedures that come under the label of bioinformatics.
These trees are often constructed after comparing sequences belonging to
different organisms. Trees group the sequences according to their degree of
similarity. They serve as a guide to reasoning about how these sequences have been
transformed through evolution. For example, they infer homology from similarity,
and may rule out erroneous assumptions that contradict known evolutionary
processes.
14. Detecting Patterns in Sequences:
There are certain parts of DNA and amino acid sequences that need to be
detected. Two prime examples are the search for genes in DNA and the
determining of subcomponents of a sequence of amino acids (secondary structure).
There are several ways to perform these tasks. Many of them are based on machine
learning and include probabilistic grammars, or neural networks.
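
Before turning to machine learning, the simplest pattern search for genes can be
sketched with a regular expression. The code below reports open reading frames: a
start codon (ATG) followed by whole codons up to the first in-frame stop codon. It
is a rule-based baseline, not one of the probabilistic methods mentioned above; it
scans the forward strand only, and the function name and the `min_codons`
threshold are assumptions for illustration.

```python
import re

# Lookahead lets overlapping candidates in different reading frames be reported.
# The non-greedy codon group stops at the first in-frame stop codon.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")


def find_orfs(dna, min_codons=2):
    """Return (start_index, sequence) for each ORF on the forward strand.

    start_index is 0-based; min_codons counts the start and stop codons too.
    """
    dna = dna.upper()
    orfs = []
    for match in ORF_PATTERN.finditer(dna):
        orf = match.group(1)
        if len(orf) // 3 >= min_codons:
            orfs.append((match.start(), orf))
    return orfs


if __name__ == "__main__":
    # Hypothetical sequence, for illustration only
    print(find_orfs("CCATGAAATAGCC"))
```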
15. Determining 3-D Structures from Sequences:
The problems in bioinformatics that relate sequences to 3D structures are
computationally difficult. The determination of RNA shape from sequences
requires algorithms of cubic complexity. The inference of shapes of proteins from
amino acid sequences remains an unsolved problem.
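
The cubic complexity mentioned above can be seen in a classic textbook method, the
Nussinov algorithm, which finds the maximum number of nested base pairs in an RNA
sequence. The sketch below is a simplification: practical RNA folding programs
minimize free energy rather than count pairs. The function name, the set of
allowed pairs (including G-U wobble pairs) and the minimum loop length of 3 are
assumptions for illustration.

```python
# Watson-Crick pairs plus G-U wobble pairs (an assumption of this sketch)
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style dynamic programming).

    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):       # distance between i and j
        for i in range(n - span):             # left end of the interval
            j = i + span
            score = best[i][j - 1]            # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

The three nested loops over `span`, `i` and `k` are what give the method its
running time proportional to the cube of the sequence length.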

16. Inferring Cell Regulation:


The function of a gene or protein is best described by its role in a metabolic
or signaling pathway. Genes interact with each other; proteins can also prevent or
assist in the production of other proteins. The available approximate models of cell
regulation can be either discrete or continuous. One usually distinguishes between
cell simulation and modeling. The latter amounts to inferring the former from
experimental data (say microarrays). This process is usually called reverse
engineering.
17. Determining Protein Function and Metabolic Pathways:
This is one of the most challenging areas of bioinformatics, and one for
which little data is readily available. The objective here is to interpret
human annotations for protein function and also to develop databases representing
graphs that can be queried for the existence of nodes (specifying reactions) and
paths (specifying sequences of reactions).
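
A graph of this kind can be sketched as a dictionary in which each reaction points
to the reactions that can follow it. The code below answers the two kinds of query
described above: whether a node exists, and whether a path of reactions connects
two nodes. The reaction labels R1 to R5 and the function name are hypothetical and
do not refer to any real pathway database.

```python
from collections import deque

# Hypothetical toy pathway: each reaction lists the reactions that can follow it
PATHWAY = {
    "R1": ["R2", "R3"],
    "R2": ["R4"],
    "R3": ["R4"],
    "R4": ["R5"],
    "R5": [],
}


def find_path(graph, start, goal):
    """Return one shortest sequence of reactions from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for following in graph[node]:
            if following not in seen:
                seen.add(following)
                queue.append(path + [following])
    return None


if __name__ == "__main__":
    print("R3" in PATHWAY)                  # node query
    print(find_path(PATHWAY, "R1", "R5"))   # path query
```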
18. Assembling DNA Fragments:
Fragments provided by sequencing machines are assembled using
computers. The tricky part of that assembly is that DNA has many repetitive
regions and the same fragment may belong to different regions. The algorithms for
assembling DNA are mostly used by large companies (like the former Celera).
Using script languages. Many of the above applications are already available on
websites. Their usage requires scripting that provides data to an application,
receives the results back, and then analyzes them. The algorithms required to
perform the above tasks are detailed in the following subsections. What
differentiates bioinformatics problems from others is the huge size of the data and
its (sometimes questionable) quality. That explains the need for approximate
solutions. It should be remarked that several of the problems in bioinformatics are
constrained optimization problems. The solution to those problems is usually
computationally expensive. One of the most efficient known methods in optimization
is dynamic programming, which explains why this technique is often used in
bioinformatics. Other approaches like branch-and-bound are also used, but they are
known to have higher complexity.
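
For the fragment assembly step described in this section, the basic operation is to
find how far the end of one fragment overlaps the start of another and to merge the
two. The sketch below shows that single step only; the function names, the example
fragments and the minimum overlap of 3 bases are assumptions for illustration.
Exact matching like this cannot tell repeated regions apart, which is the
difficulty noted above, and real assemblers must also tolerate sequencing errors.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_length bases exists.
    """
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b, min_length=3):
    """Merge b onto the end of a using their overlap, or return None."""
    k = overlap(a, b, min_length)
    return a + b[k:] if k else None


if __name__ == "__main__":
    # Hypothetical fragments, for illustration only
    print(merge("ACGTTGCA", "TGCAGGT"))
```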


19. Drug designing:


It has applications in knowledge-based drug design. Computational studies
of protein–ligand interactions provide a rational basis for the rapid identification
of novel leads for synthetic drugs. Knowledge of the three-dimensional structures of
proteins allows molecules to be designed that are capable of binding to the receptor
site of a target protein with great affinity and specificity. This informatics-based
approach significantly reduces the time and cost necessary to develop drugs with
higher potency, fewer side effects, and less toxicity than the traditional
trial-and-error approach. In the last two decades, tens of thousands of protein
three-dimensional structures have been determined by X-ray crystallography and
protein nuclear magnetic resonance spectroscopy (protein NMR). One central question
for biological scientists is whether it is practical to predict possible
protein–protein interactions based only on these 3D shapes, without doing
protein–protein interaction experiments. A variety of methods have been developed
to tackle the protein–protein docking problem, though it seems that there is still
much work to be done in this field. We are interested in information about our DNA,
proteins and the function of proteins. Genes and proteins can be sequenced, so the
sequence of bases in genes or amino acids in proteins can be determined. This
information must be stored in an intelligent fashion, so that scientists can solve
problems quickly and easily using all available information. Therefore, the
information is stored in databanks, many of which are accessible to everyone on the
internet. A few examples are a databank containing three-dimensional protein
structures (the PDB, or Protein Data Bank), a databank containing protein sequences
and their function (Swiss-Prot), a databank with information about enzymes and
their function (ENZYME), and a databank with nucleotide sequences of all genes
sequenced to date (EMBL).
20. Human health care and Forensic science
In forensics, results from molecular phylogenetic analysis have been
accepted as evidence in criminal courts. Sophisticated Bayesian statistics and
likelihood-based methods for the analysis of DNA have been applied in the analysis
of forensic identity. It is worth mentioning that genomics and bioinformatics are
now poised to revolutionize our healthcare system by developing personalized and
customized medicine. High-speed genomic sequencing coupled with
sophisticated informatics technology will allow a doctor in a clinic to quickly
sequence a patient’s genome, easily detect potentially harmful mutations, and
engage in early diagnosis and effective treatment of diseases.
21. Agriculture:
Bioinformatics tools are being used in agriculture as well. Plant genome
databases and gene expression profile analyses have played an important role in the
development of new crop varieties that have higher productivity and more
resistance to disease.

Bioinformatics in India
According to a recent study, India will be a potential star in the field of
bioscience in the coming years, considering factors such as biodiversity,
human resources, infrastructure facilities and government initiatives.
Bioinformatics has emerged out of inputs from several different
areas such as biology, biochemistry, biophysics, molecular biology, biostatistics
and computer science. Specially designed algorithms and organized databases are the
core of all informatics operations. The requirements for such activity make heavy,
high-level demands on both hardware and software capabilities. This sector is
the fastest-growing field in the country. The vertical growth is because of the
linkage between IT and biotechnology, spurred by the Human Genome Project.
Promising start-ups are already there in Bangalore, Hyderabad, Pune, Chennai and
Delhi. There are over 200 companies functioning in these places. IT majors such as
Intel, IBM and Wipro are getting into this segment, spurred by the promise of
technological developments.

Limitations
Having recognized the power of bioinformatics, it is also important to
realize its limitations and avoid over-reliance on and over-expectation of
bioinformatics output. In fact, bioinformatics has a number of inherent limitations.
In many ways, the role of bioinformatics in genomics and molecular biology
research can be likened to the role of intelligence gathering in battlefields.
Intelligence is clearly very important in leading to victory in a battlefield. Fighting
a battle without intelligence is inefficient and dangerous. Having superior
information and correct intelligence helps to identify the enemy’s weaknesses and
reveal the enemy’s strategy and intentions. The gathered information can then be
used in directing the forces to engage the enemy and win the battle. However,
completely relying on intelligence can also be dangerous if the intelligence is of
limited accuracy. Overreliance on poor-quality intelligence can yield costly
mistakes if not complete failures. It is no stretch in analogy that fighting diseases or
other biological problems using bioinformatics is like fighting battles with
intelligence. Bioinformatics and experimental biology are independent, but
complementary, activities. Bioinformatics depends on experimental science to
produce raw data for analysis. It, in turn, provides useful interpretation of
experimental data and important leads for further experimental research.
Bioinformatics predictions are not formal proofs of any concepts. They do not
replace the traditional experimental research methods of actually testing
hypotheses. In addition, the quality of bioinformatics predictions depends on the
quality of data and the sophistication of the algorithms being used. Sequence data
from high throughput analysis often contain errors. If the sequences are wrong or
annotations incorrect, the results from the downstream analysis are misleading as
well. That is why it is so important to maintain a realistic perspective of the role of
bioinformatics.


Chapter 3
Databases and its structure
One of the hallmarks of modern genomic research is the generation of
enormous amounts of raw sequence data. As the volume of genomic data grows,
sophisticated computational methodologies are required to manage the data deluge.
Thus, the very first challenge in the genomics era is to store and handle the
staggering volume of information through the establishment and use of computer
databases. The development of databases to handle the vast amount of molecular
biological data is thus a fundamental task of bioinformatics. This chapter
introduces some basic concepts related to development and management of
databases. Biological databases are libraries of life sciences information, collected
from scientific experiments, published literature, high-throughput experiment
technology, and computational analyses. They contain information from research
areas including genomics, proteomics, metabolomics, microarray gene expression,
and phylogenetics. Information contained in biological databases includes gene
function, structure, localization (both cellular and chromosomal), clinical effects of
mutations as well as similarities of biological sequences and structures.
What is a database?
A database is a computerized archive used to store and organize data in
such a way that information can be retrieved easily via a variety of search criteria.
Databases are composed of computer hardware and software for data management.
The chief objective of the development of a database is to organize data in a set of
structured records to enable easy retrieval of information. Each record, also called
an entry, should contain a number of fields that hold the actual data items, for
example, fields for names, phone numbers, addresses, dates. To retrieve a particular
record from the database, a user can specify a particular piece of information,
called value, to be found in a particular field and expect the computer to retrieve
the whole data record. This process is called making a query. Although data
retrieval is the main purpose of all databases, biological databases often have a
higher level of requirement, known as knowledge discovery, which refers to the
identification of connections between pieces of information that were not known
when the information was first entered. For example, databases containing raw
sequence information can perform extra computational tasks to identify sequence
homology or conserved motifs. These features facilitate the discovery of new
biological insights from raw data.
Organization of databases:
Databases can be constructed either as flat files, relational, or object
oriented. Flat files are simple text files and lack any form of organization to
facilitate information retrieval by computers. Relational databases organize data as
tables and search information among tables with shared features. Object-oriented
databases organize data as objects and associate the objects according to
hierarchical relationships.
(a) Flat file database:
Originally, databases all used a flat file format, which is a long text file that
contains many entries separated by a delimiter, a special character such as a vertical
bar (|). Within each entry are a number of fields separated by tabs or commas.
Except for the raw values in each field, the entire text file does not contain any
hidden instructions for computers to search for specific information or to create
reports based on certain fields from each record. The text file can be considered a
single table. Thus, to search a flat file for a particular piece of information, a
computer has to read through the entire file, an obviously inefficient process. This
is manageable for a small database, but as database size increases or data types
become more complex, this database style can become very difficult for
information retrieval. Indeed, searches through such files often cause crashes of the
entire computer system because of the memory-intensive nature of the operation.
To facilitate the access and retrieval of data, sophisticated computer software
programs for organizing, searching, and accessing data have been developed. They
are called database management systems. These systems contain not only raw data
records but also operational instructions to help identify hidden connections among
data records. The purpose of establishing a data structure is for easy execution of
the searches and to combine different records to form final search reports.
Depending on the types of data structures, these database management systems can
be classified into two types: relational database management systems and
object-oriented database management systems. Consequently, databases employing
these management systems are known as relational databases or object-oriented
databases, respectively.

Name, State, Course#, Course name |
Dhawale Rahmi, Maharashtra, PPT-301, Plant pathology |
Gite Vikram Balaji, Bihar, ABT-510, Bioinformatics |
Nihar Ranjan, Odisha, ST-512, Statistics |
Hinge Shyam A, Rajasthan, PBG-612, Plant breeding |
Thirat Suital Bansi, Maharashtra, ABT-517, Microbiology |
Surve Ratnapal, Jharkhand, PP-621, Plant physiology |
Kalapad santosh, Kerala, ENT-614, Entomology |
Kumara Swamy, Tamil Nadu, AC-524, Soil chemistry |

Figure: Example of constructing a flat file database for eight students’ course
information
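
The cost of searching a flat file can be made concrete with a short sketch. The
code below stores the eight records as one long text string, with entries separated
by a vertical bar and fields by commas, and answers a query by reading every entry
in turn. The field names and function names are assumptions for illustration; the
course name of the last record is taken from the relational version of the same
data.

```python
FLAT_FILE = (
    "Dhawale Rahmi, Maharashtra, PPT-301, Plant pathology |"
    "Gite Vikram Balaji, Bihar, ABT-510, Bioinformatics |"
    "Nihar Ranjan, Odisha, ST-512, Statistics |"
    "Hinge Shyam A, Rajasthan, PBG-612, Plant breeding |"
    "Thirat Suital Bansi, Maharashtra, ABT-517, Microbiology |"
    "Surve Ratnapal, Jharkhand, PP-621, Plant physiology |"
    "Kalapad santosh, Kerala, ENT-614, Entomology |"
    "Kumara Swamy, Tamil Nadu, AC-524, Soil chemistry |"
)

# Hypothetical field names for the four comma-separated values of each entry
FIELDS = ("name", "state", "course_no", "course_name")


def parse_flat_file(text):
    """Split the text into entries (on '|') and each entry into fields (on ',')."""
    records = []
    for entry in text.split("|"):
        values = [value.strip() for value in entry.split(",")]
        if len(values) == len(FIELDS):  # skips the empty piece after the last '|'
            records.append(dict(zip(FIELDS, values)))
    return records


def query(records, field, value):
    """Linear scan: every record is inspected, whatever the size of the file."""
    return [record for record in records if record[field] == value]


if __name__ == "__main__":
    students = parse_flat_file(FLAT_FILE)
    for record in query(students, "state", "Maharashtra"):
        print(record["name"], "-", record["course_name"])
```

Because nothing in the file says where the records for a given state are, the only
option is to read all of them, which is the inefficiency described above.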

(b) Relational Databases


Instead of using a single table as in a flat file database, relational databases
use a set of tables to organize data. Each table, also called a relation, is made up of
columns and rows. Columns represent individual fields. Rows represent values in
the fields of records. The columns in a table are indexed according to a common
feature called an attribute, so they can be cross-referenced in other tables. To
execute a query in a relational database, the system selects linked data items from
different tables and combines the information into one report. Therefore, specific
information can be found more quickly from a relational database than from a flat
file database. Relational databases can be created using a special programming
language called structured query language (SQL). The creation of this type of
database can take a great deal of planning during the design phase. After creation
of the original database, a new data category can be easily added without requiring
all existing tables to be modified. The subsequent database searching and data
gathering for reports are relatively straightforward. The flat file example above
expresses student course information as records of eight students from seven
different states, each taking a different course. Each data record,
separated by a vertical bar, contains four fields describing the name, state, course
number and title. A relational database is also created to store the same
information, in which the data are structured as a number of tables. In each table,
data that fit a particular criterion are grouped together. Different tables can be
linked by common data categories, which facilitates finding specific information.
Relational database

Table A
Student No#   Name                  State
1             Dhawale Rahmi         Maharashtra
2             Gite Vikram Balaji    Bihar
3             Nihar Ranjan          Odisha
4             Hinge Shyam A         Rajasthan
5             Thirat Suital Bansi   Maharashtra
6             Surve Ratnapal        Jharkhand
7             Kalapad santosh       Kerala
8             Kumara Swamy          Tamil Nadu

Table B
Student No#   Course#
1             PPT-301
2             ABT-510
3             ST-512
4             PBG-612
5             ABT-517
6             PP-621
7             ENT-614
8             AC-524

Table C
Course#    Course name
PPT-301    Plant pathology
ABT-510    Bioinformatics
ST-512     Statistics
PBG-612    Plant breeding
ABT-517    Microbiology
PP-621     Plant physiology
ENT-614    Entomology
AC-524     Soil chemistry

Figure: Example of constructing a relational database for eight students’ course
information originally expressed in a flat file. By creating three different tables
linked by common fields, data can be easily accessed and reassembled.

For example, suppose one asks: which courses are students from the state
‘Maharashtra’ taking? The database first finds the “State” field in Table A and
looks up ‘Maharashtra’. This returns students 1 and 5. The student numbers are
also listed in Table B, in which students 1 and 5 correspond to PPT-301 and
ABT-517, respectively. The course names are listed by course number in Table C,
so the exact course names corresponding to these course numbers can be retrieved
there. A final report is then given showing that the students from ‘Maharashtra’
are taking the courses ‘Plant pathology’ and ‘Microbiology’. However, executing
the same query on the flat file requires the computer to read through the entire
text file word by word, store the information in a temporary memory space, and
then mark the data records containing the word ‘Maharashtra’. This is easily
accomplished for a small database. To perform
queries in a large database using flat files obviously becomes an enormous task for the
computer system.
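The lookup described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (student, enrolment, course) and column names are assumptions chosen for this example, while the data are taken from Tables A, B and C in the figure.

```python
import sqlite3

# Tables A, B and C from the figure; table and column names are illustrative.
students = [
    (1, "Dhawale Rahmi", "Maharashtra"), (2, "Gite Vikram Balaji", "Bihar"),
    (3, "Nihar Ranjan", "Odisha"), (4, "Hinge Shyam A", "Rajasthan"),
    (5, "Thirat Suital Bansi", "Maharashtra"), (6, "Surve Ratnapal", "Jharkhand"),
    (7, "Kalapad santosh", "Kerala"), (8, "Kumara Swamy", "Tamil Nadu"),
]
enrolments = [
    (1, "PPT-301"), (2, "ABT-510"), (3, "ST-512"), (4, "PBG-612"),
    (5, "ABT-517"), (6, "PP-621"), (7, "ENT-614"), (8, "AC-524"),
]
courses = [
    ("PPT-301", "Plant pathology"), ("ABT-510", "Bioinformatics"),
    ("ST-512", "Statistics"), ("PBG-612", "Plant breeding"),
    ("ABT-517", "Microbiology"), ("PP-621", "Plant physiology"),
    ("ENT-614", "Entomology"), ("AC-524", "Soil chemistry"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (student_no INTEGER PRIMARY KEY, name TEXT, state TEXT)")
conn.execute("CREATE TABLE enrolment (student_no INTEGER, course_no TEXT)")
conn.execute("CREATE TABLE course (course_no TEXT PRIMARY KEY, course_name TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", students)
conn.executemany("INSERT INTO enrolment VALUES (?, ?)", enrolments)
conn.executemany("INSERT INTO course VALUES (?, ?)", courses)


def courses_for_state(state):
    """Follow Table A -> Table B -> Table C for all students of one state."""
    rows = conn.execute(
        """
        SELECT c.course_name
        FROM student AS s
        JOIN enrolment AS e ON e.student_no = s.student_no
        JOIN course AS c ON c.course_no = e.course_no
        WHERE s.state = ?
        ORDER BY s.student_no
        """,
        (state,),
    ).fetchall()
    return [row[0] for row in rows]


print(courses_for_state("Maharashtra"))  # ['Plant pathology', 'Microbiology']
```

The two JOIN clauses correspond to the two links between tables: the student number shared by Tables A and B, and the course number shared by Tables B and C.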
Object-Oriented Databases
One of the problems with relational databases is that the tables used do not
describe complex hierarchical relationships between data items. To overcome the
problem, object-oriented databases have been developed that store data as objects.
In an object-oriented programming language, an object can be considered as a unit
that combines data and mathematical routines that act on the data. The database is
structured such that the objects are linked by a set of pointers defining
predetermined relationships between the objects. Searching the database involves
navigating through the objects with the aid of the pointers linking different objects.
Programming languages like C++ are used to create object-oriented databases. The
object-oriented database system is more flexible; data can be structured based on
hierarchical relationships. By doing so, programming tasks can be simplified for
data that are known to have complex relationships, such as protein structure data.
However, this type of database system lacks the rigorous mathematical
foundation of relational databases. There is also a risk that some of the
relationships between objects may be misrepresented. Some current databases have
therefore incorporated features of both types of database programming, creating
the object–relational database management system. The above students’ course
information can be used to construct an object-oriented database. Three different
objects can be designed: a student object, a course object, and a state object.
Their interrelations are indicated by pointers, drawn as lines with arrows (for
simplicity, some of the pointers are omitted). To answer the same question (which
courses are students from ‘Maharashtra’ taking?), one simply needs to start from
‘Maharashtra’ in the state object, which has pointers that lead to students 1 and
5 in the student object. Further pointers in the student object point to the
course each of the two students is taking. Finding specific information thus
relies on navigating through the linked objects by way of pointers, and this
simple navigation provides the final report.
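The navigation through linked objects can be sketched in Python, where an attribute holding a reference to another object plays the role of a pointer. The class and attribute names below are assumptions made for this illustration, and only three of the eight students are included to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Course:
    number: str
    name: str


@dataclass
class Student:
    number: int
    name: str
    course: Course  # pointer from the student object to a course object


@dataclass
class State:
    name: str
    students: list = field(default_factory=list)  # pointers to student objects


plant_pathology = Course("PPT-301", "Plant pathology")
bioinformatics = Course("ABT-510", "Bioinformatics")
microbiology = Course("ABT-517", "Microbiology")

maharashtra = State("Maharashtra")
bihar = State("Bihar")
maharashtra.students.append(Student(1, "Dhawale Rahmi", plant_pathology))
maharashtra.students.append(Student(5, "Thirat Suital Bansi", microbiology))
bihar.students.append(Student(2, "Gite Vikram Balaji", bioinformatics))


def courses_taken_in(state):
    """Start at a state object and follow its pointers: state -> student -> course."""
    return [student.course.name for student in state.students]


print(courses_taken_in(maharashtra))  # ['Plant pathology', 'Microbiology']
```

No table lookup is involved here: the answer is reached purely by following references from one object to the next.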

Chapter 4
Biological Databases
Based on their content, biological databases are divided into primary,
secondary, and specialized databases. Primary databases simply archive sequence
or structure information; secondary databases include further analysis of the
sequences or structures. Specialized databases cater to a particular research interest.
Current biological databases use all three types of database structures: flat
files, relational, and object-oriented. Despite the obvious drawbacks of using flat
files in database management, many biological databases still use this format. The
justification for this is that this system involves a minimum amount of database
design and the search output can be easily understood by working biologists.
(I) Primary Databases
There are three major public sequence databases that store raw nucleic acid
sequence data produced and submitted by researchers worldwide: GenBank, the
European Molecular Biology Laboratory (EMBL) database, and the DNA Data Bank of
Japan (DDBJ), all of which are freely available on the Internet. Most of the data in the
databases are contributed directly by authors with a minimal level of annotation. A
small number of sequences, especially those published in the 1980s, were entered
manually from published literature by database management staff. Presently,
sequence submission to GenBank, EMBL, or DDBJ is a precondition for
publication in most scientific journals, to ensure that the fundamental molecular data
are made freely available. These three public databases closely collaborate and
exchange new data daily. Together they constitute the International Nucleotide
Sequence Database Collaboration. This means that by connecting to any one of the
three databases, one should have access to the same nucleotide sequence data.
Although the three databases all contain the same sets of raw data, each uses a
slightly different format to represent the data.
Fortunately, for the three-dimensional structures of biological macromolecules,
there is only one centralized database, the PDB. This database archives atomic
coordinates of macromolecules (both proteins and nucleic acids) determined by
x-ray crystallography and NMR. It uses a flat file format to represent protein name,
authors, experimental details, secondary structure, cofactors, and atomic
coordinates. The web interface of PDB also provides viewing tools for simple
image manipulation.
(a) GenBank
GenBank is the most complete collection of annotated nucleic acid
sequence data for almost every organism. The content includes genomic DNA,
mRNA, cDNA, ESTs, high throughput raw sequence data, and sequence
polymorphisms. There are two ways to search for sequences in GenBank. One is
using text-based keywords similar to a PubMed search. The other is using
molecular sequences to search by sequence similarity using BLAST.
GenBank Sequence Format
To search GenBank effectively using the text-based method requires an
understanding of the GenBank sequence format. GenBank is a relational database.
However, the search output for sequence files is produced as flat files for easy
reading. The resulting flat files contain three sections – Header, Features, and
Sequence entry. There are many fields in the Header and Features sections. Each
field has a unique identifier for easy indexing by computer software. Understanding
the structure of the GenBank files helps in designing effective search strategies.
Header section:
The “Header section” describes the origin of the sequence, identification of
the organism, and unique identifiers associated with the record. The top line of the
Header section is the Locus, which contains a unique database identifier for a
sequence location in the database (not a chromosome locus).
The identifier is followed by sequence length and molecule type (e.g., DNA
or RNA). This is followed by a three-letter code for GenBank divisions. There are
17 divisions in total, which were set up simply based on convenience of data
storage without necessarily having a rigorous scientific basis; for example, PLN for
plant, fungal, and algal sequences; PRI for primate sequences; MAM for
non-primate mammalian sequences; BCT for bacterial sequences; and EST for EST
sequences. Next to the division is the date of the most recent modification of the
record (which can differ from the date when the data were submitted). The following line,
“DEFINITION,” provides the summary information for the sequence record
including the name of the sequence, the name and taxonomy of the source
organism if known, and whether the sequence is complete or partial. This is
followed by an accession number for the sequence, which is a unique identifier
assigned to a piece of DNA when it was first submitted to GenBank and is
permanently associated with that sequence. This is the number that should be cited
in publications. It has two common formats: one letter followed by five digits (as in
M69051) or two letters followed by six digits. For a nucleotide sequence that has been
translated into a protein sequence, a separate accession number (the protein_id) is
given in the form of a string of alphanumeric characters. In addition to the accession
number, there is also a version number and a GenInfo Identifier (gi) number. The
purpose of these numbers is to identify the current version of the sequence. If the
sequence is revised at a later date, the accession number remains the same, but the
version number is incremented and a new gi number is assigned. A translated protein
sequence also has a different gi number from
the DNA sequence it is derived from.
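As a rough illustration of how software can pick these identifiers out of a record, the sketch below splits the LOCUS and VERSION lines of the glucokinase record M69051 into their parts. It assumes a nucleotide record whose LOCUS line has exactly the eight whitespace-separated items seen in that example; the function names and dictionary keys are invented for this illustration and are not part of any GenBank tool.

```python
def parse_locus(line):
    """Split a nucleotide LOCUS line into its whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    return {
        "name": fields[1],         # locus name
        "length": int(fields[2]),  # sequence length in base pairs
        "molecule": fields[4],     # molecule type, e.g. DNA or mRNA
        "topology": fields[5],     # linear or circular
        "division": fields[6],     # three-letter GenBank division code
        "date": fields[7],         # date shown on the LOCUS line
    }


def parse_version(line):
    """Return (accession, version number, gi number or None) from a VERSION line."""
    fields = line.split()
    accession, _, version = fields[1].partition(".")
    gi = None
    for item in fields[2:]:
        if item.startswith("GI:"):
            gi = int(item[3:])
    return accession, int(version), gi


locus = parse_locus("LOCUS       HUMGKNASE   2550 bp   mRNA   linear   PRI 29-SEP-1995")
print(locus["division"], locus["length"])
print(parse_version("VERSION     M69051.1  GI:183226"))
```

Splitting on whitespace is enough for this example; a parser meant for general use would have to handle other LOCUS layouts, such as protein records that report length in "aa".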
The next line in the Header section is the “ORGANISM” field, which
includes the source of the organism with the scientific name of the species and
sometimes the tissue type. Along with the scientific name is the information of
taxonomic classification of the organism. Different levels of the classification are
hyperlinked to the NCBI taxonomy database with more detailed descriptions. This
is followed by the “REFERENCE” field, which provides the publication citation
related to the sequence entry. The REFERENCE part includes author and title
information of the published work (or tentative title for unpublished work). The
“JOURNAL” field includes the citation information as well as the date of sequence
submission. The citation is often hyperlinked to the PubMed record for access to
the original literature information. The last part of the Header is the contact
information of the sequence submitter.
Human liver glucokinase (ATP:D-hexose 6-phosphotransferase) mRNA, complete cds

GenBank: M69051.1

LOCUS       HUMGKNASE               2550 bp    mRNA    linear   PRI 29-SEP-1995
DEFINITION  Human liver glucokinase (ATP:D-hexose 6-phosphotransferase) mRNA,
            complete cds.
ACCESSION   M69051
VERSION     M69051.1  GI:183226
KEYWORDS    ATP:D-hexose 6-phosphotransferase; glucokinase.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 2550)
  AUTHORS   Tanizawa,Y., Koranyi,L.I., Welling,C.M. and Permutt,M.A.
  TITLE     Human liver glucokinase gene: cloning and sequence determination of
            two alternatively spliced cDNAs
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 88 (16), 7294-7297 (1991)
   PUBMED   1871135
COMMENT     Original source text: Homo sapiens male adult liver cDNA to mRNA.
FEATURES             Location/Qualifiers
     source          1..2550
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
     gene            1..2550
                     /gene="GLK"
     misc_feature    156..1680
                     /gene="GLK"
     exon            204..327
                     /gene="GLK"
                     /product="glucokinase"
                     /note="ATP:D-hexose 6-phosphotransferase; cassette exon"
                     /EC_number="2.7.1.1"
     CDS             286..1680
                     /gene="GLK"
                     /EC_number="2.7.1.1"
                     /product="glucokinase"
                     /protein_id="AAB59563.1"
                     /db_xref="GI:183227"
                     /translation="MPRPRSQLPQPNSQVEQILAEFQLQEEDLKKVMRRMQKEMDRGL
                     RLETHEEASVKMLPTYVRSTPEGSEVGDFLSLDLGGTNFRVMLVKVGEGEEGQWSVKT
                     ……………….
                     RESRSEDVMRITVGVDGSVYKLHPSFKERFHASVRRLTPSCEITFIESEEGSGRGAAL
                     VSAVACKKACMLGQ"
     variation       602
                     /gene="GLK"
                     /note="'t' is common in the population; 'c' or 't'"
                     /replace="t"
ORIGIN
        1 aagccctggg ctgccagcct caggcagctc tccatccaag cagccgttgc tgccacaggc
       61 gggccttacg ctccaaggct acagcatgtg ctaggcctca gcaggcagga gcatctctgc
      121 ctcccaaagc atctacctct tagcccctcg gagagatggc gatggatgtc acaaggagcc
          …………………………………..
     2461 ccccatcata tgacatgcca ccctctccat gcccaaccta agattgtgtg ggttttttaa
     2521 ttaaaaatgt taaaagtttt aaaaaaaaaa
//

Figure: A record from the GenBank database


Features section
The “Features” section includes annotation information about the gene and
gene product, as well as regions of biological significance reported in the sequence,
with identifiers and qualifiers. The “source” field provides the length of the
sequence, the scientific name of the organism, and the taxonomy identification
number. Optional information includes the clone source, the tissue type, and
the cell line. The “gene” field gives the location of the gene in the nucleotide
sequence and its name. For DNA entries, there is a “CDS” field, which gives
the boundaries of the sequence that can be translated into amino
acids. For eukaryotic DNA, this field also contains the locations of
exons, and the translated protein sequence is included.
The third section of the flat file is the sequence itself, starting with the label
“ORIGIN.” The format of the sequence display can be changed by choosing options
at a Display pull-down menu at the upper left corner. For DNA entries, there is a
BASE COUNT report that includes the numbers of A, G, C, and T in the sequence.
This section, for both DNA and protein sequences, ends with two forward slashes
(the “//” symbol).
In retrieving DNA or protein sequences from GenBank, the search can be limited
to different fields of annotation such as “organism,” “accession number,”
“authors,” and “publication date.” Alternatively, a number of search qualifiers can
be used, each defining one of the fields in a GenBank file. The qualifiers are similar
to but not the same as the field tags in PubMed. For example, in GenBank, [GENE]
represents the field for gene name, [AUTH] for author name, and [ORGN] for
organism name.
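As a small illustration of how such qualifiers are attached to search terms, the sketch below assembles a query string from field/value pairs. Only the three qualifiers named above are used. The function name, the keyword names, and the choice to join terms with the Boolean operator AND are assumptions made for this example; the function only builds text and does not contact GenBank.

```python
# Qualifiers named in the text; the dictionary keys are illustrative only.
QUALIFIERS = {"gene": "GENE", "author": "AUTH", "organism": "ORGN"}


def build_query(**terms):
    """Join field-restricted terms with AND, e.g. gene='GLK' -> 'GLK[GENE]'."""
    parts = []
    for field_name, value in terms.items():
        if field_name not in QUALIFIERS:
            raise KeyError("no qualifier known for field " + repr(field_name))
        parts.append(value + "[" + QUALIFIERS[field_name] + "]")
    return " AND ".join(parts)


print(build_query(gene="GLK", organism="Homo sapiens"))
# GLK[GENE] AND Homo sapiens[ORGN]
```

The resulting string can be typed into the GenBank text search box so that each term is matched only against its own field.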
Alternative Sequence Formats: FASTA
In addition to the GenBank format, there are many other sequence formats.
FASTA is one of the simplest and the most popular sequence formats because it
contains plain sequence information that is readable by many bioinformatics
analysis programs. It has a single definition line that begins with a right angle
bracket (>) followed by a sequence name. Sometimes, extra information such as gi
number or comments can be given, which are separated from the sequence name by
a “|” symbol. The extra information is considered optional and is ignored by
sequence analysis programs. The plain sequence in standard one-letter symbols
starts in the second line. Each line of sequence data is limited to sixty to eighty
characters in width. The drawback of this format is that much annotation
information is lost.

Human liver glucokinase (ATP:D-hexose 6-phosphotransferase) mRNA, complete cds
GenBank: M69051.1

>gi|183226|gb|M69051.1|HUMGKNASE Human liver glucokinase (ATP:D-hexose 6-phosphotransferase) mRNA, complete cds
AAGCCCTGGGCTGCCAGCCTCAGGCAGCTCTCCATCCAAGCAGCCGTTGCTGCCACAGGCGGGCCTTACG
CTCCAAGGCTACAGCATGTGCTAGGCCTCAGCAGGCAGGAGCATCTCTGCCTCCCAAAGCATCTACCTCT
TAGCCCCTCGGAGAGATGGCGATGGATGTCACAAGGAGCCAGGCCCAGACAGCCTTGACTCTGCCAGACT
CTCCTCTGAACTCGGGCCTCACATGGCCAACTGCTACTTGGAACAAATCGCCCCTTGGCTGGCAGATGTG
TTAACATGCCCAGACCAAGATCCCAACTCCCACAACCCAACTCCCAGGTAGAGCAGATCCTGGCAGAGTT
CCAGCTGCAGGAGGAGGACCTGAAGAAGGTGATGAGACGGATGCAGAAGGAGATGGACCGCGGCCTGAGG
CTGGAGACCCATGAAGAGGCCAGTGTGAAGATGCTGCCCACCTACGTGCGCTCCACCCCAGAAGGCTCAG
AAGTCGGGGACTTCCTCTCCCTGGACCTGGGTGGCACTAACTTCAGGGTGATGCTGGTGAAGGTGGGAGA
AGGTGAGGAGGGGCAGTGGAGCGTGAAGACCAAACACCAGACGTACTCCATCCCCGAGGACGCCATGACC
GGCACTGCTGAGATGCTCTTCGACTACATCTCTGAGTGCATCTCCGACTTCCTGGACAAGCATCAGATGA
AACACAAGAAGCTGCCCCTGGGCTTCACCTTCTCCTTTCCTGTGAGGCACGAAGACATCGATAAGGGCAT
CCTTCTCAACTGGACCAAGGGCTTCAAGGCCTCAGGAGCAGAAGGGAACAATGTCGTGGGGCTTCTGCGA
GACGCTATCAAACGGAGAGGGGACTTTGAAATGGATGTGGTGGCAATGGTGAATGACACGGTGGCCACGA
TGATCTCCTGCTACTACGAAGACCATCAGTGCGAGGTCGGCATGATCGTGGGCACGGGCTGCAATGCCTG
CTACATGGAGGAGATGCAGAATGTGGAGCTGGTGGAGGGGGACGAGGGCCGCATGTGCGTCAATACCGAG
TGGGGCGCCTTCGGGGACTCCGGCGAGCTGGACGAGTTCCTGCTGGAGTATGACCGCCTGGTGGACGAGA
GCTCTGCAAACCCCGGTCAGCAGCTGTATGAGAAGCTCATAGGTGGCAAGTACATGGGCGAGCTGGTGCG
GCTTGTGCTGCTCAGGCTCGTGGACGAAAACCTGCTCTTCCACGGGGAGGCCTCCGAGCAGCTGCGCACA
CGCGGAGCCTTCGAGACGCGCTTCGTGTCGCAGGTGGAGAGCGACACGGGCGACCGCAAGCAGATCTACA
ACATCCTGAGCACGCTGGGGCTGCGACCCTCGACCACCGACTGCGACATCGTGCGCCGCGCCTGCGAGAG
CGTGTCTACGCGCGCTGCGCACATGTGCTCGGCGGGGCTGGCGGGCGTCATCAACCGCATGCGCGAGAGC
CGCAGCGAGGACGTAATGCGCATCACTGTGGGCGTGGATGGCTCCGTGTACAAGCTGCACCCCAGCTTCA
AGGAGCGGTTCCATGCCAGCGTGCGCAGGCTGACGCCCAGCTGCGAGATCACCTTCATCGAGTCGGAGGA
GGGCAGTGGCCGGGGCGCGGCCCTGGTCTCGGCGGTGGCCTGTAAGAAGGCCTGTATGCTGGGCCAGTGA
GAGCAGTGGCCGCAAGCGCAGGGAGGATGCCACAGCCCCACAGCACCCAGGCTCCATGGGGAAGTGCTCC
CCACACGTGCTCGCAGCCTGGCGGGGCAGGAGGCCTGGCCTTGTCAGGACCCAGGCCGCCTGCCATACCG
CTGGGGAACAGAGCGGGCCTCTTCCCTCAGTTTTTCGGTGGGACAGCCCCAGGGCCCTAACGGGGGTGCG
GCAGGAGCAGGAACAGAGACTCTGGAAGCCCCCCACCTTTCTCGCTGGAATCAATTTCCCAGAAGGGAGT
TGCTCACTCAGGACTTTGATGCATTTCCACACTGTCAGAGCTGTTGGCCTCGCCTGGGCCCAGGCTCTGG
GAAGGGGTGCCCTCTGGATCCTGCTGTGGCCTCACTTCCCTGGGAACTCATCCTGTGTGGGGAGGCAGCT
CCAACAGCTTGACCAGACCTAGACCTGGGCCAAAAGGGCAGGCCAGGGGCTGCTCATCACCCAGTCCTGG
CCATTTTCTTGCCTGAGGCTCAAGAGGCCCAGGGAGCAATGGGAGGGGGCTCCATGGAGGAGGTGTCCCA
AGCTTTGAATACCCCCCAGAGACCTTTTCTCTCCCATACCATCACTGAGTGGCTTGTGATTCTGGGATGG
ACCCTCGCAGCAGGTGCAAGAGACAGAGCCCCCAAGCCTCTGCCCCAAGGGGCCCACAAAGGGGAGAAGG
GCCAGCCCTACATCTTCAGCTCCCATAGCGCTGGCTCAGGAAGAAACCCCAAGCAGCATTCAGCACACCC
CAAGGGACAACCCCATCATATGACATGCCACCCTCTCCATGCCCAACCTAAGATTGTGTGGGTTTTTTAA
TTAAAAATGTTAAAAGTTTTAAAAAAAAAA

Figure: DNA sequence in FASTA format
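Because the format is so simple, a FASTA file can be read with a few lines of code. The following is a minimal sketch of a reader, not the parser of any particular program; the function names are invented for this example, and the short record used to demonstrate it contains only the first 30 bases of the sequence above.

```python
def read_fasta(lines):
    """Yield (name, description, sequence) for each record in FASTA text lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined later
    if header is not None:
        yield make_record(header, chunks)


def make_record(header, chunks):
    """Split the definition line at the first space into name and description."""
    name, _, description = header.partition(" ")
    return name, description, "".join(chunks)


example = """>gi|183226|gb|M69051.1|HUMGKNASE Human liver glucokinase mRNA
AAGCCCTGGGCTGCCAGCCT
CAGGCAGCTC
""".splitlines()

records = list(read_fasta(example))
for name, description, sequence in records:
    print(name.split("|"), description, len(sequence))
```

Splitting the sequence name on the “|” symbol recovers the gi number and accession number carried in the definition line.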

Conversion of Sequence Formats


In sequence analysis and phylogenetic analysis, there is a frequent need to
convert between sequence formats. One of the most popular computer
programs for sequence format conversion is Readseq, written by Don Gilbert at
Indiana University. It recognizes sequences in almost any format and writes a new
file in an alternative format. The web interface version of the program can be found
at: http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi/.
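To make the idea of format conversion concrete, here is a minimal sketch that turns a GenBank flat-file record into FASTA text. It is not Readseq and does not reproduce how Readseq works; the function name is invented for this example, and the sketch assumes a single record with a one-line DEFINITION and a VERSION line.

```python
import textwrap


def genbank_to_fasta(text, width=70):
    """Convert one GenBank flat-file record (given as a string) to FASTA text."""
    accession = definition = ""
    bases = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence:
            # Sequence lines look like "  1 aagccctggg ctgccagcct ...";
            # keep the letter blocks and drop the position numbers.
            bases.extend(part for part in line.split() if part.isalpha())
        elif line.startswith("VERSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_sequence = True
    sequence = "".join(bases).upper()
    body = "\n".join(textwrap.wrap(sequence, width))
    return ">" + accession + " " + definition + "\n" + body + "\n"


record = """LOCUS       HUMGKNASE               2550 bp    mRNA    linear   PRI 29-SEP-1995
DEFINITION  Human liver glucokinase mRNA (first 60 bases only).
VERSION     M69051.1  GI:183226
ORIGIN
        1 aagccctggg ctgccagcct caggcagctc tccatccaag cagccgttgc tgccacaggc
//
"""
fasta = genbank_to_fasta(record)
print(fasta, end="")
```

As noted for the FASTA format in general, the conversion keeps only the identifier, the definition, and the sequence; the Features annotation is discarded.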

(II) Secondary Databases


Sequence annotation information in the primary database is often minimal.
To turn the raw sequence information into more sophisticated biological
knowledge, much post-processing of the sequence information is needed. This creates
the need for secondary databases, which contain computationally processed
sequence information derived from the primary databases. The amount of
computational processing work varies greatly among the secondary databases;
some are simple archives of translated sequence data from identified open reading
frames in DNA, whereas others provide additional annotation and information
related to higher levels of information regarding structure and functions. Examples
of secondary databases include TrEMBL and SWISS-PROT. There are also
secondary databases that relate to protein family classification according to
functions or structures. The Pfam and Blocks databases contain aligned protein
sequence information as well as derived motifs and patterns, which can be used for
classification of protein families and inference of protein functions. The DALI
database is a protein secondary structure database that is vital for protein structure
classification and threading analysis to identify distant evolutionary relationships
among proteins.
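The simplest kind of processing mentioned above, translating an open reading frame into protein, can be sketched as follows. The code uses the standard genetic code and translates the first 36 bases of the CDS of the glucokinase record M69051 (the CDS starts at position 286). The constant and function names are chosen for this illustration, and real annotation pipelines do considerably more than this.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


cds_start = "ATGCCCAGACCAAGATCCCAACTCCCACAACCCAAC"
print(translate(cds_start))  # MPRPRSQLPQPN
```

The result matches the first twelve residues of the /translation qualifier in the GenBank record (MPRPRSQLPQPN...).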
SWISS-PROT
A prominent example of secondary databases is SWISS-PROT
(http://www.expasy.ch/), which provides detailed sequence annotation that
includes structure, function, and protein family assignment. SWISS-PROT is an
annotated protein sequence database, which was created at the Department of
Medical Biochemistry of the University of Geneva and has been a collaborative
effort of the Department and the European Molecular Biology Laboratory (EMBL)
since 1987. SWISS-PROT is now an equal partnership between the EMBL and the
Swiss Institute of Bioinformatics (SIB). The EMBL activities are carried out by its
Hinxton Outstation, the European Bioinformatics Institute (EBI).
The SWISS-PROT protein sequence database consists of sequence entries.
Sequence entries are composed of different line types, each with their own format.
For standardization purposes the format of SWISS-PROT follows as closely as
possible that of the EMBL Nucleotide Sequence Database. A sample annotated
protein sequence entry is shown below.
The SWISS-PROT database distinguishes itself from other protein
sequence databases by three distinct criteria: (i) annotations, (ii) minimal
redundancy and (iii) integration with other databases. The sequence data are mainly
derived from TrEMBL, a database of translated nucleic acid sequences stored in the
EMBL database. The annotation of each entry is carefully curated by human
experts and thus is of good quality. The protein annotation includes function,
domain structure, catalytic sites, cofactor binding, posttranslational modification,
metabolic pathway information, disease association, and similarity with other
sequences. Much of this information is obtained from scientific literature and
entered by database curators. The annotation provides significant added value to
each original sequence record. The data record also provides cross-referencing
links to other online resources of interest. Other features such as very low
redundancy and high level of integration with other primary and secondary
databases make SWISS-PROT very popular among biologists. A recent effort to
combine SWISS-PROT, TrEMBL, and PIR led to the creation of the UniProt
database, which has larger coverage than any one of the three databases while at the
same time maintaining the original SWISS-PROT feature of low redundancy,
cross-references, and a high quality of annotation.

Rhodopsin [Homo sapiens]

NCBI Reference Sequence: NP_000530.1

LOCUS       NP_000530                348 aa            linear   PRI 15-MAR-2014
DEFINITION  rhodopsin [Homo sapiens].
ACCESSION   NP_000530
VERSION     NP_000530.1  GI:4506527
DBSOURCE    REFSEQ: accession NM_000539.3
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (residues 1 to 348)
  AUTHORS   Opefi CA, South K, Reynolds CA, Smith SO and Reeves PJ.
  TITLE     Retinitis pigmentosa mutants provide insight into the role of the
            N-terminal cap in rhodopsin folding, structure, and function
  JOURNAL   J. Biol. Chem. 288 (47), 33912-33926 (2013)
   PUBMED   24106275
  REMARK    GeneRIF: Retinitis pigmentosa mutants provide insight into the role
            of the N-terminal cap in rhodopsin folding, structure, and
            function.
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AC080007.26.
            This sequence is a reference standard in the RefSeqGene project.
            Summary: Retinitis pigmentosa is an inherited progressive disease
            which is a major cause of blindness in western communities. This
            is the transmembrane protein which, when
            photoexcited, initiates the visual transduction cascade. Defects in
            this gene are also one of the causes of congenital stationary night
            blindness. [provided by RefSeq, Jul 2008].
FEATURES             Location/Qualifiers
     source          1..348
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3q21-q24"
     Protein         1..348
                     /product="rhodopsin"
                     /note="opsin 2, rod pigment; opsin-2"
                     /calculated_mol_wt=38762
     Region          2..37
                     /region_name="Rhodopsin_N"
                     /note="Amino terminal of the G-protein receptor rhodopsin;
                     pfam10413"
                     /db_xref="CDD:150994"
     Site            37..61
                     /site_type="transmembrane region"
     CDS             1..348
                     /gene="RHO"
                     /gene_synonym="CSNBAD1; OPN2; RP4"
                     /coded_by="NM_000539.3:96..1142"
                     /db_xref="CCDS:CCDS3063.1"
                     /db_xref="GeneID:6010"
                     /db_xref="HGNC:10012"
                     /db_xref="HPRD:01584"
                     /db_xref="MIM:180380"
ORIGIN
        1 mngtegpnfy vpfsnatgvv rspfeypqyy laepwqfsml aaymfllivl gfpinfltly
       61 vtvqhkklrt plnyillnla vadlfmvlgg ftstlytslh gyfvfgptgc nlegffatlg
      121 geialwslvv laieryvvvc kpmsnfrfge nhaimgvaft wvmalacaap plagwsryip
      181 eglqcscgid yytlkpevnn esfviymfvv hftipmiiif fcygqlvftv keaaaqqqes
      241 attqkaekev trmviimvia flicwvpyas vafyifthqg snfgpifmti paffaksaai
      301 ynpviyimmn kqfrncmltt iccgknplgd deasatvskt etsqvapa
//
Figure: Protein database record storing information on the rhodopsin protein (the record shown is NCBI Reference Sequence NP_000530.1 in flat-file format)
TrEMBL:
To accommodate the growing influx of protein sequences without
compromising the quality of SWISS-PROT, the protein translations of the EMBL
nucleotide sequences that have not been properly curated by human annotators are
put into a supplemental database, TrEMBL (Translated EMBL,
http://www.expasy.org/sprot). This database serves as a kind of purgatory (or a
“halfway house”) for SWISS-PROT. Each TrEMBL entry is assigned a SWISS-
PROT-type accession number that would stay with it when the sequence is finally
manually checked and accepted into SWISS-PROT. To simplify curation, TrEMBL
entries are even formatted in the SWISS-PROT style. However, one should be alert
to the fact that TrEMBL entries are generated automatically, so their quality is not
guaranteed and their annotations should not be considered as solid as those of
authentic SWISS-PROT entries.
PIR:
The PIR (Protein Information Resource, http://pir.georgetown.edu) database
is an outgrowth of the Protein Sequence Database, originally created by Margaret
Dayhoff, and is currently maintained at Georgetown University in collaboration
with the Munich Information Center for Protein Sequences (MIPS,
http://mips.gsf.de/proj/protseqdb) in Munich, Germany, and the Japanese
International Protein Information Database. While technically also a curated
database, PIR is far less rigorous than SWISS-PROT in maintaining the quality of
its annotations. The advantage of PIR, however, is in its hierarchical organization.
The definitions of protein family and super-family employed in PIR are far
narrower than those used in most of the other protein databases, particularly
motif-based and structure-based ones. Thus, PIR super-families are often composed of
very similar proteins, which may be treated by other databases as members of the
same family. As a result, more distant relations between proteins (the least trivial
and therefore the most interesting ones) are often not represented in PIR at all.
Recently, PIR has intensified its protein classification efforts with the creation of
iProClass (http://pir.georgetown.edu/iproclass), a protein classification database.

(III) Specialized Databases


Specialized databases normally serve a specific research community or focus
on a particular organism. The content of these databases may be sequences or other
types of information. The sequences in these databases may overlap with a primary
database, but may also have new data submitted directly by authors. Because they
are often curated by experts in the field, they may have unique organizations and
additional annotations associated with the sequences. Many genome databases that
are taxonomy-specific fall within this category. Examples include
FlyBase, WormBase, AceDB, and TAIR. In addition, there are also specialized
databases that contain original data derived from functional analysis. For example,
the GenBank EST database and the Microarray Gene Expression Database at the
European Bioinformatics Institute (EBI) are some of the gene expression databases available.
EcoGene:
The EcoGene database provides a set of gene and protein sequences derived
from the genome sequence of Escherichia coli K-12. EcoGene is a source of re-
annotated sequences for the SWISS-PROT and Colibri databases. EcoGene is used
for genetic and physical map compilations in collaboration with the Coli Genetic
Stock Center. The EcoGene12 release includes 4293 genes. A literature survey
identified 717 proteins whose N-terminal amino acids have been verified by
sequencing. Users can search and retrieve individual EcoGene Pages or they can
download large datasets for incorporation into database management systems,
facilitating various genome-scale computational and functional analyses.
Saccharomyces Genome Database (SGD)
The Saccharomyces Genome Database (SGD) provides comprehensive
integrated biological information for the budding yeast Saccharomyces cerevisiae
along with search and analysis tools to explore these data, enabling the discovery of
functional relationships between sequence and gene products in fungi and higher
organisms. Researchers studying larger organisms, including models such as
Drosophila and Caenorhabditis, as well as plants and humans, represent growing
communities that look to SGD for information when their research leads to genes
with similarity to one of the many that are already well characterized in yeast.
Educators and students in genetics and cellular biology comprise another large
community that SGD serves, as do bioinformatics scientists who perform genome-
wide computational analyses, for either yeast or comparative studies.
ACeDB:
ACeDB is a genome database system started in 1989 by Jean Thierry-Mieg
(CNRS, Montpellier) and Richard Durbin (Sanger Institute). It was originally
developed for the Caenorhabditis elegans genome project from which its name was
derived: A C. elegans DataBase. However, the tools in it have been generalized to
be much more flexible and the same software is now used for many different
genomic databases from bacteria to fungi to plants to man.
Arabidopsis Information Resource (TAIR):
The Arabidopsis Information Resource (TAIR) collects information and
maintains a database of genetic and molecular biology data for Arabidopsis
thaliana, a widely used model plant. TAIR is managed by the nonprofit Phoenix
Bioinformatics Corporation and is supported through institutional, lab and personal
subscriptions. Prior funding was provided by the National Science Foundation.
The data in TAIR can be searched and viewed using the GBrowse or interactive
SeqViewer genome browsers.
FlyBase:
FlyBase is an online bioinformatics database and the primary repository of
genetic and molecular data of the extensively studied species and model organism,
Drosophila melanogaster. A wide range of data are presented in different formats.
Information in FlyBase originates from a variety of sources ranging from large-
scale genome projects to the primary research literature. These data types include
mutant phenotypes, molecular characterization of mutant alleles and other
deviations, cytological maps, wild-type expression patterns, anatomical images,
transgenic constructs and insertions, sequence-level gene models and molecular
classification of gene product functions. Query tools allow navigation of FlyBase
through DNA or protein sequence, by gene or mutant name, or through functional,
phenotypic, and anatomical data. The database offers several different query tools
in order to provide efficient access to the data available and facilitate the discovery
of significant relationships within the database. The FlyBase project is carried out
by a consortium of Drosophila researchers and computer scientists at Harvard
University and Indiana University in the United States, and University of
Cambridge in the United Kingdom.
Gramene:
Gramene (http://www.gramene.org/) is a curated, open-source,
integrated data resource for comparative functional genomics in crops and model
plant species. The Gramene database became a resource for major model and crop
plants including Arabidopsis, Brachypodium, maize, sorghum, poplar and grape in
addition to several species of rice. Gramene began with the addition of an Ensembl
genome browser and has expanded in the last decade to become a robust resource
for plant genomics hosting a wide array of data sets including quantitative trait loci
(QTL), metabolic pathways, genetic diversity, genes, proteins, germplasm,
literature, ontologies and a fully-structured markers and sequences database
integrated with genome browsers and maps from various published studies
(genetic, physical, bin, etc.). In addition, Gramene now hosts a variety of web
services including a Distributed Annotation Server (DAS), BLAST and a public
MySQL database. Twice a year, Gramene releases a major build of the database
and makes interim releases to correct errors or to make important updates to
software and/or data. Gramene currently hosts annotated whole genomes in over
two dozen plant species and partial assemblies for almost a dozen wild rice species.
Online Mendelian Inheritance in Man (OMIM)
Online Mendelian Inheritance in Man (OMIM) is a timely, authoritative
compendium of bibliographic material and observations on inherited disorders and
human genes. It is continuously updated. Curation of the database and editorial
decisions take place at The Johns Hopkins University School of Medicine. OMIM
provides authoritative free text overviews of genetic disorders and gene loci that
can be used by clinicians, researchers, students, and educators. In addition, OMIM
has many rich connections to relevant primary data resources such as bibliographic,
sequence, and map information.

Interconnection between Biological Databases


As mentioned, primary databases are central repositories and distributors of
raw sequence and structure information. They support nearly all other types of
biological databases. Therefore, in the biological community, there is a frequent
need for the secondary and specialized databases to connect to the primary
databases and to keep uploading sequence information. In addition, a user often
needs to get information from both primary and secondary databases to complete a
task because the information in a single database is often insufficient. Instead of
making users visit multiple databases, it is convenient for entries in a database to
be cross-referenced and linked to related entries in other databases that contain
additional information. All these create a demand for linking different databases.
The main barrier to linking different biological databases is format incompatibility:
current biological databases utilize all three types of database structures – flat files,
relational, and object oriented. The heterogeneous database structures limit
communication between databases. One solution to networking the databases is to
use a specification language called Common Object Request Broker
Architecture (CORBA), which allows database programs at different locations to
communicate in a network through an “interface broker” without having to
understand each other’s database structure. It works in a way similar to Hyper Text
Markup Language (HTML) for web pages, labeling database entries using a set of
common tags. A similar protocol called eXtensible Markup Language (XML) also
helps in bridging databases. In this format, each biological record is broken down
into small, basic components that are labeled with a hierarchical nesting of tags.
This database structure significantly improves the distribution and exchange of
complex sequence annotations between databases. Recently, a specialized protocol
for bioinformatics data exchange has been developed. It is the distributed
annotation system, which allows one computer to contact multiple servers and
retrieve dispersed sequence annotation information related to a particular sequence
and integrate the results into a single combined report.
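As a rough illustration of a record broken into components labeled with nested tags, the following Python sketch parses a small XML record with the standard library. The record content and the tag names (entry, accession, organism, feature) are invented for this example and do not follow the schema of any particular database.

```python
import xml.etree.ElementTree as ET

# A made-up sequence record; tag names and values are illustrative only.
RECORD = """
<entry>
  <accession>X00000</accession>
  <organism>Oryza sativa</organism>
  <features>
    <feature type="CDS" start="10" end="99"/>
    <feature type="exon" start="10" end="60"/>
  </features>
</entry>
"""


def parse_record(xml_text):
    """Turn the tagged record into a plain Python dictionary."""
    root = ET.fromstring(xml_text.strip())
    features = [
        (f.get("type"), int(f.get("start")), int(f.get("end")))
        for f in root.iter("feature")
    ]
    return {
        "accession": root.findtext("accession"),
        "organism": root.findtext("organism"),
        "features": features,
    }


record = parse_record(RECORD)
print(record["accession"], record["organism"])
for kind, start, end in record["features"]:
    print(kind, start, end)
```

Because every component carries its own tag, a receiving program can pick out only the parts it understands without knowing how the sending database stores the record internally.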

Chapter 5
Database retrieval system
Databases are fundamental to modern biological research, especially to
genomic studies. The goal of a biological database is twofold: information retrieval
and knowledge discovery.
Entrez:
Entrez (http://www.ncbi.nlm.nih.gov/) is a powerful federated search
engine, or web portal, that allows users to search for scientific information, DNA,
RNA and protein sequences, structures, and bibliographic references. It is run by
the National Center for Biotechnology Information (NCBI), a part of
the National Library of Medicine (NLM), which is itself a department of the
National Institutes of Health (NIH), which in turn is a part of the United States
Department of Health and Human Services. The name "Entrez" (a greeting
meaning "Come in!" in French) was chosen to reflect the spirit of welcoming the
public to search the content available from the NLM.
Entrez Global Query is an integrated search and retrieval system that
provides access to all databases simultaneously with a single query string and user
interface. Entrez can efficiently retrieve related sequences, structures, and
references. The Entrez system can provide views of gene and protein sequences and
chromosome maps. Some textbooks are also available online through the Entrez
system. The databases accessible through Entrez are among the most integrated
databases. Effective information retrieval involves the use of Boolean operators
(AND, OR, NOT). Entrez has additional user-friendly features to help conduct
complex searches. One such option is to use Limits, Preview/Index, and History to
narrow down the search space. Alternatively, one can use NCBI-specific field
qualifiers to conduct searches. To retrieve sequence information from NCBI
GenBank, an understanding of the format of GenBank sequence files is necessary.
It is also important to bear in mind that sequence data in these databases are less
than perfect. There are sequence and annotation errors. Biological databases are
also plagued by redundancy problems. There are various solutions to correct
annotation and reduce redundancy, for example, merging redundant sequences into
a single entry or storing highly redundant sequences in a separate database.
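The use of Boolean operators and field qualifiers can be sketched in a few lines of Python. The helper names field and combine are hypothetical, and the qualifiers [Title] and [Organism] are used here only as examples of the term[Field] pattern; the resulting string is simply what one would type into the search box.

```python
def field(term, qualifier=None):
    """Quote multi-word terms and attach an optional field qualifier."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{qualifier}]" if qualifier else text


def combine(operator, *parts):
    """Join query parts with an upper-case Boolean operator."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return "(" + f" {operator} ".join(parts) + ")"


# Entries with 'kinase' in the title, from rice, excluding hypothetical ones.
query = combine(
    "NOT",
    combine("AND", field("kinase", "Title"), field("Oryza sativa", "Organism")),
    field("hypothetical"),
)
print(query)
# prints: ((kinase[Title] AND "Oryza sativa"[Organism]) NOT hypothetical)
```

Parentheses make the order of evaluation explicit, which avoids surprises when AND, OR and NOT are mixed in one query.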
Sequence retrieval system
Sequence retrieval system (SRS; available at http://srs6.ebi.ac.uk/) is a
retrieval system maintained by the EBI, which is comparable to NCBI Entrez. It is
not as integrated as Entrez, but allows the user to query multiple databases
simultaneously, another good example of database integration. It also offers direct
access to certain sequence analysis applications such as sequence similarity
searching and Clustal sequence alignment. Queries can be launched using “Quick
Text Search” with only one query box in which to enter information. There are also
more elaborate submission forms, the “Standard Query Form” and the “Extended
Query Form.” The standard form allows four criteria (fields) to be used, which are
linked by Boolean operators. The extended form allows many more diversified
criteria and fields to be used. The search results contain the query sequence and
sequence annotation as well as links to literature, metabolic pathways, and other
biological databases.

Chapter 6
Cataloging biological databases
Primary nucleotide sequence database
The Primary Nucleotide Sequence Database consists of the following
databases.
• DNA Data Bank of Japan (National Institute of Genetics)
• European Nucleotide Archive (European Bioinformatics Institute)
• GenBank (National Center for Biotechnology Information)
The three databases, DDBJ (Japan), GenBank (USA) and European
Nucleotide Archive (Europe), are repositories for nucleotide sequence data from
all organisms. All three databases accept nucleotide sequence submissions, and
then exchange new and updated data on a daily basis to achieve optimal
synchronization between them. These three databases are primary databases, as
they house original sequence data.

Meta database:
These databases of databases collect data from different sources and make
them available in new and more convenient form, or with an emphasis on a
particular disease or organism.
• BioGraph - A knowledge discovery service based on the integration of
more than 20 heterogeneous databases
• Bioinformatic Harvester - Integrating 26 major protein/gene resources.
• Neuroscience Information Framework (University of California San
Diego) - Integrates hundreds of neuroscience-relevant resources.
• Entrez (National Center for Biotechnology Information)
• Enzyme Portal - Integrates enzyme information such as small-molecule
chemistry, biochemical pathways and drug compounds. (European
Bioinformatics Institute)
• MetaBase (KOBIC) - A user contributed database of biological databases.
• PathogenPortal - A repository linking to the Bioinformatics Resource
Centers (BRCs) sponsored by the National Institute of Allergy and
Infectious Diseases (NIAID)
• SOURCE (Stanford University) encapsulates the genetics and molecular
biology of genes from the genomes of Homo sapiens, Mus musculus,
and Rattus norvegicus into easy to navigate GeneReports
Genome database:
These databases collect organism genome sequences, annotate and analyze
them, and provide public access. These databases may hold the genomes of many
species, or a single model organism genome.
• EcoCyc is a genome database that describes the genome and the
biochemical machinery of the model organism E. coli K-12
• SGD is a database that describes the genome and the biochemical and
molecular machinery of budding yeast (Saccharomyces cerevisiae).
• ACeDB is a database system of a nematode (Caenorhabditis elegans)
• TAIR is a genome database system of the widely used model plant
Arabidopsis thaliana
• FlyBase is an online bioinformatics database and the primary repository of
genetic and molecular data of the extensively studied species and model
organism, Drosophila melanogaster
• Gramene (http://www.gramene.org/) is a curated, open-source, integrated
data resource for comparative functional genomics in crops and model
plant species
• OMIM (Online Mendelian Inheritance in Man) is a database on inherited
disorders and human genes.
• CAMERA is a database and resource for microbial genomics
and metagenomics
• Corn is the Maize Genetics and Genomics Database (MaizeGDB).
• PATRIC, the PathoSystems Resource Integration Center
• RegulonDB is a model of the complex regulation of
transcription initiation, or regulatory network, of the cell E. coli K-12.

Protein sequence databases:


• UniProt Universal Protein Resource (EBI, Swiss Institute of
Bioinformatics, PIR)
• Protein Information Resource (Georgetown University Medical Center
(GUMC))
• Swiss-Prot Protein Knowledgebase (Swiss Institute of Bioinformatics)
• PEDANT: Protein Extraction, Description and ANalysis Tool
• PROSITE: Database of Protein Families and Domains
• Database of Interacting Proteins (Univ. of California)
• Pfam: Protein families database of alignments and HMMs (Sanger
Institute)
• PRINTS: a compendium of protein fingerprints (Manchester
University). It is a database of (super-family and family) annotations for all
completely sequenced organisms
• InterPro: classifies proteins into families and predicts the presence of
domains and sites.
Proteomics database:
• Proteomics Identifications Database (PRIDE): a public repository for
proteomics data, containing protein and peptide identifications and their
associated supporting evidence as well as details of post-translational
modifications. (European Bioinformatics Institute)
• MitoMiner - A mitochondrial proteomics database integrating large-scale
experimental datasets from mass spectrometry and GFP studies for 12
species. (MRC Mitochondrial Biology Unit)
Protein structure databases:
• Protein Data Bank (PDB) comprising:
• Protein Data Bank in Europe (PDBe)
• Protein Data Bank in Japan (PDBj)
• Research Collaboratory for Structural Bioinformatics (RCSB)
Secondary databases
• SCOP (Structural Classification of Proteins)
• CATH Protein Structure Classification
• PDBsum
Protein model databases:
• Swiss-model Server and Repository for Protein Structure Models
• ModBase Database of Comparative Protein Structure Models
(Sali Lab, UCSF)
• Protein Model Portal (PMP): meta database that combines several
databases of protein structure models (Biozentrum, Basel, Switzerland)
RNA databases
• Rfam, a database of RNA families
• miRBase, the microRNA database
• snoRNAdb, a database of snoRNAs
• lncRNAdb, a database of lncRNAs
• piRNAbank, a database of piRNAs
• GtRNAdb, a database of genomic tRNAs
• SILVA, a database of ribosomal RNAs
• RDP, the Ribosomal Database Project
Carbohydrate structure databases
• EuroCarbDB, a repository for both carbohydrate sequences/structures and
experimental data.
Protein-protein interactions:
• BIND Biomolecular Interaction Network Database
• BioGRID, A General Repository for Interaction Datasets (Samuel
Lunenfeld Research Institute)
• CCSB Interactome
• DIP Database of Interacting Proteins
• IntAct molecular interaction database: a central, standards-compliant
repository of molecular interactions, including protein–protein, protein–
small molecule and protein–nucleic acid interactions.
• NetPro
• STRING: a database of known and predicted protein-protein
interactions. (EMBL)
• MINT: Molecular INTeraction database


Metabolic pathway databases:
• Small Molecule Pathway Database (SMPDB)
• BioCyc Database Collection including EcoCyc and MetaCyc
• KEGG PATHWAY Database (Univ. of Kyoto)
• MANET database (University of Illinois)
Microarray databases
• ArrayExpress (European Bioinformatics Institute)
• Gene Expression Omnibus (National Center for Biotechnology
Information)
• GPX (Scottish Centre for Genomic Technology and Informatics)
• Stanford Microarray Database (SMD) (Stanford University)
• Genevestigator - Expression Search Engine (Nebion AG)
PCR and quantitative PCR primer databases:
• PathoOligoDB: A free QPCR oligo database for pathogens
• RTPrimerDB - a public primers and probes database for real-time PCR
reactions
Taxonomic databases:
• Catalogue of Life source databases
• Encyclopedia of Life
• Integrated Taxonomic Information System
• EzTaxon-e, database for the identification of prokaryotes based on 16S
ribosomal RNA gene sequences

Chapter 7
Pairwise Sequence Alignment
In this document we illustrate how to perform pairwise sequence
alignments using the R/Bioconductor Biostrings package through the use of the
pairwiseAlignment function. This function aligns a set of pattern strings to a
subject string in a global, local, or overlap (ends-free) fashion with or without
affine gaps using either a fixed or quality-based substitution scoring scheme.
Each of these pairwise sequence alignment problems is solved by
maximizing the alignment score. An alignment score is determined by the type of
pairwise sequence alignment (global, local, overlap), which sets the ranges for the
substrings; the substitution scoring scheme, which sets the distance between
aligned characters; and the gap penalties, which are divided into opening and
extension components. The optimal pairwise sequence alignment is the pairwise
sequence alignment with the largest score for the specified alignment type,
substitution scoring scheme, and gap penalties.
There are 3 methods for pairwise sequence alignment:
1) dot plot, 2) global alignment, and 3) local alignment.

Dot Plot
The simplest method is the dot plot. One sequence is written out
horizontally, and the other sequence is written out vertically, along the top and side
of an m x n grid, where m and n are the lengths of the two sequences. A dot is
placed in a cell in the grid wherever the two sequences match. A diagonal line in
the grid visually shows where the two sequences have sequence identity. Web-
based dot plot implementations can be found here:
• http://www.vivo.colostate.edu/molkit/dnadot/ – for nucleotide sequence only
• http://emboss.bioinformatics.nl/cgi-bin/emboss/dotmatcher - for both nucleic
acid & protein sequence with standard EMBOSS scoring matrices
• http://www.changbioscience.com/res/resd.html – for any text string
• Stand-alone dot plot programs operable via either a GUI or command-line
can be found in EMBOSS (JEMBOSS is the Java GUI)
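The dot plot idea is simple enough to sketch in a few lines of Python. This is a minimal text-only version written for illustration (the function names dot_plot and render are not taken from any of the tools listed above); it marks every matching cell and applies no window or threshold filtering.

```python
def dot_plot(seq_x, seq_y):
    """Return a grid (list of rows) that is True where residues match.

    seq_x runs along the top (columns), seq_y down the side (rows).
    """
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Draw the grid as text, with '*' for a match and '.' otherwise."""
    lines = ["  " + seq_x]
    for residue, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATCACA"))
```

Runs of '*' along a diagonal correspond to stretches where the two sequences are identical; scattered single dots are chance matches, which real implementations usually suppress with a sliding window.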

Global Alignment:
The algorithm published by Needleman and Wunsch in 1970 for alignment
of two protein sequences was the first application of dynamic programming to
biological sequence analysis. The Needleman-Wunsch algorithm finds the best-
scoring global alignment between two sequences. Global alignments are most
useful when the two sequences being compared are of similar lengths, and not too
divergent.
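A compact sketch of the Needleman-Wunsch procedure in Python is given here for orientation. It assumes the simplest possible scoring (match +1, mismatch -1, and a fixed penalty of -1 per gap position); these values and the function name are choices made for this example, not those of any particular program.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table; each cell keeps the best of three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(top)
print(bottom)
```

With these settings the two example sequences align with six matches and one gap, giving a score of 5. When several alignments share the best score, the traceback above returns only one of them.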

Local Alignment:
Real life is often complicated, and we observe that genes, and the proteins
they encode, have undergone exon-shuffling, recombination, insertions, deletions,
and even fusions. Many proteins exhibit modular architecture. In searching
databases for similar sequences, it is useful to find sequences that have similar
domains or functional motifs. Smith & Waterman (1981) published an application
of dynamic programming to find optimal local alignments. The algorithm is similar
to Needleman-Wunsch, but negative cell values are reset to zero, and the traceback
procedure starts from the highest scoring cell.
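The two differences from the global method (resetting negative cells to zero and starting the traceback at the highest scoring cell) can be seen in the following Python sketch. The scoring values (match +2, mismatch -1, gap -2) and the function name are assumptions made for this example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, local_a, local_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # never let a cell go negative
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


result = smith_waterman("TTTGATTACATTT", "CCCGATTACACCC")
print(result)
```

In this example the flanking residues differ completely, so the method reports only the shared GATTACA core, which is exactly the behaviour wanted when looking for a common domain or motif inside otherwise unrelated sequences.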

Scoring Matrices
The Needleman-Wunsch and Smith-Waterman algorithms require a
scoring matrix. The scoring matrix assigns a positive score for a match, and a
penalty for a mismatch. For nucleotide sequence alignments, the simplest scoring
matrix awards +1 for a match, and -1 for a mismatch. The blastn algorithm at NCBI
scores +5 for a match and -4 for a mismatch. These scoring matrices treat all
mutations (mismatches) equally. In reality, transitions (pyrimidine -> pyrimidine
and purine -> purine) occur much more frequently than transversions
(pyrimidine -> purine and vice versa). For aligning non-protein coding DNA
sequences, a transition/transversion scoring matrix may be more appropriate. For
aligning DNA sequences that encode proteins, alignment of the protein amino acid
sequences will almost always be more reliable.
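A transition/transversion scoring scheme can be sketched as follows. The numerical values (+1 for a match, -1 for a transition, -2 for a transversion) are arbitrary illustrative choices, not values taken from any published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair, penalising transversions more."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# Build the full 4 x 4 matrix as a dictionary keyed by nucleotide pairs.
matrix = {(x, y): substitution_score(x, y) for x in "ACGT" for y in "ACGT"}

print("  " + "  ".join("ACGT"))
for x in "ACGT":
    print(x, " ".join(f"{matrix[(x, y)]:2d}" for y in "ACGT"))
```

Such a matrix can be used in place of the single match/mismatch pair in either the global or the local alignment method.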
For protein sequence alignments, the scoring matrices are more
complicated. The goal is to reflect evolutionary processes. Some amino acid
sequence changes can arise from a single nucleotide change, whereas other amino
acid changes require two nucleotide changes. Some amino acid changes are less
likely to affect protein structure or function than other amino acid changes.
Dayhoff used alignments of highly conserved proteins to assess what amino
acid changes were likely to be accepted – Point Accepted Mutations. From these
data she devised a 20 x 20 amino acid substitution matrix for PAM-1, a unit of
evolutionary change resulting in 1 accepted mutation per 100 amino acids. From
there she calculated other matrices such as PAM-2 or PAM-30 or PAM-250, where
the PAM-n matrix is derived by multiplying the PAM-1 matrix by itself n times.
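The repeated multiplication can be illustrated with a toy example. The sketch below uses a made-up alphabet of only two residue types with a 1% chance of change per step, so the numbers are not Dayhoff's; the real PAM-1 matrix is 20 x 20, and the published scoring matrices are log-odds values derived from such probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(m, n):
    """Multiply matrix m by itself n times (n = 0 gives the identity)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM-1": each row gives the probability that a residue stays the
# same (0.99) or is replaced by the other residue type (0.01).
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After 250 steps the two rows are almost identical, which shows why high-numbered PAM matrices are suited to comparing distantly related sequences: most of the original signal has been lost.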
The BLOSUM matrices (BLOcks SUbstitution Matrix) derive their amino
acid substitution frequencies from the Blocks database of un-gapped local multiple
sequence alignments. BLOSUM62 is calculated from sequences with 62% identity
or less; BLOSUM80 from sequences with 80% identity or less.

Gap penalty
Sequence alignments usually require insertion of gaps, reflecting insertion
or deletion mutations. If a nucleotide or amino acid in one sequence is aligned to a
gap in the target sequence, then this should be penalized as a mismatch. However,
gaps at the ends of sequences should perhaps not incur any penalty. Moreover, a
single insertion or deletion mutation could result in a contiguous gap of multiple
residues. Therefore, a single gap that is 3 residues long should incur less penalty
than 3 different gaps, of one residue each.
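This reasoning leads to the affine gap penalty, in which opening a gap is expensive and extending it is cheap. The sketch below assumes one common convention (the first gap position pays the opening penalty and each further position pays the extension penalty) and arbitrary values of 10 and 1; programs differ in both the convention and the defaults.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one contiguous gap of the given length."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_penalty(3)            # a single gap of 3 residues
three_short_gaps = 3 * affine_gap_penalty(1)    # three separate 1-residue gaps
print(one_long_gap, three_short_gaps)
# prints: 12 30
```

With a fixed per-position penalty the two cases would cost the same, which is why most alignment programs ask for separate gap opening and gap extension values.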

Chapter 8
Multiple sequence alignment
Multiple Sequence Alignment (MSA) is a sequence alignment of three or
more biological sequences, generally protein, DNA, or RNA. In many cases, the
input set of query sequences is assumed to have an evolutionary relationship by
which they share a lineage and are descended from a common ancestor. From the
resulting MSA, sequence homology can be inferred and phylogenetic analysis can
be conducted to assess the sequences' shared evolutionary origins. Visual
depictions of the alignment illustrate mutation events such
as point mutations (single amino acid or nucleotide changes) that appear as
differing characters in a single alignment column, and insertion or deletion
mutations (indels or gaps) that appear as hyphens in one or more of the sequences
in the alignment. Multiple sequence alignment is often used to assess sequence
conservation of protein domains, tertiary and secondary structures, and even
individual amino acids or nucleotides.
Multiple sequence alignment also refers to the process of aligning such a
sequence set. Because three or more sequences of biologically relevant length can
be difficult and are almost always time-consuming to align by hand, computational
algorithms are used to produce and analyze the alignments. MSAs require more
sophisticated methodologies than pairwise alignment because they are more
computationally complex. Most multiple sequence alignment programs use
heuristic methods rather than global optimization because identifying the optimal
alignment between more than a few sequences of moderate length is prohibitively
computationally expensive.
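The cost of exact alignment can be made concrete with a small calculation. Extending the pairwise dynamic programming table to N sequences requires one cell for every combination of prefix lengths, so the table size is the product of (length + 1) over all sequences. The function name below is made up for this illustration.

```python
def lattice_cells(lengths):
    """Number of cells in the dynamic programming table for these sequences."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


# Sequences of 300 residues each, a modest protein length.
for count in (2, 3, 5, 10):
    print(count, "sequences:", lattice_cells([300] * count), "cells")
```

Two sequences need fewer than a hundred thousand cells, but ten sequences of the same length need more than 10^24, which is why practical programs build the alignment progressively from pairwise comparisons instead.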

Chapter 9
Practical Exercises
Exercise 1:
Searching the scientific literature and sequences
Theory:
The most fundamental skill in bioinformatics is the ability to carry out an
efficient and comprehensive search of the scientific literature to find out what is
known about a specific subject. All of you are familiar with web search engines and
while they can be useful, they also turn up many items that have never undergone
the test of scientific peer review. Thus, this exercise is NOT a search of the World
Wide Web, but will introduce you to searching the published scientific literature using
a database such as MEDLINE, Biological Abstracts or Chemical Abstracts. This
exercise will focus on the ‘Entrez browser’ entry to the National Library of Medicine
database MEDLINE (PubMed).
PubMed is a database service of the National Library of Medicine that cites
articles from MEDLINE and life science journals.

Procedure:

1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox).
2. In the address bar, type the URL (http://www.ncbi.nlm.nih.gov/pubmed) and
press the ‘Enter’ key on your keyboard.
The Home page of your site (here PubMed) as shown below will appear. A
search window and a text box will be displayed where you will type a few key
words relevant to your search topic.
To search scientific or bibliographic literature in PubMed, type key word(s) or
phrase(s) into the query box (e.g., a subject, author and/or journal).

3. Type your key words and click the ‘Search’ button.

If necessary, combine search terms with connector words: “AND,” “OR,” or
“NOT” using upper case letters. PubMed offers alternative searching options:
the Auto Suggest drop-down menu appears when entering words, and the Titles
with your search terms option may appear after a search.
PubMed displays a list of Results in Summary format after clicking on the
‘Search’ button.
To retrieve more information about a citation(s), use the Display Settings link to
change how the results are formatted, sorted and displayed.
Filters are available in the left navigation bar and may be used to limit or
focus searches. Click on a term to activate or deactivate the filter. Multiple
filters may be selected. The Filters activated message appears above the
search results list and these limits remain in effect until removed or cleared.
To reveal additional filter options, click the Choose additional filters or more
links. Check desired selections then click the Show button.

4. For any entry in the Results list, click associated author names.
Search details, located in the right navigation column, provide information on
how PubMed ran a search. PubMed looks first for the entire word or phrase as
a MeSH term, then for journal titles, then authors. PubMed also searches “All
Fields” for the term. Search details shows how PubMed maps terms to MeSH
headings and subheadings. Changes to the search may be made in the Details
box; click Search to run the updated search strategy.

5. Save what you like to your hard drive by choosing your browser’s File: Save
as option.
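The same kind of search can also be prepared from a program. NCBI provides a programmatic interface called E-utilities, which is outside the scope of this exercise; the Python sketch below only builds the address of an ESearch request and does not contact the server. The function name and the example search terms are made up for illustration, and the parameters shown (db, term, retmax) should be checked against the current NCBI documentation before use.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=20):
    """Build (but do not send) an ESearch request for PubMed."""
    params = {"db": "pubmed", "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Boolean connector words are written in upper case, as in the web form.
url = pubmed_search_url("drought tolerance AND rice NOT review")
print(url)
```

Pasting the printed address into a browser returns the matching PubMed identifiers in XML form rather than the usual results page.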

Exercise-2:
Characterization of a Known Gene
URL: (http://www3.ncbi.nlm.nih.gov/gquery/gquery.fcgi)
Theory:
In this exercise, you will use ‘Entrez’ to find entries for the coding
sequence of a gene of interest. You will use glucokinase as an initial example
(glucokinase is the enzyme that catalyzes the initial step of glycolysis in liver and
several other cell types).
Procedure:

1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox).
2. In the address bar, type the URL
(https://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi) and press the ‘Enter’ key on
your keyboard.
The Home page of your site as shown below will appear.
3. In the left column, click the ‘Nucleotide’ button.
A search window and a text box will be displayed where you will enter your
desired nucleotide query.
4. In the top search box type in ‘glucokinase’ (without the quotes) and click on
the Go Button.
You will get about 1000 entries listed on more than 50 pages of 20 entries
each. This is an unwieldy number, so you will have to figure out a way to
narrow your search. There are two general ways to narrow a search: the use
of the Limits menu within Entrez or the use of Boolean operators (AND, OR,
NOT).
The present search will pick up all entries in the database that have the word
glucokinase ANYWHERE in the entry (e.g. an entry that contains a line
stating "Gene X has nothing to do with glucokinase" will come up as a hit in
this search). You can eliminate some entries by adding NOT similar NOT
hypothetical after glucokinase in the search box. This will eliminate entries listed
only because they are noted to be similar to glucokinase.
Additional filters can be applied to your search by using the Limits tab just
below the search box.
5. Click on the Limits tab.
If you are interested only in the coding regions of glucokinase genes (i.e.
DNA sequences obtained from mRNA for glucokinase), you can eliminate
genomic sequences with their large introns.
6. In the "Molecule" pull-down menu select mRNA and click on the "GO"
button.
Note how many hits are now listed. You still have entries that are not
glucokinase. To further narrow your search, click on the Limits tab one more
time. In the top left drop down menu change from All Fields to Title. This will
limit this search to those entries that have glucokinase in their title line. Still,
you will note that your entries include not only glucokinase but also
glucokinase regulatory proteins and other entries that have the term
glucokinase in the title.

Result:
• Clicking on the accession number for one of your entries will bring up the
full Nucleotide sequence information. Most of the information in an entry is
self-explanatory, but if you scroll down to the Features entry you should find
a CDS entry. This specifies that part of the nucleotide sequence below that
actually codes for a protein (often you will find untranslated regions at both
the 3' and 5' ends of a sequence). In addition, the translated sequence is given
in the one letter amino acid shorthand just above the full nucleotide
sequence.
• To obtain the sequence in a form which can be analyzed by a variety of gene
analysis software, select FASTA from the Display pull down menu. The
browser will give you a page which has the sequence without any line
numbers or breaks. Save the sequence by selecting the material beginning
with the > and going up to the last nucleotide (be sure to avoid the line above
the > and below the last nucleotide) and copying this to a word processor
program. The > line is recognized as a header (comment) line by most analysis
software. You can change the font to Courier 10 point to obtain the proper spacing
and lines.
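To show how software treats the > line and the sequence lines, here is a minimal FASTA reader in Python. The function name and the example record (identifier and sequence) are made up for illustration; real files obtained from GenBank are read in the same way.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # finish the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>demo_1 made-up example sequence
ATGGCCATTGTAATGGGCCGC
TGAAAGGGTGCCCGATAG
"""

for name, sequence in parse_fasta(EXAMPLE):
    print(name, len(sequence))
```

The reader joins the wrapped sequence lines back into one string, which is why stray text above the > line or below the last residue must not be copied along with the record.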

OBTAINING A PROTEIN FASTA ENTRY


• To compare protein sequences, you will want to obtain the protein FASTA
output.
• To obtain this, change the Display menu back to the GenBank display and
scroll down until you reach the CDS information. Click on the link in the
line that begins /protein_id="xxx1234" (i.e. whatever the assigned protein id
number is).
• This will change the display to GenPept and bring up a page which shows
some of the same information, but is limited to the amino acid sequence. In
this page, change the Display menu to FASTA to obtain an output similar to
the nucleotide FASTA output (an index line which begins with > and an
amino acid sequence). You can copy the index line and sequence to a word
processor for use later (once you are in the word processor, again change the
text to courier 10 pt to retain line spacing).
• SAVE THE PROTEIN FASTA OUTPUTS (glucokinase from mammalian
species of your choice) to a word processor program. You will compare the
sequences of these proteins in a future exercise.

Exercise 3:
Finding open reading frames (ORFs) using NCBI ORF Finder
URL: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Theory:
Open reading frames are regions of DNA that encode protein. These
DNA sequences are first transcribed into mRNA and then translated into protein. By
examining the sequence alone, you can determine the sequence of amino acids that
will appear in the final protein. In translation, a codon of three nucleotides
determines which amino acid will be added next to the growing protein chain. It is
important, then, to decide at which nucleotide translation starts and where it stops,
that is, to determine the correct open reading frame. A sequence can be read in
three frames in each direction, i.e., 1, 2, 3 in the forward direction and -1, -2, -3 in
the reverse direction. The reading frame that is used determines which amino acids
will be encoded by a gene. Typically one reading frame is used in translating a gene
(in eukaryotes) and this is often the largest ORF. Once the ORF is known, the DNA
sequence can be translated into the corresponding amino acid sequence.
An ORF starts with an ATG (methionine) in most species and ends in
a stop codon (TAA, TAG or TGA in DNA; UAA, UAG or UGA in mRNA), indicated
by * in the protein sequence.
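The search over six reading frames can be sketched in Python as follows. This is a simplified illustration, not the method used by NCBI ORF Finder: it assumes the standard start codon ATG and the three standard stop codons, reports positions counted from zero on the strand being read, and numbers the reverse frames -1, -2, -3 by their offset on the reverse complement. All function names are made up for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """Yield (start, end) for each ATG...stop stretch in one frame.

    'end' is exclusive and includes the stop codon.
    """
    start = None
    for pos in range(offset, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if start is None and codon == "ATG":
            start = pos
        elif start is not None and codon in STOP_CODONS:
            yield start, pos + 3
            start = None


def find_orfs(dna, min_length=6):
    """Return ORFs from all six frames as (frame, start, end, sequence).

    The list is sorted with the longest ORF first.
    """
    dna = dna.upper()
    found = []
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frame = strand * (offset + 1)
            for start, end in orfs_in_frame(seq, offset):
                if end - start >= min_length:
                    found.append((frame, start, end, seq[start:end]))
    return sorted(found, key=lambda orf: len(orf[3]), reverse=True)


for orf in find_orfs("AAATGGCCTAAGG"):
    print(orf)
# prints: (3, 2, 11, 'ATGGCCTAA')
```

In the short example the only ORF lies in forward frame 3, because the ATG begins at the third nucleotide; reading the same sequence in frame 1 or 2 gives no start codon at all.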

Procedure:
1. To browse the World Wide Web, just open your internet browser (Internet
Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type www.ncbi.nlm.nih.gov/gorf/gorf.html and press the
‘Enter’ key on your keyboard or click the Go button.
Here one can see a text field to enter the GI or accession number of the query
sequence, a text box to enter the query sequence in FASTA format and a
button to run the ORF finder.

[65]
Bioinforrmatics Practical Manual K. C. Saamal et al.

3. Type the nucleotide sequence in the box provided (in FASTA format) or copy
your nucleotide sequence from a .txt file or Word document and paste the
sequence in the input box.
(FASTA format is the simplest sequence format: it starts with a ‘>’ symbol
followed by the sequence ID and other comments, with the sequence itself on
the following lines.)
There is a drop-down menu to select a genetic code. It contains 20
different codon dictionaries that cover different organisms and
organelles. Select from the list the one you want for the search. The
first one is "Standard", which is the default; select it here. (For example,
in the standard code AUG codes for methionine, but in the Vertebrate
Mitochondrial Code and the Yeast Mitochondrial Code AUA also codes
for methionine.)
4. Now click the ORF Finder button to get the result.
The result shows all six possible reading frames of the entered query
sequence. The ORFs are listed according to their size, together with a
graphical representation of the sequence.
5. Click on the green region which represents the ORF in the sequence, to see the
ORF.
Once you click, it will turn purple, indicating that the particular
ORF is selected. The selected ORF is also indicated in the list. It also
displays the length and location of the selected ORF.


One can see the sequence of the selected ORF which actually codes for the
protein. The user can find the start codon, stop codon and the total number of
amino acids from the sequence. Now click on the Accept button.
You can also perform a BLAST search for the particular ORF that you
selected. Select the appropriate program and database, then click on the
BLAST button.
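The FASTA format used for the input sequence in step 3 is simple enough to read with a short script. The following Python sketch is an illustration only; the function name parse_fasta and the example record are hypothetical and are not part of any NCBI tool.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    A record starts at a line beginning with '>'; all following lines
    up to the next '>' are joined to form the sequence.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-line record used only to show the layout.
example = ">seq1 hypothetical example record\nATGGCC\nTAAGG\n"
records = parse_fasta(example)
```

Here records holds one tuple whose first element is the text after the ‘>’ symbol and whose second element is the sequence with its line breaks removed.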


Exercise 4:
Translating an unknown DNA Sequence
URL: http://web.expasy.org/translate/
Theory:
One of the most basic exercises in bioinformatics is determining if a nucleic
acid sequence actually codes for a protein. This is complicated by the fact that you
generally do not know which strand is the coding strand (i.e. whether the sequence
itself or its complementary strand will be transcribed into mRNA) nor the correct
reading frame (whether the sequence should be read three bases at a time starting
with the first nucleotide, the second or the third). Both these questions are resolved
by translating both strands in all three reading frames and looking for the one that
gives the longest amino acid sequence before a stop codon is encountered. Since
there are 64 codons and three of these codons (UAA, UAG and UGA) do not code
for any amino acid (i.e. are stop signals), you expect a stop codon to appear on
average about once every 21 codons (64/3) if you are reading a sequence in the
incorrect frame. However, things are not always that clear cut and it is possible for
an out of frame translation to extend to over 100 amino acids before a stop codon
is reached.
In the exercise below you will be given an unknown DNA sequence and
asked to use a web tool to translate the sequence into an amino acid sequence and
hopefully identify the proper reading frame. You will then save this amino acid
sequence to a word processing program for use in the next exercise.
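The six-frame translation described above can be sketched in Python as follows. This is a minimal illustration assuming the standard genetic code; it is not the code behind the ExPASy Translate tool, and the function names (six_frames, longest_open_stretch, best_frame) are hypothetical. Stop codons are written as a hyphen, matching the ‘Compact’ output described in the procedure.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code in the conventional T, C, A, G order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Random sequence: 3 of the 64 codons are stops, so one stop is expected
# roughly every 64 / 3 codons in an incorrect frame.
EXPECTED_CODONS_PER_STOP = 64 / 3


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; '-' marks a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X").replace("*", "-")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(dna):
    """Return a dict mapping frame (1, 2, 3, -1, -2, -3) to its translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest run of amino acids not interrupted by a stop ('-')."""
    return max(protein.split("-"), key=len)


def best_frame(dna):
    """Frame whose translation has the longest stop-free stretch."""
    frames = six_frames(dna)
    return max(frames, key=lambda f: len(longest_open_stretch(frames[f])))
```

Choosing the frame by the longest stop-free stretch is only the simple heuristic described in the text; as noted above, an incorrect frame can occasionally give a long open stretch, which is why the candidate translations are checked against a protein database in the next exercise.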

Requirement
The sequence might be obtained by sequencing a clone from a cDNA
library or by isolating a DNA fragment from PCR amplification.
Alternatively, you can retrieve a sequence from a nucleic acid sequence
database as studied earlier.


Procedure:
1. To browse the World Wide Web, just open your internet browser (Internet
explorer, Google chrome or Mozilla Firefox).
2. In the address bar, type the URL http://web.expasy.org/translate/ and press
‘Enter key’ on your keyboard.
A new window will open to access the translation tool. Translating the DNA
sequence is done by reading the nucleotide sequence three bases at a time and
then looking at a table of the genetic code to arrive at an amino acid sequence.
This program examines the input sequence in all six possible frames (i.e.
reading the sequence from 5' to 3' and from 3' to 5' starting with nucleotide at
position 1, 2 and 3 separately). What you typically look for in identifying the
proper translation is the frame that gives the longest amino acid sequence
before a stop codon is encountered. Since there are 64 codons and three code
for nonsense, you expect a stop codon to appear on average about once every
21 codons if you simply read a sequence "out of frame". However, "on
average" is just that, and it is possible to have an incorrect reading frame give
an extended sequence with no stop codons. The next exercise will address that
problem.
3. Type or paste your sequence in the sequence window in the ExPASy link for
translation.
Under Output format select either ‘Compact’ or ‘Verbose’. ‘Compact’ gives
the amino acid sequence as one letter codes with stop codons indicated by a
hyphen whereas ‘Verbose’ gives the amino acid sequence as three letter codes.
4. Select the Output format by clicking either ‘Compact’ or ‘Verbose’.
5. Click on Translate Sequence.
Often only one reading frame will give you a translation with no stop codons,
but this is not always the case. If you get multiple possible reading frames,
one way to determine which is most likely the true frame is to use the
BLAST program to determine if the sequence corresponds to any known
protein sequence.
Using the "Compact" output to get one letter sequences, copy the one letter
sequence of the best reading frame (i.e. one with no stop codons) and paste it
into the window below labeled "Best Guess".
6. Copy the longest amino acid sequence (i.e. no hyphen) of one of the other
reading frames to the window below labeled "Second Best".
If you have two reading frames without a stop codon, simply copy each to the
boxes below.
7. Copy and save each sequence to a word processor for use in the next exercise.
Best reading frame:
Amino acid sequence from next best frame (don't include the stop):

Conclusion:
You have now been introduced to the use of a translation program to
identify the most probable reading frame and to translate an unknown sequence.
What if none of the six possible reading frames gives an extended amino acid
sequence? This could be due to errors in your sequence (you need to
sequence both strands to ensure an accurate sequence). Or you may have isolated a
non-coding region of DNA (e.g. the 5' and 3' ends of most genes do not
code for protein, but serve regulatory functions). There are many untranslated
regions of DNA (introns, pseudogenes, etc.). You can now take the two amino acid
sequences and determine if either matches any known sequences in the huge
protein sequence database.


Exercise 5:
Identifying a gene using BLAST program
URL: http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory:
Once you have identified a likely reading frame for your DNA sequence, you
will want to see if it corresponds to any known protein. Alternatively, if you
obtained two reading frames of nearly equal length, you will need to decide which
is correct. To accomplish these tasks, you can compare your sequences to all of the
known protein sequences in the databases using a search tool known as BLAST.
BLAST comes in a variety of formats depending on whether you are using a DNA
sequence or an amino acid sequence and depending on whether you are searching
through nucleotide or protein databases.
You are going to do this exercise twice. First, you will take the longest open
reading frame and use it as a query sequence with BLASTP. After saving those
results, you will then take the next longest amino acid sequence and use it as your
query sequence.

Procedure
1. To browse the World Wide Web, just open your favourite internet browser
(Internet explorer, Google chrome or Mozilla Firefox etc).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The blast page at the NCBI appears as shown below.
3. Under Basic BLAST heading, click Protein BLAST [blastp] link
A search page will appear as shown below.
4. Paste your longest translated sequence into the first box below.
5. Choose UniProtKB/Swiss-Prot from the Choose Database pull-down menu.
6. Deselect the Do CD-search box.


Scroll down this page to the Format Section - in this section use the pull-down
menus to change the Descriptions to 10 and the Alignments to 10. Change the
Layout to One Window. You will leave the Options section settings on the
Default values.
7. Click the BLAST button at the bottom or top of the screen.
A new window will appear which gives an estimate of how long the search will
take and which lists conserved domains in your query sequence. You may want to
copy your request ID number, but usually this isn't necessary. Wait until the
indicated time has passed.
8. Press the Format button.
The results of your search will be displayed. If similarity to any known protein
has been found, you will see a color window (which may or may not print)
showing the degree of similarity and the range of similarity. Perfect matches
show up as red, next best as purple, mediocre as green, poor matches as blue
and very poor or no match as black. If you scroll down you will see the best
10 alignments (make sure you have limited this to 10!). If the DNA sequence
has already been identified it should show up as a perfect match (score
generally between 200-400, but could be lower depending on the size of the
peptide analyzed; the E value will be down around 10^-50 to 10^-100). The
E value tells you how many hits with at least this score would be expected by
chance in a database of this size, so the closer it is to zero, the more
significant the match.
Copy the line below the color alignment window which shows the sequence
producing the best alignment. This will give you the identifiers (gi number
and other identifying numbers) you will need to download the full protein
from the database for characterization. Save this information.
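The relationship between the score and the E value can be made concrete with a small calculation. The sketch below uses the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. It is a simplified illustration: the lengths used in the test values are hypothetical, and NCBI BLAST applies corrections (effective lengths) that are not modelled here.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)
```

For example, with these assumptions a bit score of 200 for a 300-residue query against a database of 10^8 residues gives an E value on the order of 10^-50, in line with the range quoted above for a strong match. For very small E values the P value and the E value are practically identical.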


Exercise 6:
Finding Domains in Protein Sequences
Theory:
Many proteins which have been classified as "globular" (i.e. folded into a
compact globular shape) appear to be composed of several distinct folded regions
joined by more extended loops of amino acids. These globular sub-regions are
termed "domains" and can range in size from 20-300 amino acids. Some domains
have been associated with specific functions (e.g. catalysis of peptide bond
cleavage, ATP binding, etc), but this association must be tentative since ligand
binding or formation of an active site often takes place at the surface where two
domains interact. Identification of domains can help us to assign a newly
discovered open reading frame to a family of proteins. Domains in a newly
discovered protein can be recognized by sequence homology with known domains
in well characterized proteins, but this is still not a precise science. While new
techniques of analysis are being introduced, at the present the most user-friendly
and visual domain identification program is the SMART domain annotation
database.

Procedure
1. To browse the World Wide Web, just open your favourite internet browser
(Internet explorer, Google chrome or Mozilla Firefox etc).
2. In the address bar, type http://smart.embl-
heidelberg.de/smart/set_mode.cgi?NORMAL=1 and press ‘Enter key’ on your
keyboard.
The requested page appears as shown below.
3. Copy the full sequence of the protein identified in the previous Exercise and
paste it into the SMART sequence window.
4. Click the Sequence SMART button.


Depending on how busy the SMART server is, it may take a few minutes for a
result to be returned. BE PATIENT!!
The results will show you a live diagram with the domains within the query
sequence. Each domain has a unique color and shape and annotation.
Scroll down the window to see a table that lists each identified domain
together with its putative (probable) start and end point in your sequence and
the E-value assigned to that identification (the smaller the E-value, the
less likely it is that the identification is simply due to chance).
5. Click the mouse over the domain on the figure or in the table.
It will bring up the domain name or abbreviation and the amino acid sequence
assigned to this domain at the very bottom of the window. With a PC, right
click on the image to save it as a PNG file. It can be opened in Photoshop or
most other image viewers.
6. Click on the domain name
It will bring up more detailed information on the domain.
Pick out one domain to examine in detail.
What are the characteristics (amino acid sequences) that define that domain?
What kinds of proteins contain this domain?
What is the function of that domain?
How similar is your sequence to the defined domain?


Exercise 7:
Nucleotide BLAST (BLASTn):
URL - http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory:
The BLAST (Basic Local Alignment Search Tool) programs have been
designed for speed to find high scoring local alignments. BLAST uses a heuristic
algorithm which seeks local as opposed to global alignments and is therefore able
to detect relationships among sequences which share only isolated regions of
similarity.
BlastN is a pairwise sequence comparison tool developed by NCBI. The
programme compares a nucleotide query sequence against the NCBI nucleotide
databases. Compared with megablast, which is tuned for highly similar
sequences, it is better at finding sequences that are similar, but not identical,
to your query.
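To illustrate what a "local" alignment score is, the following Python sketch computes the optimal local alignment score of two sequences by dynamic programming (the Smith-Waterman method). This is not how BLAST works internally: BLAST uses a faster heuristic that approximates this kind of result. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman).

    Scores are never allowed to drop below zero, so an alignment can
    start and end anywhere inside the two sequences.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Two sequences that share only the short region ACGT, such as TTTACGTAAA and GGGACGTCCC, receive the same score as ACGT aligned with itself, because the unrelated flanking regions are simply ignored. A global alignment would instead be penalized for the mismatching ends.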

Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet explorer, Google chrome or Mozilla Firefox etc).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The blast page at the NCBI appears as shown below.
3. Under Basic BLAST heading, click Nucleotide BLAST [blastn] link
A search page will appear as shown below.
4. Paste your nucleotide sequence into the first box below.
5. Choose the nr database from the Choose Database pull-down menu.
Set any other options as required; otherwise leave them unchanged and the
programme will use all the default options.
6. Deselect the Do CD-search box.


Scroll down this page to the Format Section - in this section use the pull-down
menus to change the Descriptions to 10 and the Alignments to 10. Change the
Layout to One Window. You will leave the Options section settings on the
Default values and will address these choices in a more advanced exercise.
7. Click the BLAST button at the bottom or top of the screen.
8. Then click the BlastN option at the end of the submission page.
After a few seconds the result of your BLAST search will appear in a new
window. The first part shows a Graphic View of the matches, followed by a
list of the matches and then the Individual Alignments. The result page
displays a number of hits. Out of the large number of sequences in the
database, these hits are reported and ranked by E-value; the hits with the
lowest E-values are the most similar to the query.


Exercise 8:
Protein BLAST (BLASTp):
URL - http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory:
BlastP is a pairwise sequence comparison tool developed by NCBI. The
programme uses the BLAST algorithm to compare an amino acid query
sequence against the NCBI protein sequence databases. Finding similar proteins
that are already characterized helps to infer the structure and function of
the query protein.

Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet explorer, Google chrome or Mozilla Firefox etc).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The blast page at the NCBI appears as shown below.
3. Under Basic BLAST heading, click Protein BLAST [blastp] link
A search page will appear as shown below.
4. Paste the amino acid sequence of your protein, or your longest translated
sequence, into the first box below.
5. Choose UniProtKB/Swiss-Prot from the Choose Database pull-down menu.
Set any other options as required; otherwise leave them unchanged and the
programme will use all the default options.
6. Deselect the Do CD-search box.
Scroll down this page to the Format Section - in this section use the pull-down
menus to change the Descriptions to 10 and the Alignments to 10. Change the
Layout to One Window. You will leave the Options section settings on the
Default values and will address these choices in a more advanced exercise.
7. Click the BLAST button at the bottom or top of the screen.
After a few seconds the result of the BLAST search will appear in a new
window. The first part shows a Graphic View of the matches, followed by a
list of the matches and then the Individual Alignments. Here a number of hits
are displayed. Out of the large number of sequences in the database, these
hits are reported and ranked by E-value; the hits with the lowest E-values
are the most similar to the query.


Exercise 9:
Translated BLAST (Blastx)
URL - http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory:
Blastx searches protein database using a translated nucleotide query. Blastx
uses the BLAST algorithm to compare the six-frame conceptual translation
products of a nucleotide query sequence (both strands) against a protein sequence
database. The BLAST (Basic Local Alignment Search Tool) programs have been
designed for speed to find high scoring local alignments. BLAST uses a heuristic
algorithm which seeks local as opposed to global alignments and is therefore able
to detect relationships among sequences which share only isolated regions of
similarity.

Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet explorer, Google chrome or Mozilla Firefox etc).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The blast page at the NCBI appears as shown below.
3. Under Basic BLAST heading, click the blastx link.
After clicking, a new page appears. This is the sequence submission page.
4. Enter the nucleotide sequence into the Search dialog box.
5. Use the default settings to search the Non-redundant protein sequences (nr)
database.
Set any other options as required; otherwise leave them unchanged and the
programme will use all the default options. Select the search and format
options that you want for your data output. For some sequences you may get
hundreds of hits, so you may want to limit the number on the first search.
Recheck that all the information is correct.
6. To submit the request, click the BLAST button at the bottom or top of the
screen.
After a few seconds the result of the blastx programme will appear in a new
window. A number of hits will be displayed. The blastx report is very similar to
the blastn report. The first part shows a Graphic View of the matches,
followed by a list of the matches and then the Individual Alignments. The
BLASTX search with the same sequence shows a significant number of very
good matches. The hits are ranked by E-value, with the lowest E-values first.


Exercise 10:
tBLASTX
URL: http://www.ncbi.nlm.nih.Gov/BLAST
Theory:
TBlastx compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database using the
BLAST algorithm.

Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet explorer, Google chrome or Mozilla Firefox etc).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The blast page at the NCBI appears as shown below.
3. Under Basic BLAST heading, click the tblastx link.
This searches a translated nucleotide database using a translated nucleotide
query. After clicking, a new page appears. This is the sequence submission page.
4. Enter the nucleotide sequence into the Search dialog box.
5. Use the default settings to search the Nucleotide collection (nr/nt)
database, since tblastx searches a nucleotide database.
Set any other options as required; otherwise leave them unchanged and the
programme will use all the default options. Select the search and format
options that you want for your data output. For some sequences you may get
hundreds of hits, so you may want to limit the number on the first search.
Recheck that all the information is correct.
6. To submit the request, click the BLAST button at the bottom or top of the
submission page.
After a few seconds the result of the BLAST search will appear in a new
window. The hits are ranked by E-value; the hits with the lowest E-values
are the most similar to the query.


Exercise 11:
PSI-BLAST (Position-Specific Iterated BLAST)
URL: http://www.ncbi.nlm.nih.Gov/BLAST
Theory:
Position-Specific Iterated BLAST (PSI-BLAST) was introduced in 1997. PSI-BLAST
is an extension of BLAST in which position-specific scoring is used. This
means that, after a first search, the matches found within a certain significance
threshold are used to build a "profile" of the sequence family. The profile is
then used in place of the original query to refine the search, and the procedure
is repeated. This allows more distantly related matches to be found.
The profiles are represented as position-specific scoring matrices (PSSMs).
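The idea of a position-specific profile can be illustrated with a small Python sketch that turns a set of aligned sequences into a matrix of log-odds scores, one column per alignment position. This is a deliberately simplified illustration and not the PSI-BLAST implementation: it assumes a uniform background frequency for all 20 amino acids and a fixed pseudocount, whereas PSI-BLAST uses sequence weighting and more refined pseudocounts. The function names and the three example sequences in the note below are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build log2-odds scores per column from equal-length aligned sequences.

    Assumes a uniform background frequency (1/20 for proteins); gap
    characters are ignored when counting residues in a column.
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total)
                                   / background)
                     for aa in alphabet})
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(pssm, seq))
```

With a profile built from the aligned sequences MKV, MRV and MKI, the sequence MKV scores higher than MRV, and both score far higher than an unrelated sequence such as AAA. Unlike a general substitution matrix, the score for a given residue depends on the position at which it occurs.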
Procedure
1. To browse the World Wide Web, just open your favourite internet browser
(Internet explorer, Google chrome or Mozilla Firefox etc).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The blast page at the NCBI appears as shown below.
3. Under Basic BLAST heading, click Protein BLAST [blastp] link
A search page will appear as shown below.
4. Under program selection heading, click PSI-BLAST (Position-Specific
Iterated BLAST) button
5. Paste your protein sequence in search window section or simply write the GI
number of the protein.
6. Choose UniProtKB/Swiss-Prot from the Choose Database pull-down menu.
Set any other options as required; otherwise leave them unchanged and the
programme will use all the default options. Enter the threshold value that
determines how divergent the proteins you are interested in finding may be.
The rest of the parameters are generally left at their default settings.
7. Then click on the BLAST button to initiate the first round of PSI BLAST
search.
The search can take longer than the time shown on screen. Be patient. An
intermediate page (entitled Reformatting Blast) appears containing a ‘Format’
button.
8. Click this Format button.
A new page appears in a new window entitled Results of Blast. This is where
your results will be displayed when ready.
9. Inspect the results.
There are many very similar sequences and only a few distantly related ones.
10. Click on the Run PSI-BLAST iteration 2 button (near the top of the page).
The Reformatting Blast window pops up.
11. Click the Format button on the Reformatting Blast window.
The results will appear in the Results of Blast window. This can be
repeated until the search converges, i.e. until no further new sequences are
found.
12. Continue repeating Steps 10-11.
PSI-BLAST output consists of many iterations. Each iteration has a hit list,
the alignments and the parameters used for the analysis. Each iteration page
contains an iteration button to start the next iteration.

Conclusion
PSI-BLAST is among the most widely used protein similarity search programs
in the BLAST family. PSI-BLAST offers exciting opportunities to
discover new types of relationship in protein databases and can be used to infer
the evolutionary origins of proteins. PSI-BLAST is a highly sensitive homology
search program.


Exercise 12:
Sequence alignment through FASTA
URL: http://www.ebi.ac.uk/Tools/sss/fasta/
Theory:
This exercise compares a protein sequence to a protein sequence database using
the FASTA algorithm (Pearson and Lipman, 1988; Pearson, 1996). The EBI service
provides sequence similarity searching against protein databases using the FASTA
suite of programs. FASTA provides a heuristic search with a protein query. FASTX
and FASTY translate a DNA query. Optimal searches are available with SSEARCH
(local), GGSEARCH (global) and GLSEARCH (global query, local database).
Search speed and selectivity are controlled with the ktup (word size) parameter. For
protein comparisons, ktup = 2 by default; ktup = 1 is more sensitive but slower.
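The role of the ktup parameter can be illustrated with a short Python sketch of the first stage of a FASTA-style search: finding words of length ktup that are shared by the query and a database sequence, and counting how many of them fall on the same diagonal. This is a simplified illustration only, not the FASTA program itself, and the function name and example sequences are hypothetical.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count shared words of length ktup on each diagonal.

    The diagonal of a word hit is (position in subject - position in
    query); many hits on one diagonal suggest an ungapped similar region.
    """
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)  # word -> positions in query
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    return diagonals
```

With the query MKVLA and the subject GGMKVLA, a search with ktup = 2 finds four word hits on diagonal 2, while ktup = 1 finds five single-residue hits on the same diagonal. Shorter words give more hits to examine, which is why a smaller ktup is more sensitive but slower.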
Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet explorer, Google chrome or Mozilla Firefox etc).
2. In the address bar, type http://www.ebi.ac.uk/Tools/sss/fasta/ and press ‘Enter
key’ on your keyboard.
The FASTA homepage will appear, showing options such as program,
database, result, search title, your email, matrix, gap extension, k-tup,
expectation value limits, DNA strand, histogram, molecule type, score,
alignment, sequence range, database range, filter and statistical estimates.
3. Under Basic Program heading, click Protein link.
4. Select the database from the database pull-down menu.
5. Paste your sequence or upload the file containing the sequence.
6. Set your parameters.
Matrix: This option sets the scoring matrix used for searching the
database.
Gap penalties: There are two options, Gap opening and Gap extension. The
default gap opening penalty is -12 for proteins and -16 for DNA. The gap
extension penalty is -2 for proteins and -4 for DNA.
Score: This option sets the maximum number of reported scores in the
output file.
K-tup: Change this value to set the word length the search should use.
Strand: This option lets you choose which strand to search against the
respective database.
Histogram: Setting this option to ‘yes’ will display the search histogram of
the expected frequency of chance occurrence of the database matches found.
Expectation value upper limit and lower limit: These options limit which scores
and alignments are displayed. The default upper limit is 10.0 for protein
searches.
• Sequence range: This option allows the user to denote which region within
the query sequence should be searched.
• Database range: This option sets the sequence range to search within the
databases.
• Multype: The multype option is used to choose the molecule type of the
query in use for a search.
• Filter: This option can eliminate statistically significant but biologically
uninteresting reports from the FASTA search.
• Statistical estimates: This option is used for statistical calculations.
• Set any other options as required; otherwise leave them unchanged and the
programme will use all the default options.
7. Then click the Submit button.
For some proteins you may get hundreds of hits, so you may want to limit
the number on the first search. Recheck that all the information is correct. A
histogram along with the alignments will be returned.


Exercise 13:
Editing and analyzing multiple sequence alignment using Jalview
URL: http://www.jalview.org/
Theory:
Jalview is a free program, written in the Java programming language, for
multiple sequence alignment editing, visualization and analysis. It has a wide
range of functions and is used to view and edit sequence alignments, analyze
them with phylogenetic trees and principal components analysis (PCA) plots
and explore molecular structures and annotation.

Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet explorer, Google chrome or Mozilla Firefox etc).
2. In the address bar, type http://www.jalview.org/ and press ‘Enter key’ on your
keyboard.
3. Paste the MSA or unaligned sequences into the sequence window, then click
the Run button so that an initial result page will appear.
4. The browser then returns a page which loads the Java applet into the memory
of the computer; inside this page the word Jalview appears as a button.
Click on the Jalview button to obtain the result.
Jalview can also run offline. For this, load Jalview onto the computer, then
select the File menu and click the Work Offline option in the internet browser.
In the Jalview window select File and then click Input Alignment via Textbox.
Paste the MSA in the text box and then select the format that
corresponds to the MSA from the alignment format pull-down menu. Then
click the Apply button.
Result:
• In the result page of Jalview, edit a group of sequences by using the editing
window.
• In the pop-up window click the Add New Group button and then the Add
Selected IDs button. Then click the Apply and Close buttons to finish.
• Then choose Edit and then Group Editing Mode from the main menu,
then click anywhere on a sequence and drag to the left or right to insert
or remove gaps.
• Save the alignment that is produced from Jalview by using the following
options:
• Choose File and then Output Alignment via Textbox from the Jalview
main menu.
• Then select the alignment format and click the Apply button so that a
formatted alignment appears in the window.
• Then open a Microsoft Word document. Select, copy and paste the
alignment from the Jalview textbox to the Word document and save the
document.
• For publishing the multiple sequence alignment use the Boxshade utility,
which shades the columns according to their level of conservation and
produces files that are useful for publication.
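The "level of conservation" of an alignment column, on which such shading is based, can be approximated with a very simple measure: the fraction of sequences that carry the most common residue in that column. The Python sketch below shows this measure only as an illustration; it is not the scoring scheme used by Jalview or by the shading utility, and the function name and example alignment are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue per column.

    `alignment` is a list of equal-length aligned sequences in which
    '-' marks a gap; gaps never count as the conserved residue.
    """
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column made up of gaps only
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores
```

For the three aligned sequences ACG-, ACGT and AGGT, the first and third columns are fully conserved (score 1.0), while the second and fourth columns each have two of the three sequences in agreement.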
Conclusion:
Jalview is an online and offline tool for editing and analyzing an MSA, which
gives a good-looking format that can then be used for publishing.


Exercise 14:
Making multiple alignment with T-coffee
URL: http://www.ch.embnet.org/soiltware/Tcoffee.html
Theory:
T-Coffee (Tree-based Consistency Objective Function For alignment
Evaluation) is a multiple sequence alignment program that uses a progressive
approach. It generates a library of pairwise alignments to guide the multiple
sequence alignment.
T-Coffee has two main features. First, it provides a simple and flexible
means of generating multiple sequence alignments using heterogeneous data
sources. The data from these sources are provided to T-Coffee via a library of
pairwise alignments. The second main feature of T-Coffee is the optimization
method, which is used to find the multiple alignment that best fits the pairwise
alignments in the input library. It uses a so-called progressive strategy, which is
similar to that used in ClustalW. This has the advantage of being fast and
relatively robust. Unlike a purely progressive method, T-Coffee is able to consider
information from all of the sequences during each alignment step, not just those
being aligned at that stage.
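The idea of a library of pairwise alignments can be illustrated with a minimal Python sketch. It is not the T-Coffee implementation: the function names are hypothetical, and weighting each aligned residue pair by the percent identity of the pairwise alignment it came from is an assumption made for illustration. The sketch stops at building the library and does not perform the optimization or progressive alignment steps.

```python
from collections import defaultdict


def aligned_pairs(row_a, row_b):
    """Yield (i, j) residue index pairs matched in a gapped pairwise alignment."""
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            yield i, j
        if a != "-":
            i += 1
        if b != "-":
            j += 1


def percent_identity(row_a, row_b):
    """Percent of ungapped columns that hold identical residues."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not cols:
        return 0.0
    return 100.0 * sum(a == b for a, b in cols) / len(cols)


def build_library(pairwise):
    """Map (name_a, i, name_b, j) to a weight (assumed: percent identity).

    `pairwise` maps (name_a, name_b) to a tuple of two gapped rows.
    """
    library = defaultdict(float)
    for (name_a, name_b), (row_a, row_b) in pairwise.items():
        weight = percent_identity(row_a, row_b)
        for i, j in aligned_pairs(row_a, row_b):
            library[(name_a, i, name_b, j)] += weight
    return dict(library)


# Hypothetical example: one pairwise alignment with a single gap.
example = {("seqA", "seqB"): ("GARFIELD", "GARF-ELD")}
library = build_library(example)
```

Each key records that residue i of one sequence was aligned with residue j of another, so alignments coming from different sources can be pooled into the same dictionary.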

Procedure
1. Open any internet browser such as Internet Explorer, Google Chrome etc.
2. In the address bar type www.ncbi.nlm.nih.gov and press Enter; the NCBI home
page appears.
3. Search for any two or more nucleotide sequences in FASTA format and copy
them into a Microsoft Word page.
4. Open a new browser tab and search for T-Coffee.
5. Point the browser to the T-Coffee server home page.
6. Move the mouse over Make a multiple alignment in the table and click
Regular. The multiple alignment page appears.

7. Enter your e-mail address on that page so that, if the job times out, the
result can be returned by e-mail.
8. Paste the sequences in the box used for alignment and then click the T-Coffee
button at the top or the bottom of the page to obtain the result.
9. Click on the Submit button.
The T-Coffee alignment result then appears.
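Before pasting the sequences into the server form, it can help to confirm that the copied text really contains two or more FASTA records. The following is a minimal Python sketch of such a check; the function names and the allowed nucleotide alphabet are assumptions for illustration and are not part of the T-Coffee server.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def ready_for_alignment(text, alphabet="ACGTUN"):
    """True if text holds two or more non-empty records using only `alphabet`."""
    records = parse_fasta(text)
    if len(records) < 2:
        return False
    return all(bool(seq) and set(seq) <= set(alphabet) for _, seq in records)


# Hypothetical input copied from a word processor.
pasted = ">s1\nACGT\nAC\n>s2\nGGTA\n"
records = parse_fasta(pasted)
```

A multiple alignment needs at least two sequences, so a single record or an empty sequence is reported as not ready.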

Result
T-Coffee returns a table that contains hyperlinks to the results.
The first row of the table is dedicated to the multiple sequence alignment and
includes:
¾ Aln - A text file in the same format as ClustalW alignments.
¾ HTML - A colourised alignment where every residue appears on a
background that indicates the quality of its alignment. Red indicates high
quality segments while blue indicates untrusted regions.
¾ Pdf - A PDF file that is easy to display and print.
The second row is dedicated to the phylogenetic tree and includes:
¾ Dnd - The guide tree or dendrogram generated by T-Coffee in Newick
format.
¾ Ph - A true phylogenetic tree in Newick format, built using the neighbor
joining method.
¾ Png - A picture of the phylogenetic tree that corresponds to the Ph
file.
Advantages
¾ It often produces more accurate alignments than other progressive
methods such as ClustalW.
¾ It is equipped with many different tools and modules such as CORE,
M-Coffee and EXPRESSO for structure alignment, evaluation and combining
alignments.
¾ T-Coffee can deal with many input formats, including FASTA, Swiss-Prot
and PIR (Protein Information Resource).
¾ T-Coffee produces sequence alignments in various formats so that they can
be used as input for another program. It also produces a colourised
alignment where every residue appears on a background that indicates the
quality of its alignment, in (.html) and (.pdf) format.
¾ It can produce a true phylogenetic tree in Newick format by using the
Neighbor Joining method.
¾ It can work with a list of DNA, RNA or protein sequences.
¾ T-Coffee can evaluate the quality of any multiple sequence alignment
using the CORE server.
Disadvantages
¾ It takes a longer time to align multiple sequences than other programs.
¾ It has been cited in a limited number of peer-reviewed journals compared to
ClustalW.
Conclusion:
T-Coffee is a progressive alignment method and in many ways it is similar
to ClustalW, but the main difference is that T-Coffee does not directly use
substitution matrices to align sequences. Instead, it relies on the pairwise
alignments produced by other methods, which are collected in its library.


Exercise 15
Searching Online Mendelian Inheritance in Man (OMIM)
URL:- http://www.ncbi.nlm.nih.gov/omim / www.omim.org
Theory:-
OMIM is a comprehensive, authoritative compendium of human genes and
genetic phenotypes that is freely available and updated daily. OMIM is authored
and edited at the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins
University School of Medicine, under the direction of Dr. Ada Hamosh. Its official
home is omim.org (according to NCBI).


Procedure:-
¾ Open any internet browser (Internet Explorer, Google Chrome or Mozilla
Firefox).
¾ In the address bar type www.google.com.
¾ In the search box type OMIM and press Enter or click Search.
¾ Different websites with brief explanations will appear on a new page.
¾ Study the listed websites and click any one of them, e.g.
http://www.ncbi.nlm.nih.gov/omim or www.omim.org, until you find the
required information.
¾ When you open www.omim.org you get its home page.
¾ In the search box of that page type the name of any human gene, for
example insulin, then click on Search.
¾ The new page lists the different entries related to that gene.
¾ From the list, click on the desired entry.
¾ For example, click on #610549 Diabetes Mellitus, Insulin-Resistant,
with Acanthosis Nigricans.


Exercise 16
Studying about Protein Structure Database
URL: http://www.rcsb.org/pdb/home/home.do
http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.cathdb.info/
Theory:
The Protein Data Bank (PDB) is a repository for the three-dimensional
structural data of large biological molecules, such as proteins and nucleic acids.
The data, typically obtained by X-ray crystallography or NMR spectroscopy and
submitted by biologists and biochemists from around the world, are freely
accessible on the Internet via the websites of its member organizations (PDBe,
PDBj, and RCSB). The PDB is overseen by an organization called the Worldwide
Protein Data Bank (wwPDB).
Procedure:
1. Open any internet browser such as Internet Explorer, Google Chrome or
Mozilla Firefox.
2. In the address bar type www.google.com.
3. Then in the search bar type PDB, SCOP or CATH.
4. Press Enter or click the search button.
5. Different websites with brief explanations appear on a new page.
6. Study the listed websites and open any of them until you find the required
information, then write it down.


Exercise 17
Depositing sequences in database
URL: BankIt [http://www.ncbi.nlm.nih.gov/BankIt/],
Sequin [http://www.ncbi.nlm.nih.gov/Sequin/index.html]
The GenBank sequence database is an annotated collection of all publicly
available nucleotide sequences and their protein translations. This database is
produced at the National Center for Biotechnology Information (NCBI) as part of an
international collaboration with the European Molecular Biology Laboratory
(EMBL) Data Library from the European Bioinformatics Institute (EBI) and the
DNA Data Bank of Japan (DDBJ). GenBank and its collaborators receive
sequences produced in laboratories throughout the world from more than 100,000
distinct organisms. GenBank continues to grow at an exponential rate, doubling
every 10 months.
Direct submissions are made to GenBank using either
1. BankIt [http://www.ncbi.nlm.nih.gov/BankIt/], which is a Web-based form,
or
2. Sequin [http://www.ncbi.nlm.nih.gov/Sequin/index.html], which is a
stand-alone submission program.
Upon receipt of a sequence submission, the GenBank staff assign an
accession number to the sequence and perform quality assurance checks. The
submissions are then released to the public database, where the entries are
retrievable by Entrez or downloadable by FTP. Bulk submissions of Expressed
Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence
(GSS), and High-Throughput Genome Sequence (HTGS) data are most often
submitted by large-scale sequencing centres. The GenBank direct submissions
group also processes complete microbial genome sequences.

Submission Tool:
Direct submissions to GenBank are prepared using one of two submission
tools, BankIt or Sequin.


Exercise 18:
Submitting sequences to GenBank through ‘BankIt’
URL: [http://www.ncbi.nlm.nih.gov/BankIt/]
Theory:
BankIt is a Web-based form that is a convenient and easy way to submit a
small number of sequences with minimal annotation to GenBank. To complete the
form, a user is prompted to enter submitter information, the nucleotide sequence,
biological source information, and features and annotation pertinent to the
submission. BankIt has extensive Help [http://www.ncbi.nlm.nih.gov/BankIt/help.html]
documentation to guide the submitter. Included with the Help
document is a set of annotation examples that detail the types of information that
are required for each type of submission. After the information is entered into the
form, BankIt transforms this information into a GenBank flat file for review. In
addition, a number of quality assurance and validation checks ensure that the
sequence submitted to GenBank is of the highest quality. The submitter is asked to
include spans (sequence coordinates) for the coding regions and other features and
to include amino acid sequence for the proteins that derive from these coding
regions. The BankIt validator compares the amino acid sequence provided by the
submitter with the conceptual translation of the coding region based on the
provided spans. If there is a discrepancy, the submitter is requested to fix the
problem, and the process is halted until the error is resolved. To prevent the deposit
of sequences that contain cloning vector sequence, a BLAST similarity search is
performed on the sequence, comparing it to the VecScreen
[http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html] database. If there is a
match to this database, the user is asked to remove the contaminating vector
sequence from their submission or provide an explanation as to why the screen was
positive. Completed forms are saved in ASN.1 format, and the entry is submitted to
the GenBank processing queue. The submitter receives confirmation by email,
indicating that the submission process was successful.
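The comparison between the submitted amino acid sequence and the conceptual translation of the coding region can be sketched in Python as follows. This is only an illustration of the idea, not BankIt's own validator: it assumes the standard genetic code, a single span on the forward strand, and hypothetical function names.

```python
# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_span(nucleotide, start, stop):
    """Translate the 1-based, inclusive span start..stop on the forward strand."""
    cds = nucleotide.upper().replace("U", "T")[start - 1:stop]
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino = CODON_TABLE.get(cds[pos:pos + 3], "X")  # X = unknown codon
        if amino == "*":  # stop codon ends the translation
            break
        protein.append(amino)
    return "".join(protein)


def translation_matches(nucleotide, start, stop, submitted_protein):
    """True if the conceptual translation equals the submitted protein."""
    expected = submitted_protein.upper().rstrip("*")
    return translate_span(nucleotide, start, stop) == expected


# Hypothetical record: the coding region occupies positions 3 to 14.
sequence = "CCATGGCCAAATAGGG"
protein = translate_span(sequence, 3, 14)
```

If the two sequences differ, either the spans or the protein sequence entered by the submitter need to be corrected, which is the discrepancy described above.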


Requirements for GenBank Submissions through ‘BankIt’


Contact Information
Name, address, phone number, fax number and email address of the
submitter must be entered when registering and submitting for the first time.
Subsequent BankIt submissions will retain this information and display it once the
submitter logs in.
Release date information
Immediately after the submission is processed at NCBI, or on a date the
submitter specifies.
Reference information
Sequence authors: names of the researchers who are credited with the
sequence
Publication information:
Unpublished, In-Press, or Published; and applicable citation information
(paper's title, authors, journal title, volume, issue, year, pages)
Submission Category and Type
Original sequencing or Third Party Annotation; single sequence, sequence
set (phylogenetic, population, environmental, etc), or batch
Nucleotide sequence(s)
¾ Input (cut-and-paste) single or multiple sequences OR
¾ Upload them as a FASTA file; FASTA files should include organisms in
their definition lines
¾ Sequences must be at least 200 nucleotides long (unless they are complete
exons, non-coding RNAs (ncRNAs), microsatellites or ancient DNA)
¾ Molecule type: what was sequenced? (genomic DNA, mRNA, genomic
RNA, cRNA, etc)
¾ Topology: linear or circular (circular must be complete, such as a complete
plasmid)
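Some of the nucleotide sequence requirements listed above can be checked before the file is uploaded. The Python sketch below is an illustration only: the function name is hypothetical, and it assumes that the organism is given in the definition line as an [organism=...] modifier. The `exempt` flag stands for the exceptions to the 200-nucleotide rule (complete exons, ncRNAs, microsatellites or ancient DNA).

```python
MIN_LENGTH = 200  # minimum length stated in the requirements above


def check_record(defline, sequence, exempt=False):
    """Return a list of problems found in one FASTA record (empty if none)."""
    problems = []
    if not defline.startswith(">"):
        problems.append("definition line must start with '>'")
    if "[organism=" not in defline:
        problems.append("definition line has no [organism=...] modifier")
    residues = "".join(sequence.split())  # drop line breaks and spaces
    if len(residues) < MIN_LENGTH and not exempt:
        problems.append(f"sequence is {len(residues)} nt; minimum is {MIN_LENGTH}")
    return problems


# Hypothetical records.
good = check_record(">Seq1 [organism=Oryza sativa]", "ACGT" * 50)
bad = check_record(">Seq1", "ACGT")
```

An empty list means the record passed these particular checks; it does not replace the validation performed by BankIt itself.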


Organism name, applicable source modifiers, location


¾ Genus and species names (if not previously provided in FASTA file)
¾ If name is new or unrecognized, provide best known taxonomic lineage
¾ If genus and/or species names are not known, provide most specific name
known (for example: Bacillus sp., Uncultured bacterium, Uncultured
archaeon)
¾ Most complete name for any synthetic vector (for example: Cloning vector
pAB234, Transfer vector p789Abc)
¾ Source modifiers include: strain, clone, isolate, specimen-voucher,
isolation-source, country
¾ Location: organelle (mitochondrion, chloroplast, etc); map and/or
chromosome
Features of the sequence
Upload files or use input forms to add all applicable features (for example:
CDS, gene, rRNA, tRNA, microsatellite, exon, intron)


Exercise 19:
Submitting sequences to GenBank through ‘Sequin’
URL: http://www.ncbi.nlm.nih.gov/Sequin/index.html
Theory:
Sequin is more appropriate for complicated submissions containing a
significant amount of annotation or many sequences. It is a stand-alone application
available on NCBI's FTP [ftp://ftp.ncbi.nih.gov/sequin/] site. Sequin creates
submissions from nucleotide and amino acid sequences in FASTA format with
tagged biological source information in the FASTA definition line. As in BankIt,
Sequin has the ability to predict the spans of coding regions. Alternatively, a
submitter can specify the spans of their coding regions in a five column, tab-
delimited table [http://www.ncbi.nlm.nih.gov/Sequin/table.html] and import that
table into Sequin. For submitting multiple, related sequences, e.g., those in a
phylogenetic or population study, Sequin accepts the output of many popular
multiple sequence-alignment packages, including FASTA+GAP, PHYLIP,
MACAW, NEXUS Interleaved, and NEXUS Contiguous. It also allows users to
annotate features in a single record or a set of records globally.
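The five-column, tab-delimited table mentioned above can be produced with a short script. The following Python sketch is a hedged illustration: the function name is hypothetical, and the layout (a ">Feature" header line, feature lines with start, stop and feature key in the first three columns, and qualifier lines with the qualifier name and value in the fourth and fifth columns) is an assumption that should be checked against the table documentation linked above.

```python
def feature_table(seq_id, features):
    """Build a five-column, tab-delimited feature table as a string.

    `features` is a list of (start, stop, key, qualifiers) tuples, where
    `qualifiers` is a list of (name, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        # Columns 1-3: start, stop and feature key.
        lines.append(f"{start}\t{stop}\t{key}")
        for name, value in qualifiers:
            # Columns 4-5: qualifier name and value (columns 1-3 left empty).
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


# Hypothetical coding region spanning positions 1 to 300 of record Seq1.
table = feature_table("Seq1", [(1, 300, "CDS", [("product", "example protein")])])
```

The resulting text can be saved to a file and imported into Sequin together with the FASTA sequences.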

Procedure for Depositing Sequence by Sequin:


¾ Open any internet browser such as Internet Explorer, Google Chrome etc.
¾ In the address bar type www.ncbi.nlm.nih.gov and press Enter; the NCBI
home page appears.
¾ Click on Submissions: Submit data to GenBank or other NCBI databases.
¾ Click on the GenBank option; the GenBank home page appears.
¾ Click on the Submission Tools option.
¾ Click on the Sequin tool.
¾ Under How to Get Sequin, click on Instruction.
¾ Download the free Sequin software.
¾ Install it on your PC or laptop.


¾ Open the software by double-clicking on its icon; the Welcome page
appears.

¾ Click on Start New Submission button.


¾ Sequin is organized into a series of forms for entering submitting authors,
entering organism and sequences, entering information such as strain, gene,
and protein names, viewing the complete submission, and editing and
annotating the submission.
Author Submission Form

¾ The Sequence Format form asks for the type of submission (single
sequence, segmented sequence, or population, phylogenetic, or mutation
study). For the last three types of submission, which involve comparative
studies on related sequences, the format in which the data will be entered
also can be indicated. The default is FASTA format (or raw sequence), but
various contiguous and interleaved formats (e.g., PHYLIP, NEXUS,
PAUP, and FASTA+GAP) are also supported. These latter formats contain
alignment information, and this is stored in the sequence record.
¾ The Organism and Sequences form asks for the biological data. On the
Organism page, as the user starts to type the scientific name, the list of
frequently used organisms scrolls automatically. (Sequin holds
information on the top 800 organisms present in GenBank.) Thus, after
typing a few letters, the user can fill in the rest of the organism name by
clicking on the appropriate item in the list. Sequin now knows the scientific
name, common name, GenBank division, taxonomic lineage, and, most
importantly, the genetic code to use. (For mitochondrial genes, there is a
control to indicate that the alternative genetic code should be used.) For
organisms not on the list, it may be necessary to set the genetic code
control manually. Sequin uses the standard code as the default. The
remainder of the Organism and Sequences form differs depending on the
type of submission.
Organism and Sequences Form

¾ The goal is to go quickly from raw sequence data to an assembled record
that can be viewed, edited, and submitted to your database of choice.


¾ Advance through the pages that make up each form by clicking on labelled
folder tabs or the Next Page button. After the basic information forms have
been completed and the sequence data imported, Sequin provides a
complete view of your submission, in your choice of text or graphic
format.
¾ At this point, any of the information fields can be easily modified by
double-clicking on any area of the record, and additional biological
annotations can be entered by selecting from a menu.
¾ Sequin has an on-screen Help file that is opened automatically when you
start the program.
¾ Because it is context sensitive, the Help text will change and follow your
steps as you progress through the program. A "Find" function is also
provided.
¾ Sending the Submission - A finished submission can be saved to disk and
E-mailed to one of the databases. It is also a good practice to save
frequently throughout the Sequin session, to make sure nothing is
inadvertently lost. The list at the end of this chapter provides E-mail
addresses and contact information for the three databases.


Exercise 20
Primer designing
URL: http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi
Theory:
Oligonucleotides, also referred to as primers, are short single strands of
nucleic acid (DNA or RNA) that are synthesized in order to bind to a
complementary strand. Primers have a target area where they bind and act as the
starting point for polymerase to extend from, and thus determine what segment of
DNA gets amplified. DNA consists of a double stranded helix. One strand of the
DNA is named the “sense” strand and the other strand is the “anti-sense” strand.
These two DNA strands are complements of each other. During PCR, the
denaturing step will break the hydrogen bonds, separating the two strands. This
allows the primers to anneal to the target region on the DNA during the annealing
step. One primer is designed to anneal to the sense strand and the other primer
needs to bind to the anti-sense strand.
When designing primers for PCR it is necessary to take into consideration
things like: how many primers are needed, the length of the primer, the 5’ and
3’ ends, the mutation location in the primer, the primer melting/annealing
temperature, the G-C content, “primer dimer” formation and the distance between
the forward and reverse primers.
Length
The length of the primers needs to be between 15 and 30 bases so that they
are long enough for adequate specificity and short enough to anneal easily to the
DNA template.
The 5’ and 3’ Ends
The primers need to be designed so that the 3’ end of the forward primer will
extend toward the reverse primer. The 3’ end of the reverse primer must also
extend toward the forward primer. The 3’ ends of the forward and reverse primers
should be facing each other from opposite DNA strands. This will facilitate the
continued replication of the desired strand of DNA. If, for instance, the 3’ ends do
not elongate in opposite directions (i.e., toward each other) replication will not
work and a PCR product will not be obtained.
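The orientation rule can be illustrated with a minimal Python sketch. It assumes that the template is written 5’ to 3’ and that the primers are simply taken from the two ends of the target region; the function names and the default primer length of 20 bases are hypothetical choices, not part of any primer design program.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def end_primers(template, length=20):
    """Pick primers from the two ends of a template given 5' to 3'.

    The forward primer copies the start of the given strand; the reverse
    primer is the reverse complement of its end, so that both 3' ends
    point toward each other.
    """
    forward = template[:length].upper()
    reverse = reverse_complement(template[-length:])
    return forward, reverse


# Hypothetical short template, with 4-base primers for readability.
forward, reverse = end_primers("ATGCCCGGGTTTAAA", length=4)
```

Taking the reverse complement is what places the reverse primer on the opposite strand, facing the forward primer.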
Primer Melting Temperature
The Primer Melting Temperature (Tm) is important for the annealing phase
of PCR. The preferred Tm is between 50°C and 65°C. The melting temperatures of
the forward and reverse primers should differ by no more than 2°C. The Tm can be
estimated as
Tm = 4°C x (number of G’s + C’s in the primer) + 2°C x (number of A’s + T’s in the primer).
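The Tm formula and the 2°C rule can be written as a short Python sketch. The function names are hypothetical, and the formula is the simple estimate given above, not the more detailed calculation used by primer design programs.

```python
def wallace_tm(primer):
    """Estimate Tm in degrees C as 4 x (G + C) + 2 x (A + T)."""
    p = primer.upper()
    gc = p.count("G") + p.count("C")
    at = p.count("A") + p.count("T")
    return 4 * gc + 2 * at


def tm_compatible(forward, reverse, max_diff=2):
    """True if the two primer Tm values differ by no more than max_diff."""
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_diff


# Hypothetical 20-base primer with 10 G/C and 10 A/T bases.
tm = wallace_tm("ATGCATGCATGCATGCATGC")
```

For the example primer, 4 x 10 + 2 x 10 gives an estimated Tm of 60°C, which falls inside the preferred range.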
The G-C content
The G-C content of the primer sequence should be relatively high, as it has a
direct relationship with the Tm. There should be a base composition of G-C of
about 50%-60%. The 3’ end of the primer should finish with at least one G or C to
promote efficiency in annealing due to the stronger bonding of G-C pairs.
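Both conditions can be checked with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes a non-empty primer written 5’ to 3’ and uses the 50%-60% range given above.

```python
def gc_percent(primer):
    """Percentage of G and C bases in a non-empty primer."""
    p = primer.upper()
    return 100.0 * (p.count("G") + p.count("C")) / len(p)


def has_gc_clamp(primer):
    """True if the 3' (last) base of the primer is G or C."""
    return primer.upper()[-1] in ("G", "C")


def gc_ok(primer, low=50.0, high=60.0):
    """True if the G-C content is within range and the 3' end is G or C."""
    return low <= gc_percent(primer) <= high and has_gc_clamp(primer)


# Hypothetical 10-base primer with 6 G/C bases, ending in C.
content = gc_percent("ATGCATGCGC")
```

The example primer has a G-C content of 60% and ends in C, so it satisfies both conditions.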
Distance between the Forward and Reverse Primers
The forward primer and the reverse primer should be between 300 and 2,000
base pairs apart.
Beware of “Primer Dimer”
Primer Dimer is an artefact of PCR where primers bind to each other or to
themselves instead of the template DNA and thus act as their own template to
make a small PCR product that appears faintly on an electrophoresis gel. To avoid
“primer dimers”, be sure there are not many complementary areas in the base
sequence of your forward and reverse primers where the primer strands would be
able to bind to each other instead of the gene.
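A very simple screen for complementary areas is sketched below in Python. It only asks whether the last few bases at the 3’ end of one primer could pair with any stretch of the other primer; the function names and the window of 4 bases are assumptions for illustration, and real design programs use more detailed scoring.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def three_prime_dimer(primer_a, primer_b, k=4):
    """True if the last k bases of either primer can pair with the other.

    The 3' tail of one primer pairs with the other primer when the
    reverse complement of the tail occurs in the other primer.
    """
    a, b = primer_a.upper(), primer_b.upper()
    return reverse_complement(a[-k:]) in b or reverse_complement(b[-k:]) in a


def self_dimer(primer, k=4):
    """True if a primer's 3' tail can pair with another copy of itself."""
    return three_prime_dimer(primer, primer, k)


# Hypothetical primers: the CCCC tail of the first pairs with GGGG in the second.
risky = three_prime_dimer("AAAAAAAACCCC", "TTTTGGGGTTTT")
```

A pair flagged by this check deserves a closer look, but a clean result does not guarantee that no primer dimer will form.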
Things to Avoid
¾ To avoid non-specific binding, design the primers with high annealing
temperatures.
¾ To make sure the designed primers will bind only to the target area, submit
the primer sequences to the BLAST website.
¾ The MgCl2 and pH conditions can also be adjusted for improved amplified
product.
¾ Watch out for runs of a single base (G’s, C’s, A’s or T’s) when
developing primers because they can allow mis-priming.
¾ Keep in mind that the more nucleotide bases a primer is made up of,
the more expensive it is. The shorter the primers are, the less
specificity they have in PCR.
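Runs of a single base can be found with a short Python sketch. The function names are hypothetical, and the limit of 4 identical bases in a row is an assumption for illustration, since the text above does not give a number.

```python
import re


def longest_run(primer):
    """Length of the longest run of one repeated base in the primer."""
    runs = re.finditer(r"A+|C+|G+|T+", primer.upper())
    return max((len(m.group(0)) for m in runs), default=0)


def has_long_run(primer, max_run=4):
    """True if any single base is repeated more than max_run times in a row."""
    return longest_run(primer) > max_run


# Hypothetical primer containing five G's in a row.
run = longest_run("ATGGGGGCAT")
```

A primer flagged here can usually be fixed by shifting the primer position by a few bases along the template.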
Resources for General Purpose PCR Primer Design
¾ Primer3
¾ Primer3Plus
¾ PrimerZ
¾ PerlPrimer
Aim: Primer design on the Web using Primer3 for the STAR-1 gene in rice
URL: http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi

Procedure
¾ Collect the sequence for which primers have to be designed, in FASTA
format, from the NCBI home page.
¾ Open the source web site:
http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi
¾ Paste the sequence in FASTA format in the input box on the home page of
the website.
¾ Keep the default settings and click ‘Pick Primers’ to get the result.
