All content following this page was uploaded by Kailash Chandra Samal on 31 August 2016.
Printed at:
Bhubaneswari Traders, Nayapalli, Bhubaneswar
Preface
The idea of writing a Bioinformatics Practical Manual originated from
our experience of teaching biotechnology and bioinformatics at Orissa University
of Agriculture and Technology, Bhubaneswar, Odisha. The students needed a
write-up that was comprehensive enough to cover all major aspects of the field,
technical enough and sufficiently up to date to include the most current developments,
while at the same time being logical and easy to understand. The students' interest
motivated us to write this bioinformatics manual to meet that need. It is
written specifically for biotechnology and bioinformatics students and explains the
basics of bioinformatics. All key areas of bioinformatics are covered,
including biological databases, sequence alignment, gene and promoter prediction,
molecular phylogenetics, structural bioinformatics, genomics, and proteomics. The
manual emphasizes the practical aspects of bioinformatics. Efforts have
been made to include all essential aspects of Bioinformatics. It is hoped that this
publication will help teachers, students and technicians to keep their
practical knowledge of the various dimensions of Bioinformatics up to date.
We are grateful to Prof. Manoranjan Kar, Vice Chancellor for his
encouragement and valuable guidance in bringing out this publication. The
constant encouragement and guidance of Prof. B.K. Mishra, Dean, College of
Agriculture for preparation of this publication is duly acknowledged. The help of
the ICAR in granting financial assistance for bringing out this publication is
gratefully acknowledged.
Message
Economic growth and development in India continue to be propelled by
growth in agriculture and allied sectors. Such growth can only be sustained through
technological advancement and competent human resources to serve the needs of
farmers. Today, agricultural production through most conventional science and
technology innovations has reached a plateau, and there is a need to break
it. Thus, to put the country’s agricultural growth on a fast track, the
development of cutting-edge technologies such as Biotechnology and
Bioinformatics is the need of the hour. Biotechnology is based on techniques
involving genes, genomes, nucleic acids and other related macro- and micro-
biomolecules. Bioinformatics applies computer-based information technology to the
storage, retrieval and analysis of the vast databases being generated on genes,
genomes and nucleic acids.
I am delighted that the Department of Agricultural Biotechnology, College
of Agriculture, OUAT is going to publish “Bioinformatics Practical Manual” for
UG, PG and Ph. D. students of agriculture. This publication will strengthen the
knowledge of students, researchers and faculty members on various techniques in
the areas of bioinformatics. I am confident that this manual will be very helpful for
the students, researchers and faculty members.
Content
Chapter / Exercise No.   Particulars
Chapter 1 Biological background
Chapter 2 Scope and application of bioinformatics
Chapter 3 Databases and its structure
Chapter 4 Biological database
Chapter 5 Database retrieval system
Chapter 6 Cataloging biological database
Chapter 7 Pairwise Sequence Alignment
Chapter 8 Multiple sequence alignment
Chapter 9 Practical exercises
Exercise 1. Making search for the scientific literature and sequences
Exercise 2. Characterization of a known Gene
Exercise 3. Finding out open reading frames (ORF)
Exercise 4. Translating an unknown DNA Sequence
Exercise 5. Identifying a gene using BLAST program
Exercise 6. Finding Domains in Protein Sequences
Exercise 7. Nucleotide BLAST (BLASTn)
Exercise 8. Protein BLAST (BLASTp)
Exercise 9. Translated BLAST (BLASTx)
Exercise 10. tBLASTx
Exercise 11. Position-Specific Iterated BLAST (PSI-BLAST)
Exercise 12. FASTA
Exercise 13. Editing and analyzing multiple sequence alignment using Jalview
Exercise 14. Making multiple alignment with T-Coffee
Exercise 15. Online Mendelian Inheritance in Man (OMIM)
Exercise 16. Protein Structure Database
Exercise 17. Depositing sequences in database
Exercise 18. Submitting sequences to GenBank through BankIt
Exercise 19. Submitting sequences to GenBank through ‘Sequin’
Exercise 20. Primer Designing
Bioinformatics Practical Manual K. C. Samal et al.
Chapter 1
Biological background
Bioinformatics is a tool for providing insight into the structures and
functions of biomolecules: DNA, RNA and proteins. In particular, bioinformatics
deals with the task of understanding the information chemically encoded into life that
controls the structural processes ongoing in all living organisms. Bioinformatics is
usually concerned with applying statistical and computational methods to analyze
biological data obtained from wet lab experiments, sequencing projects or the
simulation of protein-protein interactions, and with how this can help us understand
the evolution of organisms and biological processes. It also provides insight into
the Central Dogma of Molecular Biology, which characterizes the mappings
between different types of biopolymer (DNA, RNA and protein). Strictly speaking,
the Central Dogma is a list of the usual transitions between different biomolecules
within an organism. The theory classifies three styles of ‘maps’ between
biopolymers as follows:
(i) General transfers:
(a) DNA to DNA (DNA replication)
(b) DNA to RNA (transcription)
(c) RNA to protein (translation)
(ii) Special transfers:
(a) RNA to RNA (RNA replication)
(b) RNA to DNA (reverse transcription)
(c) DNA to protein (direct translation)
(iii) Unknown transfers:
(a) Protein to RNA
(b) Protein to DNA
(c) Protein to protein
General transfers are those that happen continuously in organisms, whereas
special transfers happen rarely and often only in special situations. No unknown
transfers are recorded to have happened, although prions, which can manipulate
proteins, may be considered by some to effect protein to protein transfers. This
introduction concentrates on the path from DNA to
RNA to protein, as the study of this particular process yields the most practical
applications in areas such as gene therapy. It is important to understand the nature
and function of the biopolymers themselves and also the mechanisms connecting
them, and that will be the aim of this introduction.
Biopolymer: DNA
DNA (deoxyribonucleic acid) is a linear biopolymer shaped as a double helix
whose constituents are two antiparallel strands of nucleotides. There
are four types of nucleotides in DNA and they correspond to the letters A (for
adenine), T (thymine), C (cytosine) and G (guanine). DNA is usually represented
by sequences of these four nucleotides. An A on one strand always pairs with a T on the
opposite strand through two hydrogen bonds, while a C always pairs with a G through
three hydrogen bonds, as these nitrogenous bases are complementary to each other. The two
strands are therefore complementary to each other: one strand runs in the 5’ to 3’
direction while the other runs in the 3’ to 5’ direction. The sequential arrangement of the
individual nucleotides is responsible for giving uniqueness to any individual living form, be it
humans, animals, plants, or microbes.
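The pairing rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are our own choices, not part of any standard library.

```python
# Illustrative sketch of Watson-Crick base pairing (names are hypothetical).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base complement (A<->T, C<->G)."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return complement(strand)[::-1]


def hydrogen_bonds(strand):
    """Count hydrogen bonds in the duplex: 2 per A-T pair, 3 per C-G pair."""
    return sum(2 if base in "AT" else 3 for base in strand.upper())


print(reverse_complement("ATGCCG"))
print(hydrogen_bonds("ATGCCG"))
```

Because the two strands run in opposite directions, the complement has to be reversed before it can be read 5’ to 3’, which is why `reverse_complement` rather than `complement` gives the sequence of the opposite strand as it is normally written.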
Biopolymer: RNA
In contrast to DNA, the transcribed RNA produced in an organism, though it is
derived from DNA and is structurally similar (although not a double helix), varies
with time and with environmental factors such as
intracellular chemical gradients. The theory of how these extraneous factors affect
the derivation of RNA from DNA is of specific importance to bioinformatics
projects. RNA preserves the information stored in DNA, as the nucleotides present
in RNA ‘complement’ the nucleotides of DNA, except that adenine is now
complemented with uracil.
Biopolymer: Protein
Proteins are the active agents that govern the metabolic, structural and
signaling processes at work in an individual organism. The translational map
creating protein from mRNA (messenger RNA, the specific type involved in RNA
to protein translation) is largely determined by the underlying mRNA: in fact,
each ‘codon’ (a sequence of three RNA nucleotides) corresponds exactly to an
amino acid (or to a start codon, stop codon or an untranslated triplet), and amino
acids are the constituent building blocks of proteins.
Chromosomes and Genes
Each chromosome is a long piece of DNA. Humans have 46 chromosomes (2
sets of 23, one set from each parent), and the human genome contains about 3.12
billion nucleotides (bases). Genes are just regions on that DNA: contiguous stretches
of a DNA strand that serve as templates for producing proteins. Genes can appear on either
of the two DNA strands. The complete genetic material of a given organism, including
all of its genes, is called the genome of that organism. The function of the DNA material
between genes is largely unknown. Certain intergenic regions of DNA (called noncoding)
are known to play
a major role in cell regulation, the process that controls the production of proteins
and their possible interactions with DNA. Proteins are produced from DNA using
three operations or transformations called transcription, splicing, and translation.
DNA is capable of replicating itself. The cell machinery that performs that task is
called DNA polymerase. Biologists call the capability of DNA for replication and
for undergoing the above three (or two) transformations the central dogma. Genes are
transcribed into pre-RNA by a complex ensemble of molecules called RNA
polymerase. During transcription the nucleotide T (thymine) is substituted by
another one designated by the letter U (for uracil). Pre-RNA can be represented by
alternations of sequence segments called exons and introns. The exons represent
the parts of pre-RNA that will be expressed, that is, translated into proteins. Next
comes the operation called splicing; an ensemble of proteins called the spliceosome
performs it. Splicing consists of concatenating the exons and excising the introns to
form what is known as mRNA, or simply RNA. The final phase, called translation,
is essentially a “table look-up” performed by complex molecules called ribosomes
(an ensemble of RNA and proteins). Translation repeatedly considers a triplet of
consecutive nucleotides in RNA and produces one corresponding amino acid. The
triplet is called a codon. In RNA, there is one special codon called a start codon
and a few others called the stop codons. An open reading frame (ORF) is a
sequence of codons starting with a start codon and ending with a stop codon. The
ORF is thus the sequence of nucleotides that is used by the ribosome to produce the
sequence of amino acids that makes up a protein. There are basically 20 amino
acids but, in certain rare situations, others can be added to that list. Since there are
64 different codons and 20 amino acids, the “table look-up” for translating each
codon into an amino acid is redundant in the sense that multiple codons can
produce the same amino acid. The “table” used by nature to perform translation is
called the genetic code. Due to the redundancy of the genetic code, certain
nucleotide changes in DNA may not alter the resulting protein. Once a protein is
produced, it folds (most of the time) into a unique structure in 3D space. In the 3D
representation of a protein, one can distinguish three different types of components:
α-helices, β-sheets and coils. The secondary structure of a protein is its sequence of
amino acids, annotated to distinguish the boundaries of each component: helices,
sheets, and coils. The tertiary structure of a protein is its 3D representation. The
function of a protein is the way it participates with other proteins and molecules in
keeping the cell alive and interacting with its environment. Function is closely
related to tertiary structure. In functional genomics, one studies the function of all
To find a gene, the first job is to find long ORFs, examining the longest
ORFs first and putting together a set with minimal overlaps. It is also necessary to
identify potential start codons, with the furthest upstream start codon as the easiest
choice. Then, how do we know that the ORF contains a real gene? The most
definitive way is to match it with a gene known from other species: conservation of
a sequence between species strongly suggests that the sequence has a function that
is being conserved by natural selection. We compare protein sequences, not DNA,
because protein is more conserved in evolution than DNA. The organism’s survival
depends on the protein being functional, which means having the proper amino
acid sequence. Since the genetic code is degenerate, many different DNA
sequences will give identical proteins. The protein 3-dimensional structure is even
more conserved, because it is more closely related to enzyme activity than the
amino acid sequence is. However, we don’t have good ways of determining 3-D
structure from a DNA sequence.
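The first step described above, collecting long ORFs and preferring the furthest upstream start codon, might be sketched as follows. This is a simplified illustration under stated assumptions: it scans only the three reading frames of the given strand, uses ATG as the only start codon, and the function name and `min_codons` parameter are our own.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) tuples for ORFs, longest first.

    An ORF here starts at an ATG and runs to the first in-frame stop codon.
    For each stop codon only the furthest upstream ATG is used as the start.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # furthest upstream start codon in this ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return sorted(orfs, key=lambda orf: len(orf[2]), reverse=True)


for start, end, seq in find_orfs("CCATGAAATGGTAGCCATGTAA"):
    print(start, end, seq)
```

To cover both strands, the same function could be run a second time on the reverse complement of the sequence. Deciding whether a reported ORF is a real gene still requires the comparison with known proteins discussed above.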
Genetic Code
Proteins are long chains of amino acids. There are 20 different amino acids coded
in DNA. There are only 4 DNA bases, so at least 3 bases are needed to code for the
20 amino acids: 2 bases would give only 4 x 4 = 16 combinations, whereas 3 bases
give 4 x 4 x 4 = 64 possible 3-base combinations (codons). Each codon, apart from
the three stop codons, codes for one
amino acid. Most amino acids have more than one possible codon. Genes start at a
start codon and end at a stop codon. Three codons are stop codons, and all genes end at
one of them. Start codons are a bit trickier, since they are used in the middle of
genes as well as at the beginning. In eukaryotes, ATG is almost always the start codon,
making methionine (Met) the first amino acid in nearly all proteins (though in many proteins
it is immediately removed). In prokaryotes, ATG, GTG, or TTG can be used as a
start codon.
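The counting argument above (64 codons, 20 amino acids, three stop codons) can be checked with a short Python sketch. The table is the standard genetic code written as a compact 64-letter string in T, C, A, G order; the variable names are our own.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map every 3-base combination to its amino acid (or stop signal).
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each amino acid (the redundancy of the code).
degeneracy = Counter(CODON_TABLE.values())
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(len(CODON_TABLE), "codons")
print("stop codons:", stop_codons)
print("codons per amino acid:", dict(degeneracy))
```

The counts show the redundancy directly: leucine, serine and arginine each have six codons, while methionine and tryptophan have only one.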
Gene Expression
How do you get a protein from a gene? It is a two-step process (called the
Central Dogma of Molecular Biology). First, the gene has to be copied
(transcribed) into an RNA form. The RNA copy (messenger RNA) is exactly like
the gene itself, except that RNA replaces T with U. Second, the RNA is translated into protein
by ribosomes, which are complex RNA/protein hybrid machines, with the help of
transfer RNA molecules, which have one end that matches the 3-base codon and
another end that is attached to the proper amino acid. The ribosome starts at the start
codon and moves down the messenger RNA, adding one amino acid at a time to the
growing chain. When the ribosome reaches a stop codon, it falls off, releasing the
new protein.
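The two steps can be imitated in Python as a rough sketch. It assumes the input is the coding strand of an intron-free gene, uses the standard genetic code, and ignores everything a real cell does beyond the table look-up; the function names are our own.

```python
from itertools import product

# Standard genetic code for RNA codons (U instead of T); '*' marks a stop.
RNA_BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(RNA_BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Step 1: copy the gene into mRNA, replacing T with U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Step 2: start at the first AUG and add one amino acid per codon
    until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # the ribosome falls off at the stop codon
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("GGATGGCCTTTAAATGACC")
print(mrna)
print(translate(mrna))
```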
Transcription (Nucleus):
• In the nucleus, an enzyme called DNA helicase causes the twisted DNA
molecule to unwind.
• One strand of the DNA is used as the template strand for RNA synthesis.
• RNA polymerase begins synthesizing RNA from the DNA template at the
promoter sequence (a sequence that lets the RNA polymerase know where
to begin).
• The synthesized RNA is called mRNA (messenger RNA); it leaves
the nucleus and goes to the cytoplasm.
Translation (Cytoplasm):
• In the cytoplasm, the ribosome, which is built from rRNA (ribosomal RNA) and
consists of a small and a large subunit, comes together to provide a site for
translation to occur.
• tRNA (transfer RNA) is the RNA responsible for picking up the amino
acid that should be added to the chain next.
• mRNA, rRNA, and tRNA all come together to perform translation.
• mRNA codes for a specific amino acid, tRNA retrieves that amino acid,
and rRNA provides a surface for this to occur.
• When tRNA brings back the correct amino acid, a polypeptide chain is
started.
• One amino acid is added at a time, and they are connected with peptide
bonds.
• When the chain is finished, a protein is formed.
Genetic marker
A DNA polymorphism that can be easily detected by molecular or
biochemical analysis. The marker can be within a gene or in DNA with no known
function. Because DNA segments that lie near each other on a chromosome tend to
be inherited together, markers are often used as indirect ways of tracking the
inheritance pattern of a gene that has not yet been identified, but whose
approximate location is known.
Primer
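A short, single-stranded oligonucleotide, typically about 18-25 nucleotides long
(10-mers are used in some techniques such as RAPD), that provides the starting point
for DNA synthesis in a polymerase chain reaction (PCR).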
PCR
The development of the polymerase chain reaction (PCR) was a
technological breakthrough by Kary Mullis in 1985. The principle of PCR is very
simple. It is based on the function of a copying
enzyme, Taq DNA polymerase (obtained from the bacterium Thermus aquaticus, a
microbial inhabitant of hot springs), which is able to synthesize a duplicate molecule of
DNA from a DNA template that is bracketed by the primers. The product of
duplication of the original template DNA becomes a second template for another
round of duplication. Repeated duplications thus lead to an exponential increase in
DNA product accumulation. Even when starting from a single DNA molecule,
detectable amounts of target DNA are generated by PCR in a few hours. The DNA
polymerase was first isolated from Thermus aquaticus in 1976. In 1989 Science
magazine named Taq polymerase its first "Molecule of the Year". In 1993, Dr.
Mullis was awarded the Nobel Prize for his work on PCR.
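The exponential increase can be made concrete with a small calculation. The sketch below assumes an idealized reaction in which every cycle multiplies the number of copies by (1 + efficiency); the `efficiency` parameter is our own simplification, and real reactions eventually level off.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized number of copies after a given number of PCR cycles.

    efficiency = 1.0 means perfect doubling in every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# A single template molecule, perfect doubling.
for cycles in (10, 20, 30):
    print(cycles, "cycles:", int(pcr_copies(1, cycles)), "copies")
```

With perfect doubling, a single molecule gives 2^30, or roughly a thousand million, copies after 30 cycles, which is why a detectable amount of DNA can be produced in a few hours.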
DNA fingerprinting
DNA fingerprinting is a technique used by scientists to distinguish between
individuals of the same species using only samples of their DNA. It is a technique by which an
individual can be identified at the molecular level. With the advancement of science
and technology, VNTR (Variable Number of Tandem Repeats) and STR (Short
Tandem Repeats) analysis has become very popular in forensic laboratories. The
process of DNA fingerprinting was invented by Alec Jeffreys at the University of
Leicester, England, in 1985; he was knighted in 1994.
Scientists have chosen repeating sequences in the DNA, which are present
in all individuals on different chromosomes, and are known to vary from individual
to individual. These are used as genetic markers to identify the individual. The DNA
fingerprinting technique has been successfully used for the identification of plant
species or cultivars, the testing of seed purity, and the detection of adulteration in food, seed
and other planting material. This technique is also used to resolve disputes of maternity
or paternity, to identify cultivars or breeding material, in wildlife forensics, and in the
protection of farmers’ rights and biodiversity. This remarkable technology provides
positive identification with virtually 100% precision.
The DNA profile of an individual is unique. It can never be identical even in
biologically related individuals, except for identical (monozygotic) twins. The
chance of two people having exactly the same DNA profile is 1 in 30,000 million
(except for identical twins).
Any biological material can be used as a sample: leaf, seed, or other plant parts in the
case of plants, and a drop of blood, saliva, semen, or any body part such as bones, tissue,
skull, teeth, or hair with root in the case of animals and human beings.
Molecular markers
A molecular marker in the life sciences is defined as a DNA sequence, a
cytogenetic segment, a chromosome fragment, a protein or an enzyme used as
Chapter 2
Scope and application of bioinformatics
Bioinformatics is the field of science in which biology, computer science
and information technology merge to form a single discipline. It is the collection,
organization, analysis, presentation and sharing of biological data to solve
biological problems on the molecular level. It is an interdisciplinary scientific field
that develops methods for storing, retrieving, organizing and analyzing biological
data. A major activity in bioinformatics is to develop software tools to generate
useful biological knowledge. Bioinformatics uses many areas of computer science,
statistics, mathematics and engineering to process biological data. Databases and
information systems are used to store and organize biological data. Analyzing
biological data may involve algorithms in artificial intelligence, soft computing,
data mining, image processing, and simulation. The algorithms in turn depend on
theoretical foundations such as discrete mathematics, control theory, system theory,
information theory, and statistics. Commonly used software tools and technologies
in the field include Java, C#, XML, Perl, C, C++, Python, R, SQL, CUDA,
MATLAB, and spreadsheet applications.
The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of
information processes in biotic systems. The National Center for Biotechnology
Information (NCBI, 2001) defines bioinformatics as follows: “Bioinformatics is the field of
science in which biology, computer science and information technology merge to
form a single discipline. There are three important sub-disciplines within
bioinformatics: the development of new algorithms and statistics with which to
assess relationships among members of large data sets; the analysis and interpretation of
various types of data including nucleotide and amino acid sequences, protein
domains and protein structures; and the development and implementation of tools that
enable efficient access and management of different types of information”.
Bioinformatics is a scientific discipline that has emerged in response to the
accelerating demand for a flexible and intelligent means of storing, managing and
querying large and complex biological data sets. The ultimate aim of
engineering company, Genentech, was founded. By 1981, 579 genes had been
mapped and mapping by in situ hybridization had become a standard method for
automated DNA sequencing. In 1988, the Human Genome Organization (HUGO)
was founded. This is an international organization of scientists involved in the Human
Genome Project. The Human Genome Project itself was started in 1990. By 1991, a
total of 1,879 human genes had been mapped. In 1993, Genethon, a human
genome research center in France, produced a physical map of the human genome.
In 1995, the first complete genome sequence of a free-living organism, the bacterium
Haemophilus influenzae, was published. In 1996, Genethon published the final version
of the human genetic map
which included data from patients, preclinical and clinical trials and metabolic
pathways of numerous species.
Challenges:
The greatest challenge facing the molecular biology community today is to
make sense of the wealth of data that has been produced by the genome sequencing
projects. Cells have a central core called the nucleus, which is the storehouse of an
important molecule, DNA, which makes up the genome. Genes are specific regions of the genome
(about 1%) spread throughout it, sometimes contiguous, many times non-
contiguous. RNA similarly contains information; its major purpose is to copy
information from DNA selectively and to bring it out of the nucleus for use.
Proteins are made of amino acids, of which there are twenty. The Central Dogma of
molecular biology states that a gene, a region of the
DNA in the nucleus of the cell, is copied into RNA, and the RNA travels to protein
production sites and is translated into protein.
Difference between bioinformatics and computational biology
Both bioinformatics and computational biology combine computers and biology.
Biologists who specialize in the use of computational tools and systems to answer problems
of biology are bioinformaticians. Computer scientists, mathematicians, statisticians
and engineers who specialize in developing theories, algorithms and techniques for
such tools and systems are computational biologists. The actual process of
Bioinformatics in India
According to a recent study, India will be a potential star in the field of
bioscience in the coming years, considering factors like biodiversity,
human resources, infrastructure facilities and government initiatives.
Bioinformatics has emerged out of inputs from several different
areas such as biology, biochemistry, biophysics, molecular biology, biostatistics and
computer science. Specially designed algorithms and organized databases are the core
of all informatics operations. The requirements for such an activity make heavy and
high-level demands on both hardware and software capabilities. This sector is
the quickest growing field in the country. The vertical growth is because of the
linkage between IT and biotechnology, spurred by the Human Genome Project.
Promising startups are already there in Bangalore, Hyderabad, Pune, Chennai and
Delhi. There are over 200 companies functioning in these places. IT majors such
as Intel, IBM and Wipro are getting into this segment, spurred by the promise of
technological developments.
Limitations
Having recognized the power of bioinformatics, it is also important to
realize its limitations and avoid over-reliance on and over-expectation of
bioinformatics output. In fact, bioinformatics has a number of inherent limitations.
In many ways, the role of bioinformatics in genomics and molecular biology
Chapter 3
Databases and its structure
One of the hallmarks of modern genomic research is the generation of
enormous amounts of raw sequence data. As the volume of genomic data grows,
sophisticated computational methodologies are required to manage the data deluge.
Thus, the very first challenge in the genomics era is to store and handle the
staggering volume of information through the establishment and use of computer
databases. The development of databases to handle the vast amount of molecular
biological data is thus a fundamental task of bioinformatics. This chapter
introduces some basic concepts related to development and management of
databases. Biological databases are libraries of life sciences information, collected
from scientific experiments, published literature, high-throughput experiment
technology, and computational analyses. They contain information from research
areas including genomics, proteomics, metabolomics, microarray gene expression,
and phylogenetics. Information contained in biological databases includes gene
function, structure, localization (both cellular and chromosomal), clinical effects of
mutations as well as similarities of biological sequences and structures.
What is a database?
A database is a computerized archive used to store and organize data in
such a way that information can be retrieved easily via a variety of search criteria.
Databases are composed of computer hardware and software for data management.
The chief objective of the development of a database is to organize data in a set of
structured records to enable easy retrieval of information. Each record, also called
an entry, should contain a number of fields that hold the actual data items, for
example, fields for names, phone numbers, addresses, dates. To retrieve a particular
record from the database, a user can specify a particular piece of information,
called value, to be found in a particular field and expect the computer to retrieve
the whole data record. This process is called making a query. Although data
retrieval is the main purpose of all databases, biological databases often have a
higher level of requirement, known as knowledge discovery, which refers to the
identification of connections between pieces of information that were not known
when the information was first entered. For example, databases containing raw
sequence information can perform extra computational tasks to identify sequence
homology or conserved motifs. These features facilitate the discovery of new
biological insights from raw data.
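The vocabulary introduced above (record, field, value, query) can be illustrated with a toy example in Python. The records, accession numbers and field names below are invented for illustration only and do not come from any real database.

```python
# Hypothetical sequence records; every value here is made up.
RECORDS = [
    {"accession": "X0001", "organism": "Oryza sativa", "length": 1250},
    {"accession": "X0002", "organism": "Zea mays", "length": 980},
    {"accession": "X0003", "organism": "Oryza sativa", "length": 430},
]


def query(records, field, value):
    """Return every whole record whose `field` holds exactly `value`."""
    return [record for record in records if record.get(field) == value]


hits = query(RECORDS, "organism", "Oryza sativa")
print([record["accession"] for record in hits])
```

Each dictionary plays the role of one entry, each key is a field, and the call to `query` specifies a value to be found in a particular field and gets the whole record back.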
Organization of databases:
Databases can be constructed either as flat files, relational, or object-
oriented. Flat files are simple text files and lack any form of organization to
facilitate information retrieval by computers. Relational databases organize data as
tables and search information among tables with shared features. Object-oriented
databases organize data as objects and associate the objects according to
hierarchical relationships.
(a) Flat file database:
Originally, databases all used a flat file format, which is a long text file that
contains many entries separated by a delimiter, a special character such as a vertical
bar (|). Within each entry are a number of fields separated by tabs or commas.
Except for the raw values in each field, the entire text file does not contain any
hidden instructions for computers to search for specific information or to create
reports based on certain fields from each record. The text file can be considered a
single table. Thus, to search a flat file for a particular piece of information, a
computer has to read through the entire file, an obviously inefficient process. This
is manageable for a small database, but as database size increases or data types
become more complex, this database style can become very difficult for
information retrieval. Indeed, searches through such files often cause crashes of the
entire computer system because of the memory-intensive nature of the operation.
To facilitate the access and retrieval of data, sophisticated computer software
programs for organizing, searching, and accessing data have been developed. They
are called database management systems. These systems contain not only raw data
records but also operational instructions to help identify hidden connections among
data records. The purpose of establishing a data structure is for easy execution of
the searches and to combine different records to form final search reports.
Depending on the types of data structures, these database management systems can
be classified into two types: relational database management systems and object-
oriented database management systems. Consequently, databases employing these
management systems are known as relational databases or object-oriented
databases, respectively.
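Why a flat file search has to read through everything can be seen in a small Python sketch. The file content, the choice of a vertical bar between entries and commas between fields, and the function name are all assumptions made for illustration.

```python
# A tiny hypothetical flat file: entries separated by '|', fields by ','.
FLAT_FILE = "1,Student A,Maharashtra|2,Student B,Odisha|3,Student C,Maharashtra"


def search_flat_file(text, field_index, value, entry_sep="|", field_sep=","):
    """Scan every entry in turn and return (matching entries, entries read)."""
    hits = []
    entries_read = 0
    for entry in text.split(entry_sep):
        entries_read += 1  # nothing tells us where to look, so read them all
        fields = entry.split(field_sep)
        if fields[field_index] == value:
            hits.append(fields)
    return hits, entries_read


hits, entries_read = search_flat_file(FLAT_FILE, 2, "Maharashtra")
print(hits)
print("entries read:", entries_read)
```

The number of entries read always equals the size of the file, whatever is being searched for. A database management system avoids this by keeping extra structure alongside the raw records.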
number and title. A relational database is also created to store the same
information, in which the data are structured as a number of tables. In each table,
data that fit a particular criterion are grouped together. Different tables can be
linked by common data categories, which facilitates finding specific information.
Relational database

Table A
Student No#    Name             State
1              Dhawale Rahmi    Maharashtra

Table B
Student No#    Course#
1              PPT-301

Table C
Course#        Course name
PPT-301        Plant pathology
For example, suppose one asks: which courses are students from the state
‘Maharashtra’ taking? The database first finds the “State” field in Table A and
looks up ‘Maharashtra’. This returns students 1 and 5. The student numbers are
also listed in Table B, in which students 1 and 5 correspond to PPT-301 and
ABT-517, respectively. The course names listed by course numbers are found in
Table C. By going to Table C, the exact course names corresponding to the course
numbers can be retrieved. A final report is then given showing that the students of
‘Maharashtra’ are taking the courses ‘Plant pathology’ and ‘Microbiology’.
However, executing the same query through the flat file requires the computer to
read through the entire text file word by word, store the information in a
temporary memory space, and then mark the data records containing the word
‘Maharashtra’. This is easily accomplished for a small database, but performing
queries in a large database using flat files becomes an enormous task for the
computer system.
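The three-table lookup described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (table_a, student_no, and so on) are assumptions, and the names given to students 2 and 5 are hypothetical placeholders, since the manual lists only student 1 by name.

```python
import sqlite3

# In-memory database holding the three tables from the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (student_no INTEGER PRIMARY KEY, name TEXT, state TEXT);
    CREATE TABLE table_b (student_no INTEGER, course_no TEXT);
    CREATE TABLE table_c (course_no TEXT PRIMARY KEY, course_name TEXT);
""")
conn.executemany("INSERT INTO table_a VALUES (?, ?, ?)", [
    (1, "Dhawale Rahmi", "Maharashtra"),
    (2, "Student Two", "Odisha"),         # hypothetical row for contrast
    (5, "Student Five", "Maharashtra"),   # hypothetical name
])
conn.executemany("INSERT INTO table_b VALUES (?, ?)", [
    (1, "PPT-301"), (2, "PPT-301"), (5, "ABT-517"),
])
conn.executemany("INSERT INTO table_c VALUES (?, ?)", [
    ("PPT-301", "Plant pathology"), ("ABT-517", "Microbiology"),
])


def courses_for_state(connection, state):
    """Follow State -> Student No# -> Course# -> Course name via two joins."""
    rows = connection.execute("""
        SELECT DISTINCT c.course_name
        FROM table_a AS a
        JOIN table_b AS b ON a.student_no = b.student_no
        JOIN table_c AS c ON b.course_no = c.course_no
        WHERE a.state = ?
        ORDER BY c.course_name
    """, (state,)).fetchall()
    return [row[0] for row in rows]


print(courses_for_state(conn, "Maharashtra"))
```

The shared columns (student number and course number) play the role of the common data categories that link the tables, so the query never has to scan unrelated text.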
Object-Oriented Databases
One of the problems with relational databases is that the tables used do not
describe complex hierarchical relationships between data items. To overcome the
problem, object-oriented databases have been developed that store data as objects.
In an object-oriented programming language, an object can be considered as a unit
that combines data and mathematical routines that act on the data. The database is
structured such that the objects are linked by a set of pointers defining
predetermined relationships between the objects. Searching the database involves
navigating through the objects with the aid of the pointers linking different objects.
Programming languages like C++ are used to create object-oriented databases. The
object-oriented database system is more flexible; data can be structured based on
hierarchical relationships. By doing so, programming tasks can be simplified for
data that are known to have complex relationships, such as protein structure data.
However, this type of database system lacks the rigorous mathematical foundation
of relational databases. There is also a risk that some of the relationships between
objects may be misrepresented. Some current databases have therefore
incorporated features of both types of database programming, creating the
object-relational database management system.
The students' course information above can be used to construct an
object-oriented database. In this case, three objects are constructed: a student
object, a course object, and a state object. They are linked by pointers, shown as
arrows in the figure; for simplicity, some of the pointers are omitted. Finding
specific information relies on navigating through the objects by way of the
pointers. To answer the same question (which courses are students from
‘Maharashtra’ taking?), one simply needs to start from ‘Maharashtra’ in the state
object, which has pointers that lead to students 1 and 5 in the student object.
Further pointers in the student object point to the course each of the two students is
taking. Therefore, a simple navigation through the linked objects provides a final
report.
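The navigation just described can be mimicked with ordinary Python objects, where object references stand in for the pointers. This is a minimal sketch rather than a real object-oriented database; the class and attribute names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Course:
    number: str
    name: str


@dataclass
class Student:
    number: int
    courses: list = field(default_factory=list)   # pointers to Course objects


@dataclass
class State:
    name: str
    students: list = field(default_factory=list)  # pointers to Student objects


plant_pathology = Course("PPT-301", "Plant pathology")
microbiology = Course("ABT-517", "Microbiology")
student_1 = Student(1, [plant_pathology])
student_5 = Student(5, [microbiology])
maharashtra = State("Maharashtra", [student_1, student_5])


def courses_taken_in(state):
    """Start at the state object and follow pointers to students, then courses."""
    names = []
    for student in state.students:
        for course in student.courses:
            if course.name not in names:
                names.append(course.name)
    return names


print(courses_taken_in(maharashtra))
```

No table lookup is involved: the answer is reached purely by following references from one object to the next.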
Chapter 4
Biological Databases
Based on their content, biological databases are divided into primary,
secondary, and specialized databases. Primary databases simply archive sequence
or structure information; secondary databases include further analysis on the
sequences or structures. Specialized databases cater to a particular research interest.
Current biological databases use all three types of database structures: flat
files, relational, and object-oriented. Despite the obvious drawbacks of using flat
files in database management, many biological databases still use this format. The
justification for this is that this system involves a minimum amount of database
design and the search output can be easily understood by working biologists.
(I) Primary Databases
There are three major public sequence databases that store raw nucleic acid
sequence data produced and submitted by researchers worldwide: GenBank,
European Molecular Biology Laboratory (EMBL) database and DNA Data Bank of
Japan (DDBJ), which are all freely available on the Internet. Most of the data in the
databases are contributed directly by authors with a minimal level of annotation. A
small number of sequences, especially those published in the 1980s, were entered
manually from published literature by database management staff. Presently,
sequence submission to GenBank, EMBL, or DDBJ is a precondition for
publication in most scientific journals, to ensure that the fundamental molecular
data are made freely available. These three public databases closely collaborate and
exchange new data daily. They together constitute the International Nucleotide
Sequence Database Collaboration. This means that by connecting to any one of the
three databases, one should have access to the same nucleotide sequence data.
Although the three databases all contain the same sets of raw data, each of the
individual databases has a slightly different kind of format to represent the data.
Fortunately, for the three-dimensional structures of biological macromolecules,
there is only one centralized database, the PDB. This database archives atomic
plant, fungal, and algal sequences; PRI for primate sequences; MAM for non-
primate mammalian sequences; BCT for bacterial sequences; and EST for EST
sequences. Next to the division is the date when the record was made public (which
is different from the date when the data were submitted). The following line,
“DEFINITION,” provides the summary information for the sequence record
including the name of the sequence, the name and taxonomy of the source
organism if known, and whether the sequence is complete or partial. This is
followed by an accession number for the sequence, which is a unique identifier
assigned to a piece of DNA when it is first submitted to GenBank and is
permanently associated with that sequence. This is the identifier that should be
cited in publications. Classic nucleotide accessions have two formats: one letter
followed by five digits or two letters followed by six digits. For a nucleotide
sequence that has been translated into a protein sequence, a new accession number
is given in the form of a string of alphanumeric characters. In addition to the
accession number, there is also a version number and a GenInfo Identifier (gi)
number. The purpose of these numbers is to identify the current version of the
sequence. If the sequence is revised at a later date, the accession number remains
the same, but the version number is incremented and a new gi number is assigned.
A translated protein sequence also has a different gi number from the DNA
sequence it is derived from.
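As a small illustration of how the accession and version numbers relate, the sketch below splits an identifier written in the accession.version style into its two parts. The function names are hypothetical, and the example identifier NM_000539.3 is simply borrowed from a record shown elsewhere in this manual.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version number)."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version form: {identifier!r}")
    return accession, int(version)


def same_record(identifier_a, identifier_b):
    """Two identifiers refer to the same record if their accessions match."""
    return (split_accession_version(identifier_a)[0]
            == split_accession_version(identifier_b)[0])


print(split_accession_version("NM_000539.3"))
print(same_record("NM_000539.2", "NM_000539.3"))
```

The accession part stays fixed across revisions, so comparing it alone is enough to tell whether two versioned identifiers point at the same record.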
The next line in the Header section is the “ORGANISM” field, which
gives the source organism with the scientific name of the species and
sometimes the tissue type. Along with the scientific name is the
taxonomic classification of the organism. Different levels of the classification are
hyperlinked to the NCBI taxonomy database with more detailed descriptions. This
is followed by the “REFERENCE” field, which provides the publication citation
related to the sequence entry. The REFERENCE part includes author and title
information of the published work (or tentative title for unpublished work). The
“JOURNAL” field includes the citation information as well as the date of sequence
submission. The citation is often hyperlinked to the PubMed record for access to
the original literature information. The last part of the Header is the contact
information of the sequence submitter.
Features section
The “Features” section includes annotation information about the gene and
gene product, as well as regions of biological significance reported in the sequence,
with identifiers and qualifiers. The “Source” field provides the length of the
sequence, the scientific name of the organism, and the taxonomy identification
number. Some optional information includes the clone source, the tissue type and
the cell line. The “gene” field gives information about the nucleotide coding
sequence and its name. For DNA entries, there is a “CDS” field, which gives the
boundaries of the sequence that can be translated into amino acids. For eukaryotic
DNA, this field also contains the locations of exons, and the translated protein
sequence is included. The third section of the flat file
is the sequence itself starting with the label “ORIGIN.” The format of the sequence
display can be changed by choosing options at a Display pull-down menu at the
upper left corner. For DNA entries, there is a BASE COUNT report that includes
the numbers of A, G, C, and T in the sequence. This section, for both DNA and
protein sequences, ends with two forward slashes (the “//” symbol). In retrieving
DNA or protein sequences from GenBank, the search can be limited to different
fields of annotation such as “organism,” “accession number,” “authors,” and
“publication date.” Alternatively, a number of search qualifiers can be used, each
defining one of the fields in a GenBank file. The qualifiers are similar to but not the
same as the field tags in PubMed. For example, in GenBank, [GENE] represents
field for gene name, [AUTH] for author name, and [ORGN] for organism name.
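A query that uses these qualifiers is just a text string, so it can be assembled programmatically. The helper names below are hypothetical; only the qualifiers [GENE] and [ORGN] and the Boolean operator come from the text, and the search values are example inputs.

```python
def field_term(value, tag):
    """Attach a field qualifier to a search value, e.g. glucokinase[GENE]."""
    return f"{value}[{tag}]"


def combine(terms, operator="AND"):
    """Join qualified terms with a Boolean operator."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(terms)


query = combine([
    field_term("glucokinase", "GENE"),
    field_term("Homo sapiens", "ORGN"),
])
print(query)  # glucokinase[GENE] AND Homo sapiens[ORGN]
```

The resulting string can be pasted into the GenBank search box as it stands.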
Alternative Sequence Formats: FASTA
In addition to the GenBank format, there are many other sequence formats.
FASTA is one of the simplest and the most popular sequence formats because it
contains plain sequence information that is readable by many bioinformatics
analysis programs. It has a single definition line that begins with a right angle
bracket (>) followed by a sequence name. Sometimes, extra information such as gi
number or comments can be given, which are separated from the sequence name by
a “|” symbol. The extra information is considered optional and is ignored by
sequence analysis programs. The plain sequence in standard one-letter symbols
starts in the second line. Each line of sequence data is limited to sixty to eighty
characters in width. The drawback of this format is that much annotation
information is lost.
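The FASTA layout is simple enough to write and read with a few lines of Python. The sketch below is a minimal illustration, not a full-featured parser; the function names, the sample record name and the line width of 60 are assumptions. Following the description above, anything after the first “|” on the definition line is treated as optional and dropped.

```python
def to_fasta(name, sequence, width=60):
    """Write one record: a '>' definition line, then fixed-width sequence lines."""
    lines = [f">{name}"]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Return {sequence name: sequence} for every record in a FASTA string."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the sequence name; extra fields after '|' are optional.
            name = line[1:].split("|")[0].strip()
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


record = to_fasta("demo_seq|example comment", "ACGT" * 40)
print(record)
print(parse_fasta(record))
```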
FEATURES             Location/Qualifiers
     source          1..348
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3q21-q24"
     Protein         1..348
                     /product="rhodopsin"
                     /note="opsin 2, rod pigment; opsin-2"
                     /calculated_mol_wt=38762
     Region          2..37
                     /region_name="Rhodopsin_N"
                     /note="Amino terminal of the G-protein receptor rhodopsin;
                     pfam10413"
                     /db_xref="CDD:150994"
     Site            37..61
                     /site_type="transmembrane region"
     CDS             1..348
                     /gene="RHO"
                     /gene_synonym="CSNBAD1; OPN2; RP4"
                     /coded_by="NM_000539.3:96..1142"
                     /db_xref="CCDS:CCDS3063.1"
                     /db_xref="GeneID:6010"
                     /db_xref="HGNC:10012"
                     /db_xref="HPRD:01584"
                     /db_xref="MIM:180380"
ORIGIN
        1 mngtegpnfy vpfsnatgvv rspfeypqyy laepwqfsml aaymfllivl gfpinfltly
       61 vtvqhkklrt plnyillnla vadlfmvlgg ftstlytslh gyfvfgptgc nlegffatlg
      121 geialwslvv laieryvvvc kpmsnfrfge nhaimgvaft wvmalacaap plagwsryip
      181 eglqcscgid yytlkpevnn esfviymfvv hftipmiiif fcygqlvftv keaaaqqqes
      241 attqkaekev trmviimvia flicwvpyas vafyifthqg snfgpifmti paffaksaai
      301 ynpviyimmn kqfrncmltt iccgknplgd deasatvskt etsqvapa
//
Figure: Swissprot protein database (storing information of Rhodopsin protein)
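The sequence block of such a record can be recovered with a short script: everything between the “ORIGIN” label and the closing “//” is sequence, once the position numbers and spaces are removed. The sketch below is a minimal illustration with a hypothetical function name; the text it reads is the ORIGIN block of the rhodopsin record in the figure.

```python
RECORD_TAIL = """\
ORIGIN
1 mngtegpnfy vpfsnatgvv rspfeypqyy laepwqfsml aaymfllivl gfpinfltly
61 vtvqhkklrt plnyillnla vadlfmvlgg ftstlytslh gyfvfgptgc nlegffatlg
121 geialwslvv laieryvvvc kpmsnfrfge nhaimgvaft wvmalacaap plagwsryip
181 eglqcscgid yytlkpevnn esfviymfvv hftipmiiif fcygqlvftv keaaaqqqes
241 attqkaekev trmviimvia flicwvpyas vafyifthqg snfgpifmti paffaksaai
301 ynpviyimmn kqfrncmltt iccgknplgd deasatvskt etsqvapa
//
"""


def read_origin(flat_file_text):
    """Collect the residues between the ORIGIN label and the '//' terminator."""
    residues = []
    in_origin = False
    for line in flat_file_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ORIGIN"):
            in_origin = True
            continue
        if stripped == "//":
            break
        if in_origin:
            # Keep letters only: this drops position numbers and spaces.
            residues.extend(ch for ch in stripped if ch.isalpha())
    return "".join(residues).upper()


sequence = read_origin(RECORD_TAIL)
print(len(sequence))   # matches the 1..348 range given in the source feature
print(sequence[:10])
```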
TrEMBL:
To accommodate the growing influx of protein sequences without
compromising the quality of SWISS-PROT, the protein translations of the EMBL
nucleotide sequences that have not been properly curated by human annotators are
put into a supplemental database, TrEMBL (Translated EMBL,
http://www.expasy.org/sprot). This database serves as a kind of purgatory (or a
“halfway house”) for SWISS-PROT. Each TrEMBL entry is assigned a SWISS-
PROT-type accession number that would stay with it when the sequence is finally
manually checked and accepted into SWISS-PROT. To simplify curation, TrEMBL
entries are even formatted in the SWISS-PROT style. However, one should be alert
to the fact that TrEMBL entries are generated automatically, so their quality is not
guaranteed and their annotations should not be considered as solid as those of
authentic SWISS-PROT entries.
PIR:
The PIR (Protein Information Resource, http://pir.georgetown.edu) database
is an outgrowth of the Protein Sequence Database, originally created by Margaret
Dayhoff, and is currently maintained at Georgetown University in collaboration
with the Munich Information Center for Protein Sequences (MIPS,
http://mips.gsf.de/proj/protseqdb) in Munich, Germany and the Japanese
International Protein Information Database. While technically also a curated
database, PIR is far less rigorous than SWISS-PROT in maintaining the quality of
its annotations. The advantage of PIR, however, is in its hierarchical organization.
The definitions of protein family and super-family employed in PIR are much
narrower than those used in most of the other protein databases, particularly
motif-based and structure-based ones. Thus, PIR super-families are often composed
of very similar proteins, which may be treated by other databases as members of
the same family. As a result, more distant relations between proteins (the least
trivial and therefore the most interesting ones) are often not represented in PIR at
all. Recently, PIR has intensified its protein classification efforts with the creation
of iProClass (http://pir.georgetown.edu/iproclass), a protein classification database.
Educators and students in genetics and cellular biology comprise another large
community that SGD serves, as do bioinformatics scientists who perform genome-
wide computational analyses, for either yeast or comparative studies.
ACeDB:
ACeDB is a genome database system started in 1989 by Jean Thierry-Mieg
(CNRS, Montpellier) and Richard Durbin (Sanger Institute). It was originally
developed for the Caenorhabditis elegans genome project from which its name was
derived: A C. elegans DataBase. However, the tools in it have been generalized to
be much more flexible, and the same software is now used for many different
genomic databases, from bacteria to fungi to plants to humans.
Arabidopsis Information Resource (TAIR):
The Arabidopsis Information Resource (TAIR) collects information and
maintains a database of genetic and molecular biology data for Arabidopsis
thaliana, a widely used model plant. TAIR is managed by the nonprofit Phoenix
Bioinformatics Corporation and is supported through institutional, lab and personal
subscriptions. Prior funding was provided by the National Science Foundation.
The data in TAIR can be searched and viewed using the GBrowse or interactive
SeqViewer genome browsers.
FlyBase:
FlyBase is an online bioinformatics database and the primary repository of
genetic and molecular data of the extensively studied species and model organism,
Drosophila melanogaster. A wide range of data are presented in different formats.
Information in FlyBase originates from a variety of sources ranging from large-
scale genome projects to the primary research literature. These data types include
mutant phenotypes, molecular characterization of mutant alleles and other
deviations, cytological maps, wild-type expression patterns, anatomical images,
transgenic constructs and insertions, sequence-level gene models and molecular
classification of gene product functions. Query tools allow navigation of FlyBase
through DNA or protein sequence, by gene or mutant name, or through functional,
phenotypic, and anatomical data. The database offers several different query tools
in order to provide efficient access to the data available and facilitate the discovery
of significant relationships within the database. The FlyBase project is carried out
by a consortium of Drosophila researchers and computer scientists at Harvard
University and Indiana University in the United States, and University of
Cambridge in the United Kingdom.
Gramene:
Gramene (http://www.gramene.org/) is a curated, open-source,
integrated data resource for comparative functional genomics in crops and model
plant species. The Gramene database became a resource for major model and crop
plants including Arabidopsis, Brachypodium, maize, sorghum, poplar and grape in
addition to several species of rice. Gramene began with the addition of an Ensembl
genome browser and has expanded in the last decade to become a robust resource
for plant genomics hosting a wide array of data sets including quantitative trait loci
(QTL), metabolic pathways, genetic diversity, genes, proteins, germplasm,
literature, ontologies and a fully-structured markers and sequences database
integrated with genome browsers and maps from various published studies
(genetic, physical, bin, etc.). In addition, Gramene now hosts a variety of web
services including a Distributed Annotation Server (DAS), BLAST and a public
MySQL database. Twice a year, Gramene releases a major build of the database
and makes interim releases to correct errors or to make important updates to
software and/or data. Gramene currently hosts annotated whole genomes in over
two dozen plant species and partial assemblies for almost a dozen wild rice species.
Online Mendelian Inheritance in Man (OMIM)
Online Mendelian Inheritance in Man (OMIM) is a timely, authoritative
compendium of bibliographic material and observations on inherited disorders and
human genes. It is continuously updated. Curation of the database and editorial
decisions take place at The Johns Hopkins University School of Medicine. OMIM
provides authoritative free text overviews of genetic disorders and gene loci that
can be used by clinicians, researchers, students, and educators. In addition, OMIM
has many rich connections to relevant primary data resources such as bibliographic,
sequence, and map information.
Chapter 5
Database retrieval system
Databases are fundamental to modern biological research, especially to
genomic studies. The goal of a biological database is twofold: information retrieval
and knowledge discovery.
Entrez:
Entrez (http://www.ncbi.nlm.nih.gov/) is a powerful federated search
engine, or web portal, that allows users to search for scientific information, DNA,
RNA and protein sequences, structures, and bibliographic references. It is provided
by the National Center for Biotechnology Information (NCBI), a part of the
National Library of Medicine (NLM), which is itself a department of the
National Institutes of Health (NIH), which in turn is a part of the United States
Department of Health and Human Services. The name "Entrez" (a greeting
meaning "Come in!" in French) was chosen to reflect the spirit of welcoming the
public to search the content available from the NLM.
Entrez Global Query is an integrated search and retrieval system that
provides access to all databases simultaneously with a single query string and user
interface. Entrez can efficiently retrieve related sequences, structures, and
references. The Entrez system can provide views of gene and protein sequences and
chromosome maps. Some textbooks are also available online through the Entrez
system. The databases accessible through Entrez are among the most integrated
databases. Effective information retrieval involves the use of Boolean operators
(AND, OR, NOT). Entrez has additional user-friendly features to help conduct
complex searches. One such option is to use Limits, Preview/Index, and History to
narrow down the search space. Alternatively, one can use NCBI-specific field
qualifiers to conduct searches. To retrieve sequence information from NCBI
GenBank, an understanding of the format of GenBank sequence files is necessary.
It is also important to bear in mind that sequence data in these databases are less
than perfect. There are sequence and annotation errors. Biological databases are
also plagued by redundancy problems. There are various solutions to correct
annotation and reduce redundancy, for example, merging redundant sequences into
a single entry or storing highly redundant sequences in a separate database.
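Merging redundant sequences into a single entry can be pictured with a small sketch. It assumes the simplest possible definition of redundancy, namely sequences that are identical after ignoring case and whitespace; real databases use more elaborate criteria. The record identifiers and sequences are made up for illustration.

```python
def merge_redundant(records):
    """Group identifiers that share an identical sequence under one entry."""
    merged = {}
    for identifier, sequence in records:
        key = "".join(sequence.split()).upper()   # normalise case and spacing
        merged.setdefault(key, []).append(identifier)
    return merged


# Hypothetical records: the first two differ only in letter case.
records = [
    ("entry_1", "ACGTACGT"),
    ("entry_2", "acgtacgt"),
    ("entry_3", "ACGTTTGT"),
]
merged = merge_redundant(records)
for sequence, identifiers in merged.items():
    print(sequence, identifiers)
```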
Sequence retrieval system
Sequence retrieval system (SRS; available at http://srs6.ebi.ac.uk/) is a
retrieval system maintained by the EBI, which is comparable to NCBI Entrez. It is
not as integrated as Entrez, but allows the user to query multiple databases
simultaneously, another good example of database integration. It also offers direct
access to certain sequence analysis applications such as sequence similarity
searching and Clustal sequence alignment. Queries can be launched using “Quick
Text Search” with only one query box in which to enter information. There are also
more elaborate submission forms, the “Standard Query Form” and the “Extended
Query Form.” The standard form allows four criteria (fields) to be used, which are
linked by Boolean operators. The extended form allows many more diversified
criteria and fields to be used. The search results contain the query sequence and
sequence annotation as well as links to literature, metabolic pathways, and other
biological databases.
Chapter 6
Cataloging biological databases
Primary nucleotide sequence database
The Primary Nucleotide Sequence Database consists of the following
databases.
• DNA Data Bank of Japan (National Institute of Genetics)
• European Nucleotide Archive (European Bioinformatics Institute)
• GenBank (National Center for Biotechnology Information)
The three databases, DDBJ (Japan), GenBank (USA) and European
Nucleotide Archive (Europe), are repositories for nucleotide sequence data from
all organisms. All three databases accept nucleotide sequence submissions, and
then exchange new and updated data on a daily basis to achieve optimal
synchronization between them. These three databases are primary databases, as
they house original sequence data.
Meta database:
These databases of databases collect data from different sources and make
them available in new and more convenient form, or with an emphasis on a
particular disease or organism.
• BioGraph - A knowledge discovery service based on the integration of
  more than 20 heterogeneous databases
• Bioinformatic Harvester - Integrating 26 major protein/gene resources.
• Neuroscience Information Framework (University of California San
  Diego) - Integrates hundreds of neuroscience relevant resources, many are
  listed below.
• Entrez (National Center for Biotechnology Information)
• Enzyme Portal - Integrates enzyme information such as small-molecule
  chemistry, biochemical pathways and drug compounds. (European
  Bioinformatics Institute)
• MetaBase (KOBIC) - A user contributed database of biological databases.
• PDBsum
Protein model databases:
• Swiss-model - Server and Repository for Protein Structure Models
• ModBase - Database of Comparative Protein Structure Models
  (Sali Lab, UCSF)
• Protein Model Portal (PMP) - Meta database that combines several
  databases of protein structure models (Biozentrum, Basel, Switzerland)
RNA databases
• Rfam, a database of RNA families
• miRBase, the microRNA database
• snoRNAdb, a database of snoRNAs
• lncRNAdb, a database of lncRNAs
• piRNAbank, a database of piRNAs
• GtRNAdb, a database of genomic tRNAs
• SILVA, a database of ribosomal RNAs
• RDP, the Ribosomal Database Project
Carbohydrate structure databases
• EuroCarbDB, a repository for both carbohydrate sequences/structures and
  experimental data.
Protein-protein interactions:
• BIND - Biomolecular Interaction Network Database
• BioGRID - A General Repository for Interaction Datasets (Samuel
  Lunenfeld Research Institute)
• CCSB Interactome
• DIP - Database of Interacting Proteins
• IntAct molecular interaction database - a central, standards-compliant
  repository of molecular interactions, including protein-protein,
  protein-small molecule and protein-nucleic acid interactions.
• NetPro
• STRING - A database of known and predicted protein-protein
  interactions. (EMBL)
Chapter 7
Pairwise Sequence Alignment
In this document we illustrate how to perform pairwise sequence
alignments using the Biostrings package through the use of the pairwiseAlignment
function. This function aligns a set of pattern strings to a subject string in a global,
local, or overlap (ends-free) fashion, with or without affine gaps, using either a
fixed or quality-based substitution scoring scheme.
Each of these pairwise sequence alignment problems is solved by
maximizing the alignment score. An alignment score is determined by the type of
pairwise sequence alignment (global, local, overlap), which sets the ranges for the
substrings; the substitution scoring scheme, which sets the distance between
aligned characters; and the gap penalties, which are divided into opening and
extension components. The optimal pairwise sequence alignment is the pairwise
sequence alignment with the largest score for the specified alignment type,
substitution scoring scheme, and gap penalties.
There are 3 methods for pairwise sequence alignment:
1) dot plot, 2) global alignment, and 3) local alignment.
Dot Plot
The simplest method is the dot plot. One sequence is written out
horizontally, and the other sequence is written out vertically, along the top and side
of an m x n grid, where m and n are the lengths of the two sequences. A dot is
placed in a cell in the grid wherever the two sequences match. A diagonal line in
the grid visually shows where the two sequences have sequence identity. Web-
based dot plot implementations can be found here:
http://www.vivo.colostate.edu/molkit/dnadot/ – for nucleotide sequence only
http://emboss.bioinformatics.nl/cgi-bin/emboss/dotmatcher - for both nucleic
acid & protein sequence with standard EMBOSS scoring matrices
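The dot plot described above can also be produced in a few lines of Python. This is a minimal sketch that marks exact single-character matches only (no window or threshold, which the web tools may apply); the function names and the two short test sequences are assumptions for illustration.

```python
def dot_plot(seq_a, seq_b):
    """Return a grid of booleans: rows follow seq_b, columns follow seq_a."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the grid as text, with '*' where the two sequences match."""
    grid = dot_plot(seq_a, seq_b)
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, grid):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATCACA"))
```

Runs of '*' along a diagonal mark the stretches where the two sequences are identical.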
Global Alignment:
The algorithm published by Needleman and Wunsch in 1970 for alignment
of two protein sequences was the first application of dynamic programming to
biological sequence analysis. The Needleman-Wunsch algorithm finds the best-
scoring global alignment between two sequences. Global alignments are most
useful when the two sequences being compared are of similar lengths, and not too
divergent.
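A compact version of global alignment by dynamic programming is sketched below. It is written in the spirit of Needleman-Wunsch but makes simplifying assumptions that are not taken from the manual: a fixed score of +1 for a match, -1 for a mismatch, and a linear penalty of -1 per gap position. The function name and test sequences are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Because the alignment is global, every residue of both sequences appears in the output, padded with '-' where a gap was introduced.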
Local Alignment:
Real life is often complicated, and we observe that genes, and the proteins
they encode, have undergone exon-shuffling, recombination, insertions, deletions,
and even fusions. Many proteins exhibit modular architecture. In searching
databases for similar sequences, it is useful to find sequences that have similar
domains or functional motifs. Smith & Waterman (1981) published an application
of dynamic programming to find optimal local alignments. The algorithm is similar
to Needleman-Wunsch, but negative cell values are reset to zero, and the traceback
procedure starts from the highest-scoring cell.
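The two changes that distinguish local from global alignment (resetting negative cells to zero, and tracing back from the highest-scoring cell) can be seen in the sketch below. As before, the scoring values (+1 match, -1 mismatch, -1 per gap position), the function name and the test sequences are assumptions made for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Negative values are reset to zero, which lets an alignment restart.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    aligned_a, aligned_b = [], []
    i, j = best_cell
    while i > 0 and j > 0 and score[i][j] > 0:
        diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diagonal:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

Only the shared region is reported; the unrelated flanking residues are left out of the alignment.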
Scoring Matrices
The Needleman-Wunsch and Smith-Waterman algorithms require a
scoring matrix. The scoring matrix assigns a positive score for a match, and a
penalty for a mismatch. For nucleotide sequence alignments, the simplest scoring
matrix awards +1 for a match, and -1 for a mismatch. The blastn algorithm at NCBI
scores +5 for a match and -4 for a mismatch. These scoring matrices treat all
mutations (mismatches) equally. In reality, transitions (pyrimidine -> pyrimidine
and purine -> purine) occur much more frequently than transversions (pyrimidine
-> purine and vice versa). For aligning non-protein coding DNA sequences, a
Gap penalty
Sequence alignments usually require insertion of gaps, reflecting insertion
or deletion mutations. If a nucleotide or amino acid in one sequence is aligned to a
gap in the target sequence, then this should be penalized as a mismatch. However,
gaps at the ends of sequences should perhaps not incur any penalty. Moreover, a
single insertion or deletion mutation could result in a contiguous gap of multiple
residues. Therefore, a single gap that is 3 residues long should incur less penalty
than 3 different gaps, of one residue each.
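One common way to make a single long gap cheaper than several short ones is an affine gap penalty, with a larger cost for opening a gap and a smaller cost for each position it spans. The sketch below assumes the convention penalty = opening + extension x length; the numerical values are hypothetical and other conventions (for example, charging the extension only from the second position) are also in use.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one contiguous gap of the given length (0 for no gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)  # 13 33
```

With these example values, one gap of three residues is penalized far less than three separate gaps of one residue each, which is the behaviour argued for above.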
Chapter 8
Multiple sequence alignment
Multiple Sequence Alignment (MSA) is a sequence alignment of three or
more biological sequences, generally protein, DNA, or RNA. In many cases, the
input set of query sequences is assumed to have an evolutionary relationship by
which they share a lineage and are descended from a common ancestor. From the
resulting MSA, sequence homology can be inferred and phylogenetic analysis can
be conducted to assess the sequences' shared evolutionary origins. Visual
depictions of the alignment illustrate mutation events such
as point mutations (single amino acid or nucleotide changes) that appear as
differing characters in a single alignment column, and insertion or deletion
mutations (indels or gaps) that appear as hyphens in one or more of the sequences
in the alignment. Multiple sequence alignment is often used to assess sequence
conservation of protein domains, tertiary and secondary structures, and even
individual amino acids or nucleotides.
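Once an alignment is available, conservation can be estimated column by column. The sketch below uses one simple measure, the fraction of sequences that carry the most common residue in each column, with gaps counted as non-matching. Both the measure and the three-sequence toy alignment are assumptions made for illustration; it does not compute the alignment itself.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, per column, the fraction of rows sharing the commonest residue."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [symbol for symbol in column if symbol != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Hypothetical alignment: a point mutation in column 4, a gap in column 2.
toy_alignment = ["ACGT", "ACGA", "A-GT"]
scores = column_conservation(toy_alignment)
print([round(value, 2) for value in scores])
```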
Multiple sequence alignment also refers to the process of aligning such a
sequence set. Because three or more sequences of biologically relevant length can
be difficult and are almost always time-consuming to align by hand, computational
algorithms are used to produce and analyze the alignments. MSAs require more
sophisticated methodologies than pairwise alignment because they are more
computationally complex. Most multiple sequence alignment programs use
heuristic methods rather than global optimization because identifying the optimal
alignment between more than a few sequences of moderate length is prohibitively
computationally expensive.
Chapter 9
Practical Exercises
Exercise 1:
Searching the scientific literature and sequences
Theory:
The most fundamental skill in bioinformatics is the ability to carry out an
efficient and comprehensive search of the scientific literature to find out what is
known about a specific subject. All of you are familiar with web search engines and
while they can be useful, they also turn up many items that have never undergone
the test of scientific peer review. Thus, this exercise is NOT a search of the World
Wide Web, but will introduce you to searching the published scientific literature
using a database such as MEDLINE, Biological Abstracts or Chemical Abstracts.
This exercise will focus on the ‘Entrez browser’ entry to the National Library of
Medicine database MEDLINE (PubMed).
PubMed is a database service of the National Library of Medicine that cites
articles from MEDLINE and life science journals.
Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox).
2. In the address bar, type the URL (http://www.ncbi.nlm.nih.gov/pubmed) and
press the ‘Enter’ key on your keyboard.
The home page of the site (here PubMed) will appear. A search window and a
text box will be displayed, where you will type a few key words relevant to
your search topic.
3. To search scientific or bibliographic literature in PubMed, type key word(s) or
phrase(s) into the query box (e.g., a subject, author and/or journal).
4. For any entry in the Results list, click the associated author names.
Search details, located in the right navigation column, provide information on
how PubMed ran a search. PubMed looks first for the entire word or phrase as
a MeSH term, then for journal titles, then authors. PubMed also searches “All
Fields” for the term. Search details shows how PubMed maps terms to MeSH
headings and subheadings. Changes to the search may be made in the Details
box; click Search to run the updated search strategy.
5. Save what you like to your hard drive by choosing your browser’s File: Save
as option.
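The same kind of query can also be prepared programmatically. The following is a minimal sketch, assuming NCBI's E-utilities ESearch endpoint and the PubMed field tags [au] (author) and [ta] (journal title). The helper names build_pubmed_term and build_esearch_url and the example author are illustrative and not part of any NCBI library. The code only builds the URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_term(subject=None, author=None, journal=None):
    """Combine a subject, author and/or journal into one PubMed query string."""
    parts = []
    if subject:
        parts.append(subject)            # untagged: PubMed maps it to MeSH / All Fields
    if author:
        parts.append(f"{author}[au]")    # restrict to the author field
    if journal:
        parts.append(f"{journal}[ta]")   # restrict to the journal title field
    return " AND ".join(parts)


def build_esearch_url(term, retmax=20):
    """Return an ESearch URL that would list up to `retmax` matching PubMed IDs."""
    return ESEARCH_URL + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})


term = build_pubmed_term(subject="glucokinase", author="Smith J")
print(term)
print(build_esearch_url(term))
```

Pasting the printed URL into a browser should return an XML list of PubMed identifiers for the matching articles.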
Exercise-2:
Characterization of a Known Gene
URL: (http://www3.ncbi.nlm.nih.gov/qguery/gquery.fcgi)
Theory:
In this exercise, you will use ‘Entrez’ to find entries for the coding
sequence of a gene of interest. You will use glucokinase as an initial example
(glucokinase is the enzyme that catalyzes the initial step of glycolysis in liver and
several other cell types).
Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox).
Note how many hits are now listed. You still have entries that are not
glucokinase. To further narrow your search, click on the Limits tab one more
time. In the top left drop-down menu change from All Fields to Title. This will
limit the search to those entries that have glucokinase in their title line. Still,
you will note that your entries include not only glucokinase but also
glucokinase regulatory proteins and other entries that have the term
glucokinase in the title.
Result:
• Clicking on the accession number for one of your entries will bring up the
full nucleotide sequence information. Most of the information in an entry is
self-explanatory, but if you scroll down to the Features entry you should find
a CDS entry. This specifies the part of the nucleotide sequence below that
actually codes for a protein (often you will find untranslated regions at both
the 5' and 3' ends of a sequence). In addition, the translated sequence is given
in the one-letter amino acid shorthand just above the full nucleotide
sequence.
• To obtain the sequence in a form which can be analyzed by a variety of gene
analysis software, select FASTA from the Display pull-down menu. The
browser will give you a page which has the sequence without any line
numbers or breaks. Save the sequence by selecting the material beginning
with the > and going up to the last nucleotide (be sure to avoid the line above
the > and below the last nucleotide) and copying this to a word processor
program. The > line is recognized as a description (header) line by most
analysis software. You can change the font to Courier 10 point to obtain the
proper spacing and lines.
• To obtain the protein sequence, change the Display menu back to the GenBank
display and scroll down until you reach the CDS information. Click on the link
in the line that begins /protein_id= "xxx1234" (i.e. whatever the assigned
protein id number is).
• This will change the display to GenPept and bring up a page which shows
some of the same information, but is limited to the amino acid sequence. In
this page, change the Display menu to FASTA to obtain an output similar to
the nucleotide FASTA output (an index line which begins with > and an
amino acid sequence). You can copy the index line and sequence to a word
processor for use later (once you are in the word processor, again change the
text to Courier 10 pt to retain line spacing).
• SAVE THE PROTEIN FASTA OUTPUTS (glucokinase from mammalian
species of your choice) to a word processor program. You will compare the
sequences of these proteins in a future exercise.
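The FASTA layout described above (a line beginning with > followed by sequence lines) is simple enough to read with a few lines of code. The following is a minimal sketch, not a full-featured parser; the function name parse_fasta and the example record are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record, used only to show the layout.
example = """>example_1 hypothetical record
ATGGCTAAC
TGA
"""
for name, seq in parse_fasta(example):
    print(name, len(seq), seq)
```

Saving sequences as plain text (.txt) rather than in a word processor format avoids hidden formatting characters that such a parser would otherwise read as sequence.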
Exercise 3:
Finding open reading frames (ORFs) using the NCBI ORF Finder
URL: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Theory:
Open reading frames (ORFs) are regions of DNA that encode protein. These
DNA sequences are first transcribed into mRNA and then translated into protein. By
examining the sequence alone, you can determine the sequence of amino acids that
will appear in the final protein. In translation, each codon of three nucleotides
determines which amino acid will be added next to the growing protein chain. It is
therefore important to decide at which nucleotide translation starts and where it
stops, that is, to determine the correct open reading frame. There are three possible
reading frames in each direction, i.e., 1, 2, 3 in the forward direction and -1, -2, -3
in the reverse direction. The reading frame that is used determines which amino
acids will be encoded by a gene. Typically one reading frame is used in translating
a gene (in eukaryotes) and this is often the largest ORF. Once the ORF is known,
the DNA sequence can be translated into the corresponding amino acid sequence.
An ORF starts with an ATG (methionine) in most species and ends in
a stop codon (TAA, TAG or TGA in DNA; UAA, UAG or UGA in mRNA), indicated
by * in the protein sequence.
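The idea of scanning all six frames for ATG ... stop stretches can be sketched in a few lines. This is a simplified illustration under the standard genetic code, not the algorithm used by the NCBI ORF Finder; the function names and the short test sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=1):
    """Return (frame, start, end, codons) for every ATG...stop ORF.

    frame is 1, 2, 3 (forward) or -1, -2, -3 (reverse strand).
    start/end are 0-based offsets on the strand being read; end is
    exclusive and includes the stop codon. codons excludes the stop.
    """
    dna = dna.upper()
    orfs = []
    for sign, strand in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                       # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    codons = (i - start) // 3
                    if codons >= min_codons:
                        orfs.append((sign * (offset + 1), start, i + 3, codons))
                    start = None                    # look for the next ORF
    # Longest ORF first, as in the ORF Finder result list.
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)


print(find_orfs("CCATGGCTTAAGG"))
```

In the toy sequence the only complete ORF (ATG GCT TAA) lies in forward frame 3, because the ATG begins at the third nucleotide.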
Procedure:
1. To browse the World Wide Web, just open your internet browser (Internet
Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type www.ncbi.nlm.nih.gov/gorf/gorf.html and press
‘Enter key’ on your keyboard or click the Go button.
Here one can see a text field to enter the GI or accession number of the query
sequence, a text box to enter the query sequence in FASTA format and a
button to run the ORF finder.
3. Type the nucleotide sequence in the box provided (in FASTA format) or copy
your nucleotide sequence from a .txt file or Word document file and paste the
sequence in the input box.
(FASTA format is the simplest sequence format: it starts with a ‘>’ symbol
followed by the sequence ID and other comments, and then the sequence
itself.)
There is a drop-down menu to select a genetic codon dictionary. It contains 20
different codon dictionaries that contain codons for different organisms and
organelles. Select from the list the one which you want for the search. The
first one is the "standard" code, which is the default. Select the default codon
list ‘Standard’. (For example, in the standard code AUG codes for methionine,
but in the Vertebrate Mitochondrial Code and the Yeast Mitochondrial Code,
AUA also codes for methionine.)
4. Now click the ORF finder button to get the result.
The result shows all six possible reading frames present in the entered
sequence query. One can see the ORFs listed according to their size, along with
the graphical representation of the sequence.
5. Click on the green region which represents the ORF in the sequence, to see the
ORF.
Once you click, it will turn purple, indicating that the particular
ORF is selected. The selected ORF is also indicated in the list. The page also
displays the length and location of the selected ORF.
One can see the sequence of the selected ORF which actually codes for the
protein. The user can find the start codon, stop codon and the total number of
amino acids from the sequence. Now click on the Accept button.
You can also perform a BLAST search for the particular ORF that you
selected. Select the appropriate program and database. Then click on the
BLAST button.
Exercise 4:
Translating an unknown DNA Sequence
URL: http://web.expasy.org/translate/
Theory:
One of the most basic exercises in bioinformatics is determining if a nucleic
acid sequence actually codes for a protein. This is complicated by the fact that you
generally do not know which strand is the coding strand (i.e. whether the sequence
itself or its complementary strand will be transcribed into mRNA) nor the correct
reading frame (whether the sequence should be read three bases at a time starting
with the first nucleotide, the second or the third). Both these questions are resolved
by translating both strands in all three reading frames and looking for the one that
gives the longest amino acid sequence before a stop codon is encountered. Since
there are 64 codons and three of these codons (UAA, UAG and UGA) do not code
for any amino acid (i.e. are stop signals), you expect a stop codon to appear on
average roughly once every 20 codons (64/3 ≈ 21) if you are reading a sequence
in the incorrect frame. However, things are not always that clear cut and it is
possible for an out-of-frame translation to extend to over 100 amino acids before a
stop codon is reached.
In the exercise below you will be given an unknown DNA sequence and
asked to use a web tool to translate the sequence into an amino acid sequence and
hopefully identify the proper reading frame. You will then save this amino acid
sequence to a word processing program for use in the next exercise.
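The six-frame translation described above can be sketched as follows. This is an illustration under the standard genetic code, not the code used by the ExPASy tool; the function names are made up, stop codons are written as * (the ExPASy ‘Compact’ output uses a hyphen instead), and any codon containing a character other than A, C, G or T is written as X.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return {frame: protein} for frames 1, 2, 3 and -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest run of amino acids not interrupted by a stop codon."""
    return max(protein.split("*"), key=len)


for frame, protein in sorted(six_frame_translation("ATGGCTTAA").items()):
    print(frame, protein, longest_open_stretch(protein))
```

Comparing the longest uninterrupted stretch in each frame is the programmatic equivalent of looking for the frame with the longest amino acid sequence before a stop codon.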
Requirement
The sequence might be obtained by sequencing a clone from a cDNA
library or by isolating an amplified DNA fragment from PCR amplification.
Alternatively, you can obtain a sequence from a nucleic acid sequence database, as
studied earlier.
Procedure:
1. To browse the World Wide Web, just open your internet browser (Internet
Explorer, Google Chrome or Mozilla Firefox).
2. In the address bar, type the URL http://web.expasy.org/translate/ and press
‘Enter key’ on your keyboard.
A new window will open to access the translation tool. Translating the DNA
sequence is done by reading the nucleotide sequence three bases at a time and
then looking at a table of the genetic code to arrive at an amino acid sequence.
This program examines the input sequence in all six possible frames (i.e.
reading the given strand and its complementary strand, starting with the
nucleotide at position 1, 2 and 3 separately). What you typically look for in
identifying the proper translation is the frame that gives the longest amino acid
sequence before a stop codon is encountered. (Since there are 64 codons and
three code for nonsense, you expect a stop codon to appear on average roughly
once every 20 codons if you simply read a sequence "out of frame". However,
"on average" is just that, and it is possible to have an incorrect reading frame
give an extended sequence with no stop codons. The next exercise will address
that problem.)
3. Type or paste your sequence in the sequence window of the ExPASy
translation tool.
Under Output format select either ‘Compact’ or ‘Verbose’. ‘Compact’ gives
the amino acid sequence as one-letter codes with stop codons indicated by a
hyphen, whereas ‘Verbose’ gives the amino acid sequence as three-letter codes.
4. Select the Output format by clicking either ‘Compact’ or ‘Verbose’.
5. Click on Translate Sequence.
Often only one reading frame will give you a translation with no stop codons,
but this is not always the case. If you get multiple possible reading frames,
one way to determine which is most likely the true frame is to use the
Conclusion:
You have now been introduced to the use of a translation program to
identify the most probable reading frame and to translate an unknown sequence.
What if none of the six possible reading frames gives an extended amino acid
sequence? This could be due to errors in your sequence (you need to
sequence both strands to ensure an accurate sequence). Or you may have isolated a
non-coding region of DNA (e.g. you know that the 5' and 3' ends of most genes do
not code for protein, but serve regulatory functions). There are many untranslated
regions of DNA (introns, pseudogenes, etc.). You can now take the two amino acid
sequences and determine if either matches any known sequences in the huge
protein sequence database.
Exercise 5:
Identifying a gene using the BLAST program
URL: http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory:
Once you have identified a likely reading frame for your DNA sequence, you
will want to see if it corresponds to any known protein. Alternatively, if you
obtained two reading frames of nearly equal length, you will need to decide which
is correct. To accomplish these tasks, you can compare your sequences to all of the
known protein sequences in the databases using a search tool known as BLAST.
BLAST comes in a variety of formats depending on whether you are using a DNA
sequence or an amino acid sequence and depending on whether you are searching
through nucleotide or protein databases.
You are going to do this exercise twice. First, you will take the longest open
reading frame and use it as a query sequence with BLASTP. After saving those
results, you will then take the next longest amino acid sequence and use it as your
query sequence.
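The choice of BLAST program follows directly from the type of query and the type of database. A small sketch of that decision is given below; the function name and its arguments are illustrative and do not correspond to any NCBI interface.

```python
def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program from the query and database molecule types.

    query_type / database_type: "nucleotide" or "protein".
    translate_both: for nucleotide vs nucleotide, compare six-frame
    translations of both query and database (tblastx) instead of blastn.
    """
    key = (query_type, database_type)
    if key == ("nucleotide", "nucleotide"):
        return "tblastx" if translate_both else "blastn"
    programs = {
        ("protein", "protein"): "blastp",      # protein query, protein database
        ("nucleotide", "protein"): "blastx",   # translated query, protein database
        ("protein", "nucleotide"): "tblastn",  # protein query, translated database
    }
    if key not in programs:
        raise ValueError(f"unknown combination: {key}")
    return programs[key]


print(choose_blast_program("protein", "protein"))
print(choose_blast_program("nucleotide", "protein"))
```

In this exercise the query is a translated amino acid sequence and the database is a protein database, so blastp is the appropriate choice.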
Procedure
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the Protein BLAST [blastp] link.
A search page will appear as shown below.
4. Paste your longest translated sequence into the first box below.
5. Choose UniProtKB/Swiss-Prot from the Choose database pull-down menu.
6. Deselect the Do CD-search box.
Scroll down this page to the Format section. In this section use the pull-down
menus to change the Descriptions to 10 and the Alignments to 10. Change the
Layout to One Window. You will leave the Options section settings on the
default values.
7. Click the BLAST button at the bottom or top of the screen.
A new window will appear that gives an estimate of how long the search will
take and lists conserved domains in your query sequence. You may want to
copy your request ID number, but usually this isn't necessary. After the
indicated time has passed,
8. Press the Format button.
The results of your search will be displayed. If similarity to any known protein
has been found, you will see a color window (which may or may not print)
showing the degree of similarity and the range of similarity. Perfect matches
show up as red, next best as purple, mediocre as green, poor matches as blue
and very poor or no match as black. If you scroll down you will see the best
10 alignments (make sure you have limited this to 10!). If the DNA sequence
has already been identified it should show up as a perfect match (score
generally between 200 and 400, but it could be lower depending on the size of
the peptide analyzed; the E value will be around 10^-50 to 10^-100). The E
value is the number of alignments with a score at least this high that would be
expected to occur by chance in a database of this size; the smaller the E value,
the less likely it is that the match is due to chance.
Copy the line below the color alignment window which shows the sequence
producing the best alignment. This will give you the identifiers (gi number
and other identifying numbers) you will need to download the full protein
from the database for characterization. Save this information.
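The relationship between score and E value can be made concrete with the Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the bit score and m and n are the lengths of the query and of the database. The following is a minimal sketch, assuming raw lengths (BLAST itself applies "effective length" corrections, so reported values differ somewhat); the function names are illustrative.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least `bit_score`."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one such chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# A 300-residue query against a database of 2e8 residues (made-up sizes).
for score in (30, 50, 100, 200):
    e = expect_value(score, 300, 2e8)
    print(score, e, chance_probability(e))
```

For very small E values the probability of a chance match is practically equal to E itself, which is why the two are often used interchangeably when reading BLAST reports.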
Exercise 6:
Finding Domains in Protein Sequences
Theory:
Many proteins which have been classified as "globular" (i.e. folded into a
compact globular shape) appear to be composed of several distinct folded regions
joined by more extended loops of amino acids. These globular sub-regions are
termed "domains" and can range in size from 20 to 300 amino acids. Some domains
have been associated with specific functions (e.g. catalysis of peptide bond
cleavage, ATP binding, etc.), but this association must be tentative since ligand
binding or formation of an active site often takes place at the surface where two
domains interact. Identification of domains can help us to assign a newly
discovered open reading frame to a family of proteins. Domains in a newly
discovered protein can be recognized by sequence homology with known domains
in well characterized proteins, but this is still not a precise science. While new
techniques of analysis are being introduced, at present the most user-friendly
and visual domain identification program is the SMART domain annotation
database.
Procedure
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type
http://smart.embl-heidelberg.de/smart/set_mode.cgi?NORMAL=1 and press
‘Enter key’ on your keyboard.
The requested page appears as shown below.
3. Copy the full sequence of the protein identified in the previous exercise and
paste it into the SMART sequence window.
4. Click the Sequence SMART button.
Depending on how busy the SMART server is, it may take a few minutes for a
result to be returned. BE PATIENT!!
The results will show you a live diagram with the domains within the query
sequence. Each domain has a unique color, shape and annotation.
Scroll down the window to see a table that lists each identified domain
together with its putative (probable) start and end point in your sequence and
the E-value assigned to that identification (the smaller the E-value, the more
likely it is that the identification is not simply due to chance).
5. Click the mouse over the domain on the figure or in the table.
It will bring up the domain name or abbreviation and the amino acid sequence
assigned to this domain at the very bottom of the window. With a PC, right
click on the image to save it as a PNG file. It can be opened in Photoshop or
almost any other image viewer.
6. Click on the domain name
It will bring up more detailed information on the domain.
Pick out one domain to examine in detail.
What are the characteristics (amino acid sequences) that define that domain?
What kinds of proteins contain this domain?
What is the function of that domain?
How similar is your sequence to the defined domain?
Exercise 7:
Nucleotide BLAST (BLASTn):
URL - http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory:
The BLAST (Basic Local Alignment Search Tool) programs have been
designed for speed to find high scoring local alignments. BLAST uses a heuristic
algorithm which seeks local as opposed to global alignments and is therefore able
to detect relationships among sequences which share only isolated regions of
similarity.
BlastN is a pairwise sequence comparison tool developed by NCBI. The
program compares a nucleotide query sequence with a nucleotide sequence
database: it takes nucleotide sequences and compares them against the NCBI
nucleotide databases. It is better at finding sequences similar, but not identical, to
your query.
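The heuristic mentioned above starts from short exact word matches ("seeds") between the query and a database sequence, which are then extended into local alignments. The seeding step alone can be sketched as follows. This is a simplified illustration, not the BLAST implementation; the function name is made up, and the default word size of 11 is assumed here to match the usual blastn setting.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a word of length
    `word_size` occurs identically in both sequences."""
    # Index every word of the query by its starting position.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    # Scan the subject once and look each word up in the index.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            seeds.append((i, j))
    return seeds


# Toy sequences with a short word size so that a match is visible.
print(find_seeds("ACGTAC", "TTACGTTT", word_size=4))
```

Because only regions around seeds are examined further, most of the database is never aligned in full, which is where the speed of the method comes from. A smaller word size finds more seeds and is more sensitive but slower.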
Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the Nucleotide BLAST [blastn] link.
A search page will appear as shown below.
4. Paste your nucleotide sequence into the first box below.
5. Choose the nr database from the Choose database pull-down menu.
Set any other options as required; otherwise leave them unchanged and the
program will use all default options.
6. Deselect the Do CD-search box.
Scroll down this page to the Format section. In this section use the pull-down
menus to change the Descriptions to 10 and the Alignments to 10. Change the
Layout to One Window. You will leave the Options section settings on the
default values and will address these choices in a more advanced exercise.
7. Click the BLAST button at the bottom or top of the screen.
8. Then click the BlastN option at the end of the submission page.
After a few seconds the result of your BLAST search will appear in a new
window. The first part shows a Graphic View of the matches, followed by a
list of the matches and then the Individual Alignments. The result page
displays a number of hits. Out of the large number of sequences in the
database, the hits are ranked on the basis of lowest E-value. The hits with the
lowest E-values are the most similar to the query.
Exercise 8:
Protein BLAST (Blastp):
URL - http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory:
BlastP is a pairwise sequence comparison tool developed by NCBI. The
program compares an amino acid query sequence of a protein with the amino acid
sequences of a protein database: it takes amino acid sequences and compares them
against the NCBI protein databases. The program helps to infer the structures
and functions of proteins.
BlastP uses the BLAST algorithm to compare an amino acid query
sequence against a protein sequence database.
Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the Protein BLAST [blastp] link.
A search page will appear as shown below.
4. Paste the amino acid sequence of your protein, or your longest translated
sequence, into the first box below.
5. Choose UniProtKB/Swiss-Prot from the Choose database pull-down menu.
Set any other options as required; otherwise leave them unchanged and the
program will use all default options.
6. Deselect the Do CD-search box.
Scroll down this page to the Format section. In this section use the pull-down
menus to change the Descriptions to 10 and the Alignments to 10. Change the
Layout to One Window. You will leave the Options section settings on the
default values and will address these choices in a more advanced exercise.
7. Click the BLAST button at the bottom or top of the screen.
After a few seconds the result of your BLAST search will appear in a new
window. The first part shows a Graphic View of the matches, followed by a
list of the matches and then the Individual Alignments. Here a number of hits
are displayed. Out of the large number of sequences in the database, the hits
are ranked on the basis of lowest E-value. The hits with the lowest E-values
are the most similar to the query.
Exercise-9
Translated BLAST (Blastx)
URL - http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory:
Blastx searches protein database using a translated nucleotide query. Blastx
uses the BLAST algorithm to compare the six-frame conceptual translation
products of a nucleotide query sequence (both strands) against a protein sequence
database. The BLAST (Basic Local Alignment Search Tool) programs have been
designed for speed to find high scoring local alignments. BLAST uses a heuristic
algorithm which seeks local as opposed to global alignments and is therefore able
to detect relationships among sequences which share only isolated regions of
similarity.
Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the blastx link (translated nucleotide
query against a protein database).
After clicking, a new page appears. This is the sequence submission page.
4. Enter the nucleotide sequence into the Search dialog box.
5. Use the default settings to search the Non-redundant protein sequences (nr)
database.
Set any other options as required; otherwise leave them unchanged and the
program will use all default options. Select the search and format options that
you want for your data output. For some sequences you may get hundreds of
hits, so you may want to limit the number reported on the first search.
Recheck that all the information is correct.
6. To submit the request, click the BLAST button at the bottom or top of the
screen.
After a few seconds the result of your blastx search will appear in a new
window. A number of hits will be displayed. The blastx report is very similar to
the blastn report. The first part shows a Graphic View of the matches,
followed by a list of the matches and then the Individual Alignments. The
BLASTX search with the same sequence shows a significant number of very
good matches. Out of the large number of sequences in the database, the hits
are ranked on the basis of lowest E-value.
Exercise 10:
tBLASTX
URL: http://www.ncbi.nlm.nih.Gov/BLAST
Theory:
TBlastx compares the six-frame translations of a nucleotide query sequence
against the six-frame translations of a nucleotide sequence database using the
BLAST algorithm.
Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the [tblastx] link, which searches a
translated nucleotide database using a translated nucleotide query.
After clicking, a new page appears. This is the sequence submission page.
4. Enter the nucleotide sequence into the Search dialog box.
5. Use the default settings to search the nucleotide collection (nr/nt) database.
Set any other options as required; otherwise leave them unchanged and the
program will use all default options. Select the search and format options that
you want for your data output. For some sequences you may get hundreds of
hits, so you may want to limit the number reported on the first search.
Recheck that all the information is correct.
6. To submit the request, click the BLAST button at the bottom or top of the
submission page.
After a few seconds the result of your BLAST search will appear in a new
window. Out of the large number of sequences in the database, the hits are
ranked on the basis of lowest E-value. The hits with the lowest E-values are the
most similar to the query.
Exercise 11:
PSI-BLAST (Position-Specific Iterated BLAST)
URL: http://www.ncbi.nlm.nih.Gov/BLAST
Theory:
Position-Specific Iterated BLAST (PSI-BLAST) was created in 1997. PSI-BLAST
represents an extension of BLAST where position-specific scoring is used.
What this means is that when looking for word matches in the database, you create
a “profile” or family for the words you are looking for. Once you have found all
matches within a certain significance threshold, you use the obtained profile to
refine the search by repeating the procedure. This allows us to find more distantly
related but still significant matches. The profiles are represented as
position-specific scoring matrices (PSSMs), which play the role that a
substitution matrix plays in an ordinary BLAST search.
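How a profile differs from an ordinary substitution matrix can be illustrated with a toy position-specific scoring matrix. The sketch below is a deliberate simplification and not the PSI-BLAST procedure: it assumes ungapped, equal-length aligned sequences, a uniform background frequency of 1/20 and a fixed pseudocount, and all names and sequences are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_sequences) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned_sequences[0])):
        counts = Counter(seq[pos] for seq in aligned_sequences)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total   # smoothed frequency
            column[aa] = math.log2(freq / background)   # log-odds vs background
        pssm.append(column)
    return pssm


def score(pssm, sequence):
    """Sum the position-specific scores of `sequence` against the profile."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


profile = build_pssm(["MKV", "MKI", "MRV"])
print(round(score(profile, "MKV"), 2), round(score(profile, "AAA"), 2))
```

Unlike a general matrix such as BLOSUM62, the score for a residue here depends on the column in which it occurs, so conserved positions of the family weigh more heavily in the next search round.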
Procedure
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press
‘Enter key’ on your keyboard.
The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the Protein BLAST [blastp] link.
A search page will appear as shown below.
4. Under the Program selection heading, click the PSI-BLAST (Position-Specific
Iterated BLAST) button.
5. Paste your protein sequence in the search window section or simply enter the
GI number of the protein.
6. Choose UniProtKB/Swiss-Prot from the Choose database pull-down menu.
Set any other options as required; otherwise leave them unchanged and the
program will use all default options. Enter the threshold value that determines
how divergent the proteins you are interested in finding may be. The rest of
the parameters are generally left at their default settings.
7. Then click on the BLAST button to initiate the first round of the PSI-BLAST
search.
The time it takes can be longer than what it says on screen. Be patient. An
intermediate page (entitled Reformatting Blast) appears containing a ‘Format’
button.
8. Click this Format button.
A new page, entitled Results of Blast, appears in a new window. This is where
your results will be displayed when ready.
9. Inspect the results.
There are many very similar sequences and only a few distantly related ones.
10. Click on the Run PSI-BLAST iteration 2 button (near the top of the page).
The Reformatting Blast window pops up.
11. Click the Format button on the Reformatting Blast window.
The results will appear in the Results of Blast window. This can be
repeated until convergence is achieved, i.e. until no new sequences are found.
12. Continue repeating Steps 10-11.
The PSI-BLAST output consists of many iterations. Each iteration has a hit
list, the alignments and the parameters used for the analysis. Each iteration
page contains an iteration button to go to the next iteration.
Conclusion
PSI-BLAST is one of the most widely used protein similarity search programs
in the BLAST family. It offers exciting opportunities to discover new types of
relationships in protein databases and can be used to infer the evolutionary
origins of proteins. PSI-BLAST is a highly sensitive homology search program.
Exercise 12:
Sequence alignment through FASTA
URL: http://www.ebi.ac.uk/Tools/sss/fasta/
Theory:
This exercise compares a protein sequence to a protein sequence database using
the FASTA algorithm (Pearson and Lipman, 1988; Pearson, 1996). The service
provides sequence similarity searching against protein databases using the FASTA
suite of programs. FASTA provides a heuristic search with a protein query. FASTX
and FASTY translate a DNA query. Optimal searches are available with SSEARCH
(local), GGSEARCH (global) and GLSEARCH (global query, local database).
Search speed and selectivity are controlled with the ktup (word size) parameter. For
protein comparisons, ktup = 2 by default; ktup = 1 is more sensitive but slower.
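The "optimal local" search offered by SSEARCH is based on Smith-Waterman dynamic programming, which the heuristic FASTA search approximates. The scoring step can be sketched as follows. This is a simplified illustration with an assumed match/mismatch scheme and a single linear gap penalty instead of a substitution matrix with separate gap opening and extension penalties; the function name is made up.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Only the shared core "ACG" contributes to the local score.
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

Every cell of the matrix is filled in, so the cost grows with the product of the two sequence lengths. Heuristics such as the ktup word search avoid most of this work, at the price of possibly missing weak similarities.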
Procedure:
1. To browse the World Wide Web, just open your favourite internet browser
(Internet Explorer, Google Chrome or Mozilla Firefox etc.).
2. In the address bar, type http://www.ebi.ac.uk/Tools/sss/fasta/ and press ‘Enter
key’ on your keyboard.
The FASTA homepage will appear, showing the different options such as
program, database, result, search title, your email, matrix, gap extension,
k-tup, expectation lower value, DNA strand, histogram, mode type, score,
alignment, sequence range, database range, filter and statistical estimate.
3. Under the Basic Program heading, click the Protein link.
4. Select the database from the database pull-down menu.
5. Paste your sequence or upload the file containing the sequence.
6. Set your parameters.
Matrix: The Matrix option sets the scoring matrix which is used for searching
the database.
Gap penalties: There are two options, Gap opening and Gap extension. The
default gap opening penalty is -12 for proteins and -16 for DNA. The default
gap extension penalty is -2 for proteins and -4 for DNA.
Score: The Score option sets the maximum number of reported scores in the
output file.
K-tup: Change this value to set the word length that the search should use.
Strand: This option lets you choose which strand to search against the
respective databank.
Histogram: Setting this option to ‘yes’ displays the search histogram of
the expected frequency of chance occurrence of the database matches found.
Expectation value upper limit and lower limit: These options limit which scores
and alignments are displayed. The default upper limit is 10.0 for protein
searches.
• Sequence range: This option allows the user to specify which region within
the query sequence should be searched.
• Database range: This option sets the sequence range to search within the
databases.
• Moltype: The moltype option is used to choose the molecule type of the
query used for a search.
• Filter: This option can eliminate statistically significant but biologically
uninteresting reports from the FASTA search.
• Statistical estimates: This option is used for statistical calculations.
• Select the required options as needed. Otherwise leave them unchanged and
the programme will use all the default options.
7. Then click the Submit button.
For some proteins you may get hundreds of hits. Therefore, you may want to
limit the number on the first search. Recheck that all the information is
correct. A histogram along with the alignment will appear.
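The gap opening and gap extension penalties set in step 6 can be illustrated with a small Python sketch. It uses the default values quoted above and assumes one common convention, in which the opening penalty covers the first gapped position and each further position adds the extension penalty. FASTA versions differ in how the first position is charged, so treat the formula as an illustration only; the function name is hypothetical.

```python
def affine_gap_penalty(gap_length, gap_open, gap_extend):
    """Total penalty for one gap under an affine scheme.

    Assumed convention: the opening penalty is charged once for the first
    gapped position, and the extension penalty for every further position.
    """
    if gap_length <= 0:
        return 0
    return gap_open + gap_extend * (gap_length - 1)

# Default penalties quoted in this exercise.
PROTEIN = {"gap_open": -12, "gap_extend": -2}
DNA = {"gap_open": -16, "gap_extend": -4}

for length in (1, 3, 10):
    print(length,
          affine_gap_penalty(length, **PROTEIN),
          affine_gap_penalty(length, **DNA))
```

Because extending a gap is cheaper than opening a new one, the alignment favours a few long gaps over many short ones.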
Exercise 13:
Editing and analyzing multiple sequence alignment using Jalview
URL: http://www.jalview.org/
Theory:
Jalview is a free program, written in the Java programming language, for
editing, visualizing and analyzing multiple sequence alignments (MSA). It has a
wide range of functions and is used to view and edit sequence alignments,
analyze them with phylogenetic trees and principal components analysis (PCA)
plots, and explore molecular structures and annotation.
Procedure:
1. Open your favourite internet browser (Internet Explorer, Google Chrome,
Mozilla Firefox, etc.).
2. In the address bar, type http://www.jalview.org/ and press the ‘Enter’ key on
your keyboard.
3. Paste the MSA or unaligned sequences into the sequence window, then click
the Run button so that an initial result page appears.
4. The browser returns a page that loads the Java applet into the memory of
the computer; inside this page the word Jalview appears as a button.
Click the Jalview button to obtain the result.
Jalview can also run offline. To do this, load Jalview onto the computer,
then select the File menu and click the Work Offline option in the internet
browser. In the Jalview window, select File and then click Input Alignment
via Text Box. Paste the MSA in the text box and then select the format that
corresponds to the MSA from the alignment format drop-down menu. Then
click the Apply button.
Result:
➢ In the result page of Jalview, edit a group of sequences using the editing
window.
➢ In the pop-up window, click the Add New Group button and then the Add
Selected IDs button. Then click Apply and then Close to finish.
➢ Then choose Edit and then Group Editing Mode from the main menu, click
anywhere on a sequence, and drag to the left or right to insert or remove
gaps.
➢ Save the alignment produced by Jalview using the following options:
• Choose File and then Output Alignment via Text Box from the Jalview
main menu.
• Then select the alignment format and click the Apply button; a
formatted alignment appears in the window.
• Then open a Microsoft Word document. Select, copy and paste the
alignment from the Jalview text box into the Word document and save the
document.
➢ For publishing the multiple sequence alignment, use the BoxShade utility,
which shades the columns according to their level of conservation and
produces files that are useful for publication.
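The idea of shading columns by their level of conservation can be sketched in Python as follows. This is a simplified illustration and not the algorithm used by Jalview or BoxShade: each column is scored as the fraction of sequences carrying its most common residue, with gaps counted against conservation. The three aligned sequences are hypothetical.

```python
from collections import Counter

def column_conservation(alignment):
    """Score each column as the fraction of sequences sharing its top residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(alignment))
    return scores

msa = ["MKT-L", "MKTAL", "MRTAI"]  # hypothetical aligned sequences
for position, score in enumerate(column_conservation(msa), start=1):
    print(position, round(score, 2))
```

Columns with a score of 1.0 are fully conserved and would receive the strongest shading.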
Conclusion:
Jalview is an online and offline tool for editing and analyzing an MSA. It
produces a well-formatted alignment that can then be used for publishing.
Exercise 14:
Making multiple alignment with T-coffee
URL: http://www.ch.embnet.org/software/Tcoffee.html
Theory:
T-Coffee (Tree-based Consistency Objective Function For alignment
Evaluation) is multiple sequence alignment software that uses a progressive
approach. It generates a library of pairwise alignments to guide the multiple
sequence alignment.
T-Coffee has two main features. First, it provides a simple and flexible
means of generating multiple sequence alignments using heterogeneous data
sources. The data from these sources are provided to T-Coffee via a library of
pairwise alignments. The second main feature of T-Coffee is the optimization
method, which is used to find the multiple alignment that best fits the pairwise
alignments in the input library. It uses a so-called progressive strategy, which
is similar to that used in ClustalW. This has the advantage of being fast and
relatively robust. Unlike a purely progressive method, T-Coffee can consider
information from all of the sequences during each alignment step, not just
those being aligned at that stage.
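The library of pairwise alignments described above can be pictured with a short Python sketch. It is a simplification under stated assumptions: every aligned residue pair is given a weight equal to the percent identity of the pairwise alignment it comes from, and the later library extension and progressive alignment steps are left out. The sequence names, the alignment and the function names are hypothetical.

```python
def aligned_pairs(gapped_a, gapped_b):
    """List (index_a, index_b, is_identical) for residues aligned to each other."""
    i = j = 0
    pairs = []
    for a, b in zip(gapped_a, gapped_b):
        if a != "-" and b != "-":
            pairs.append((i, j, a == b))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs

def primary_library(pairwise_alignments):
    """Weight every aligned residue pair by the identity of its alignment."""
    library = {}
    for (name_a, name_b), (gapped_a, gapped_b) in pairwise_alignments.items():
        pairs = aligned_pairs(gapped_a, gapped_b)
        if not pairs:
            continue
        identity = 100.0 * sum(same for _, _, same in pairs) / len(pairs)
        for i, j, _ in pairs:
            library[(name_a, i, name_b, j)] = identity
    return library

# Hypothetical pairwise alignment of two short sequences.
alignments = {("seqA", "seqB"): ("GARFIELD", "GARF-ALD")}
for constraint, weight in primary_library(alignments).items():
    print(constraint, round(weight, 1))
```

Each entry acts as a weighted constraint saying that two residues should be aligned; the multiple alignment is then chosen to agree with as much of this weight as possible.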
Procedure
1. Open any internet browser (Internet Explorer, Google Chrome, etc.).
2. In the address bar, type NCBI and press Enter; the NCBI home page will
appear.
3. Search for any two or more nucleotide sequences in FASTA format and copy
them into a Microsoft Word page.
4. Open a new browser tab and search for T-Coffee.
5. Point the browser to the T-Coffee server home page.
6. Move the mouse over Make a Multiple Alignment in the table and click
Regular. The multiple alignment page then appears.
7. Enter your e-mail address on that page so that, if the job times out, the
result can be returned by e-mail.
8. Paste the sequences into the alignment box and then click the T-Coffee
button at the top or the bottom of the page to obtain the result.
9. Click the Submit button.
The T-Coffee alignment result will then appear.
Result
T-Coffee returns a table that contains hyperlinks to the results.
The first row of the table is dedicated to the multiple sequence alignment and
includes:
➢ Aln: A text file in the same format as ClustalW alignments.
➢ HTML: A colourised alignment where every residue appears on a
background that indicates the quality of its alignment. Red indicates
high-quality segments while blue indicates untrusted regions.
➢ Pdf: A PDF file that is easy to display and print.
The second row is dedicated to the phylogenetic tree and includes:
➢ Dnd: The guide tree or dendrogram generated by T-Coffee in Newick
format.
➢ Ph: A phylogenetic tree in Newick format, built using the
neighbor-joining method.
➢ Png: A picture of the phylogenetic tree that corresponds to the Ph
file.
Advantages
➢ It often produces more accurate alignments than other progressive methods.
➢ It is equipped with many different tools and modules such as CORE,
M-Coffee and EXPRESSO for structure alignment, evaluation and combining
alignments.
➢ T-Coffee can deal with many input formats, including FASTA, Swiss-Prot
and PIR (Protein Information Resource).
➢ T-Coffee produces sequence alignments in various formats so that they can
be used as input for another program. It also produces a colorized
alignment.
Exercise 15:
Searching Online Mendelian Inheritance in Man (OMIM)
URL: http://www.ncbi.nlm.nih.gov/omim / www.omim.org
Theory:
OMIM is a comprehensive, authoritative compendium of human genes and
genetic phenotypes that is freely available and updated daily. OMIM is authored
and edited at the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins
University School of Medicine, under the direction of Dr. Ada Hamosh. Its official
home is omim.org (according to NCBI).
Procedure:
➢ Open any internet browser (Internet Explorer, Google Chrome or Mozilla
Firefox).
➢ In the address bar, type www.google.com.
➢ In the search box, type OMIM and press Enter.
➢ Different websites with short explanations will appear on a new page.
➢ Study the listed websites and click any one of them, e.g.
http://www.ncbi.nlm.nih.gov/omim or www.omim.org, until you get the
required information.
➢ When you open www.omim.org, you get its home page.
➢ In the search box of that page, type the name of any human gene, for
example insulin, then click Search.
➢ The new page lists entries covering different aspects of that human gene.
➢ Click on the desired entry.
➢ For example, click on #610549 Diabetes Mellitus, Insulin-Resistant,
with Acanthosis Nigricans.
Exercise 16
Studying Protein Structure Databases
URL: http://www.rcsb.org/pdb/home/home.do
http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.cathdb.info/
Theory:
The Protein Data Bank (PDB) is a repository for the three-dimensional
structural data of large biological molecules, such as proteins and nucleic acids.
The data, typically obtained by X-ray crystallography or NMR spectroscopy and
submitted by biologists and biochemists from around the world, are freely
accessible on the Internet via the websites of its member organizations (PDBe,
PDBj, and RCSB). The PDB is overseen by an organization called the Worldwide
Protein Data Bank (wwPDB).
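To give an idea of what the three-dimensional structural data look like, the following Python sketch reads the coordinates from one ATOM record of a legacy PDB-format file. The column positions follow the fixed-column PDB text format as we understand it, the function name is hypothetical, and the record is an invented example built with a format string so that the columns line up.

```python
def parse_atom_record(line):
    """Extract the main fields of one fixed-column ATOM record."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "residue_name": line[17:20],       # columns 18-20
        "chain_id": line[21],              # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }

# Invented example record (not taken from a real PDB entry).
record = "ATOM  {:>5} {:<4}{}{:>3} {}{:>4}{}   {:8.3f}{:8.3f}{:8.3f}".format(
    1, " N", " ", "MET", "A", 1, " ", 27.340, 24.430, 2.614)

atom = parse_atom_record(record)
print(atom["residue_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```

Each atom of the structure has one such record, so the x, y and z values together describe the shape of the molecule.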
Procedure:
1. Open any internet browser (Internet Explorer, Google Chrome, Mozilla
Firefox, etc.).
2. In the address bar, type www.google.com.
3. Then in the search bar, type PDB, SCOP and CATH.
4. Press Enter or click the search button.
5. Different websites with short explanations will appear on a new page.
6. Study the listed websites and open any of them until you find the required
information, then write it down.
Exercise 17
Depositing sequences in database
URL: BankIt [http://www.ncbi.nlm.nih.gov/BankIt/],
Sequin [http://www.ncbi.nlm.nih.gov/Sequin/index.html]
The GenBank sequence database is an annotated collection of all publicly
available nucleotide sequences and their protein translations. This database is
produced at National Center for Biotechnology Information (NCBI) as part of an
international collaboration with the European Molecular Biology Laboratory
(EMBL) Data Library from the European Bioinformatics Institute (EBI) and the
DNA Data Bank of Japan (DDBJ). GenBank and its collaborators receive
sequences produced in laboratories throughout the world from more than 100,000
distinct organisms. GenBank continues to grow at an exponential rate, doubling
every 10 months.
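The doubling time quoted above can be turned into a simple projection. The Python sketch below assumes steady exponential growth with the 10-month doubling time stated in the text; the starting size of 1,000,000 records is a hypothetical figure chosen only for illustration.

```python
def projected_size(initial_size, months, doubling_months=10.0):
    """Size after a number of months under steady exponential growth."""
    return initial_size * 2 ** (months / doubling_months)

start = 1_000_000  # hypothetical number of records at month 0
for months in (0, 10, 20, 60):
    print(months, round(projected_size(start, months)))
```

After 60 months (six doublings) the database would be 64 times its starting size, which shows how quickly storage and search demands rise.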
Direct submissions are made to GenBank using either
1. BankIt [http://www.ncbi.nlm.nih.gov/BankIt/], which is a Web-based form,
or
2. Sequin [http://www.ncbi.nlm.nih.gov/Sequin/index.html], which is a
stand-alone submission program.
Upon receipt of a sequence submission, the GenBank staff assign an
accession number to the sequence and perform quality assurance checks. The
submissions are then released to the public database, where the entries are
retrievable by Entrez or downloadable by FTP. Bulk submissions of Expressed
Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence
(GSS), and High-Throughput Genome Sequence (HTGS) data are most often
submitted by large-scale sequencing centres. The GenBank direct submissions
group also processes complete microbial genome sequences.
Submission Tool:
Direct submissions to GenBank are prepared using one of two submission
tools, BankIt or Sequin.
Exercise 18:
Submitting sequences to GenBank through ‘BankIt’
URL: [http://www.ncbi.nlm.nih.gov/BankIt/]
Theory:
BankIt is a Web-based form that is a convenient and easy way to submit a
small number of sequences with minimal annotation to GenBank. To complete the
form, a user is prompted to enter submitter information, the nucleotide sequence,
biological source information, and features and annotation pertinent to the
submission. BankIt has extensive Help
[http://www.ncbi.nlm.nih.gov/BankIt/help.html] documentation to guide the
submitter. Included with the Help
document is a set of annotation examples that detail the types of information that
are required for each type of submission. After the information is entered into the
form, BankIt transforms this information into a GenBank flat file for review. In
addition, a number of quality assurance and validation checks ensure that the
sequence submitted to GenBank is of the highest quality. The submitter is asked to
include spans (sequence coordinates) for the coding regions and other features and
to include amino acid sequence for the proteins that derive from these coding
regions. The BankIt validator compares the amino acid sequence provided by the
submitter with the conceptual translation of the coding region based on the
provided spans. If there is a discrepancy, the submitter is requested to fix the
problem, and the process is halted until the error is resolved. To prevent the deposit
of sequences that contain cloning vector sequence, a BLAST similarity search is
performed on the sequence, comparing it to the VecScreen
[http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html] database. If there is a
match to this database, the user is asked to remove the contaminating vector
sequence from their submission or provide an explanation as to why the screen was
positive. Completed forms are saved in ASN.1 format, and the entry is submitted to
the GenBank processing queue. The submitter receives confirmation by email,
indicating that the submission process was successful.
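The comparison between the submitted amino acid sequence and the conceptual translation of the coding region can be sketched in Python as follows. This is not BankIt's own code. It assumes a single uninterrupted span on the given strand, 1-based inclusive coordinates, and the standard genetic code; the sequence, span and function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons in TCAG order paired with their amino acids.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_span(nucleotides, start, stop):
    """Translate the span start..stop (1-based, inclusive) up to a stop codon."""
    cds = nucleotides[start - 1:stop].upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X for unknown codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def cds_matches_protein(nucleotides, start, stop, submitted_protein):
    """True when the conceptual translation equals the submitted protein."""
    return translate_span(nucleotides, start, stop) == submitted_protein.upper()

sequence = "GGATGGCCAAGTAACC"  # hypothetical submitted nucleotide sequence
print(translate_span(sequence, 3, 14))
print(cds_matches_protein(sequence, 3, 14, "MAK"))
```

A False result corresponds to the discrepancy described above, where the submitter must correct either the span or the protein sequence before the submission can proceed.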
Exercise 19:
Submitting sequences to GenBank through ‘Sequin’
URL: http://www.ncbi.nlm.nih.gov/Sequin/index.html
Theory:
Sequin is more appropriate for complicated submissions containing a
significant amount of annotation or many sequences. It is a stand-alone application
available on NCBI's FTP [ftp://ftp.ncbi.nih.gov/sequin/] site. Sequin creates
submissions from nucleotide and amino acid sequences in FASTA format with
tagged biological source information in the FASTA definition line. As in BankIt,
Sequin has the ability to predict the spans of coding regions. Alternatively, a
submitter can specify the spans of their coding regions in a five column, tab-
delimited table [http://www.ncbi.nlm.nih.gov/Sequin/table.html] and import that
table into Sequin. For submitting multiple, related sequences, e.g., those in a
phylogenetic or population study, Sequin accepts the output of many popular
multiple sequence-alignment packages, including FASTA+GAP, PHYLIP,
MACAW, NEXUS Interleaved, and NEXUS Contiguous. It also allows users to
annotate features in a single record or a set of records globally.
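The tagged biological source information in a FASTA definition line can be read with a short Python sketch. It assumes the bracketed [modifier=value] style of tagging; the sequence identifier, the cultivar value and the function name are hypothetical examples, so consult the Sequin help for the modifiers actually accepted.

```python
import re

# Matches tags of the form [name=value] in a definition line.
MODIFIER = re.compile(r"\[\s*([^=\]]+?)\s*=\s*([^\]]*?)\s*\]")

def parse_definition_line(line):
    """Return the sequence identifier and a dictionary of source modifiers."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    body = line[1:].strip()
    seq_id = body.split(None, 1)[0] if body else ""
    modifiers = {name: value for name, value in MODIFIER.findall(body)}
    return seq_id, modifiers

defline = ">Seq1 [organism=Oryza sativa] [cultivar=ExampleCultivar]"
seq_id, modifiers = parse_definition_line(defline)
print(seq_id)
print(modifiers)
```

Checking the definition lines in this way before import helps to catch a missing organism name early.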
➢ Open the software by double-clicking its icon; the Welcome page will
appear.
➢ The Sequence Format form asks for the type of submission (single
sequence, segmented sequence, or population, phylogenetic, or mutation
study). For the last three types of submission, which involve comparative
studies on related sequences, the format in which the data will be entered
also can be indicated. The default is FASTA format (or raw sequence), but
various contiguous and interleaved formats (e.g., PHYLIP, NEXUS,
PAUP, and FASTA+GAP) are also supported. These latter formats contain
alignment information, and this is stored in the sequence record.
➢ The Organism and Sequences form asks for the biological data. On the
Organism page, as the user starts to type the scientific name, the list of
frequently used organisms scrolls automatically. (Sequin holds
information on the top 800 organisms present in GenBank.) Thus, after
typing a few letters, the user can fill in the rest of the organism name by
clicking on the appropriate item in the list. Sequin now knows the scientific
name, common name, GenBank division, taxonomic lineage, and, most
importantly, the genetic code to use. (For mitochondrial genes, there is a
control to indicate that the alternative genetic code should be used.) For
organisms not on the list, it may be necessary to set the genetic code
control manually. Sequin uses the standard code as the default. The
remainder of the Organism and Sequences form differs depending on the
type of submission.
Organism and Sequences Form
➢ Advance through the pages that make up each form by clicking on labelled
folder tabs or the Next Page button. After the basic information forms have
been completed and the sequence data imported, Sequin provides a
complete view of your submission, in your choice of text or graphic
format.
➢ At this point, any of the information fields can be easily modified by
double-clicking on any area of the record, and additional biological
annotations can be entered by selecting from a menu.
➢ Sequin has an on-screen Help file that is opened automatically when you
start the program.
➢ Because it is context sensitive, the Help text will change and follow your
steps as you progress through the program. A "Find" function is also
provided.
➢ Sending the Submission - A finished submission can be saved to disk and
E-mailed to one of the databases. It is also a good practice to save
frequently throughout the Sequin session, to make sure nothing is
inadvertently lost. The list at the end of this chapter provides E-mail
addresses and contact information for the three databases.
Exercise 20
Primer designing
URL: http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi
Theory:
Oligo-nucleotides, also referred to as primers, are short single strands of
nucleic acids that are synthesized from either DNA or RNA in order to bind to a
complementary strand. Primers have a target area where they bind and act as the
starting point for polymerase to extend from, and thus determine what segment of
DNA gets amplified. DNA consists of a double stranded helix. One strand of the
DNA is named the “sense” strand and the other strand is the “anti-sense” strand.
These two DNA strands are complements of each other. During PCR, the
denaturing step will break the hydrogen bonds, separating the two strands. This
allows the primers to anneal to the target region on the DNA during the annealing
step. One primer is designed to anneal to the sense strand and the other primer
needs to bind to the anti-sense strand.
When designing primers for PCR it is necessary to take into consideration
how many primers are needed, the length of the primer, the 5’ and
3’ ends, the mutation location in the primer, the primer melting/annealing
temperature, the G-C content, “primer dimer” formation and the distance
between the forward and reverse primers.
Length
The primers need to be between 15 and 30 bases long so that they
are long enough for adequate specificity and short enough to anneal to the
DNA template.
The 5’ and 3’ ends
The primers need to be designed so that the 3’ end of the forward primer
extends toward the reverse primer. The 3’ end of the reverse primer must also
extend toward the forward primer. The 3’ ends of the forward and reverse
primers should face each other from opposite DNA strands. This facilitates the
continued replication of the desired strand of DNA. If the 3’ ends do not
elongate toward each other, replication will not work and a PCR product will
not be obtained.
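The orientation of the two primers can be illustrated with a short Python sketch. It simply takes the forward primer from the start of the target (sense strand) and the reverse primer as the reverse complement of the end of the target, so that both are written 5’ to 3’ and their 3’ ends face each other. The target sequence, the short primer length and the function names are hypothetical, and real primer picking involves the further checks described in this exercise.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

def flanking_primers(target, primer_length=18):
    """Return (forward, reverse) primers, both written 5' to 3'."""
    forward = target[:primer_length]
    reverse = reverse_complement(target[-primer_length:])
    return forward, reverse

target = "ATGCCGTACGTTAGC"  # hypothetical target region (sense strand)
forward, reverse = flanking_primers(target, primer_length=5)
print(forward)
print(reverse)
```

The reverse primer anneals to the sense strand at the far end of the target, so extension from both 3’ ends proceeds across the region to be amplified.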
Primer Melting Temperature
The Primer Melting Temperature (Tm) is important for the annealing phase
of PCR. The preferred Tm is between 50°C and 65°C. The melting temperatures
of the forward and reverse primers should differ by no more than 2°C. The Tm
can be estimated as
Tm = 4°C x (number of G’s + C’s in the primer) + 2°C x (number of A’s + T’s).
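The Tm formula above can be applied with a few lines of Python. The sketch uses only the rule given in the text and the 50°C to 65°C and 2°C limits quoted above; the primer sequences and function names are hypothetical.

```python
def melting_temperature(primer):
    """Estimate Tm in degrees C as 4 x (G + C) + 2 x (A + T)."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    at = primer.count("A") + primer.count("T")
    return 4 * gc + 2 * at

def tm_is_acceptable(forward, reverse, low=50, high=65, max_difference=2):
    """Check the preferred Tm range and the forward/reverse Tm difference."""
    tm_forward = melting_temperature(forward)
    tm_reverse = melting_temperature(reverse)
    in_range = low <= tm_forward <= high and low <= tm_reverse <= high
    return in_range and abs(tm_forward - tm_reverse) <= max_difference

forward = "ATGCATGCATGCATGCAT"  # hypothetical forward primer
reverse = "TACGTACGTACGTACGTA"  # hypothetical reverse primer
print(melting_temperature(forward), melting_temperature(reverse))
print(tm_is_acceptable(forward, reverse))
```

This simple rule is a rough estimate for short primers; primer design programs use more detailed calculations.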
The G-C content
The G-C content of the primer sequence should be relatively high, as it has a
direct relationship with the Tm. The base composition should be about 50%-60%
G-C. The 3’ end of the primer should finish with at least one G or C to promote
efficient annealing due to the stronger bonding.
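Both recommendations on G-C content can be checked with a small Python sketch. The primer sequences and function names are hypothetical; the 50%-60% target comes from the text above.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    if not primer:
        return 0.0
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)

def ends_with_g_or_c(primer):
    """True when the 3' end (last base) of the primer is G or C."""
    return primer.upper().endswith(("G", "C"))

primer = "AGCGTCATGCCTGAAGTC"  # hypothetical primer
print(round(gc_content(primer), 1), ends_with_g_or_c(primer))
```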
Distance between the Forward and Reverse Primers
The forward primer and the reverse primer should be between 300 and 2,000
base pairs apart.
Beware of “Primer Dimer”
Primer dimer is an artefact of PCR in which primers bind to each other or to
themselves instead of the template DNA and thus act as their own template to
make a small PCR product that appears faintly on an electrophoresis gel. To
avoid “primer dimers”, be sure there are not many complementary areas in the
base sequences of your forward and reverse primers where the primer strands
would be able to bind to each other instead of the gene.
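A simple screen for primer dimers can be sketched in Python. The check below asks whether the last few bases at the 3’ end of one primer are complementary to any stretch of the other primer (or of itself). The window of 4 bases is an assumption made for illustration, the primer sequences and function names are hypothetical, and dedicated programs evaluate dimers more thoroughly.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

def three_prime_dimer(primer_a, primer_b, window=4):
    """True when the 3' tail of primer_a can pair with part of primer_b."""
    tail = primer_a.upper()[-window:]
    # The tail pairs (antiparallel) wherever primer_b contains its reverse complement.
    return reverse_complement(tail) in primer_b.upper()

forward = "ATGCATGGATCC"  # hypothetical primers
reverse = "TTGGATAA"
print(three_prime_dimer(forward, reverse))  # forward tail against reverse primer
print(three_prime_dimer(forward, forward))  # self-dimer check
```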
Things to Avoid
➢ To avoid non-specific binding, design the primers with high annealing
temperatures.
➢ To make sure the designed primers will bind only to the target area, submit
the primer sequences to the BLAST website.
➢ The MgCl2 and pH conditions can also be adjusted to improve the amplified
product.
➢ Watch out for runs of a single base (G’s, C’s, A’s or T’s) when
developing primers because they can allow mis-priming.
➢ Keep in mind that the more nucleotide bases a primer is made up of,
the more expensive it is. The shorter the primers are, the less
specificity they have in PCR.
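Runs of a single base can be detected with a short Python sketch. The text does not state how long a run is too long, so the limit of 4 used here is an assumption; the primer sequences and function names are hypothetical.

```python
from itertools import groupby

def longest_single_base_run(primer):
    """Length of the longest stretch of one repeated base."""
    return max((len(list(group)) for _, group in groupby(primer.upper())),
               default=0)

def has_long_run(primer, max_run=4):
    """True when a single-base run reaches max_run (an assumed limit)."""
    return longest_single_base_run(primer) >= max_run

for primer in ("ATGGGGCATCGA", "ATGCGTCATCGA"):  # hypothetical primers
    print(primer, longest_single_base_run(primer), has_long_run(primer))
```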
Resources for General Purpose PCR Primer Design
➢ Primer3
➢ Primer3Plus
➢ PrimerZ
➢ PerlPrimer
Aim: Primer design on the web using Primer3 for the STAR-1 gene in rice
URL: http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi
Procedure
➢ Collect the sequence for which primers are to be designed, in FASTA format,
from the NCBI home page.
➢ Open the source website:
http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi
➢ Paste the sequence in FASTA format into the input box on the home page of
the website.
➢ Keep the default settings and click ‘Pick Primers’ to get the result.