Basic Bioinformatics - S. Ignacimuthu

Basic Bioinformatics
Second Edition
S. Ignacimuthu, s.j.
Alpha Science International Ltd.
Oxford, U.K.
242 pgs. | 54 figs. | 16 tbls.
Director
Entomology Research Institute
Loyola College, Chennai
Copyright © 2013
ALPHA SCIENCE INTERNATIONAL LTD.
7200 The Quorum, Oxford Business Park North
Garsington Road, Oxford OX4 2JZ, U.K.
www.alphasci.com
All rights reserved. No part of this publication may be reproduced, stored
in a retrieval system, or transmitted in any form or by any means, electronic,
mechanical, photocopying, recording or otherwise, without prior written
permission of the publisher.
ISBN 978-1-84265-804-8
E-ISBN 978-1-84265-978-6
Printed in India
Dedicated to
Rev. Fr. Adolfo Nicolas, S.J.
the Superior General
of the Society of Jesus
Preface to the Second Edition
As I thank the readers for their tremendous support for my book 'Basic
Bioinformatics', I am happy to bring out the second edition of this book for the
benefit of the readers. In recent years bioinformatics has been gaining
importance. Being an interface between modern biology and informatics, it
involves the discovery, development and use of computational algorithms
and software tools that facilitate an understanding of the biological processes
with a goal to serve healthcare and other sectors of human endeavours.
From the time Paulien Hogeweg and Ben Hesper coined the word
bioinformatics in 1978 to refer to the study of information processes in biotic
systems, rapid developments have taken place in mapping and analyzing
DNA and protein sequences, developing new databases, aligning different
sequences, comparing them, viewing 3-D models of protein structures,
studying the molecular interaction and carrying out drug discovery analyses.
I am immensely happy to present the revised edition, which includes
up-to-date basic information relating to different areas of bioinformatics
along with some procedures for hands-on experience. I am sure the
students and teachers will greatly benefit from this book.
Preface to the First Edition
Bioinformatics is an interdisciplinary subject. It is the science of using
information to understand biology. In bioinformatics, biology, computer
science and mathematics merge into a single discipline. Strictly speaking,
bioinformatics is a large subset of computational biology, the
application of information technology to the management of biological data.
Biological data are being produced at a phenomenal rate as seen in
genomic repositories of nucleic acid and protein sequences. The three-fold aim
of bioinformatics includes organization and preservation of data, development
of tools and resources, and analysis of data and interpretation of results using
the tools. Thus it is the science of storing, extracting, organizing, analyzing,
interpreting and utilizing biological information.
Since the beginning of the 1990s, many laboratories have been analyzing the full
genomes of several species such as bacteria, yeast, mice and humans. Due to
these collaborative efforts, enormous amounts of data have been collected and
stored in databases, most of which are publicly accessible. These data have to be
analysed in order to understand their relevance. Nucleotide and amino acid
sequences have to be studied. Mining these immense storehouses of
data to secure vital information for research and product development is one
of the activities of bioinformatics.
Bioinformatics not only provides theoretical background and practical
tools for scientists to analyze proteins and DNA but also helps in sequence
homology analysis and drug design. Two principal approaches underpin all
the studies in bioinformatics. The first is that of comparing and
grouping the data according to biologically meaningful similarities and
the second, that of analyzing one type of data to infer and understand the
observations for another type of data. The types of analysis that are carried out
are: alignments, multiple alignments, database searches, detection of signals,
patterns or maps in DNA or protein sequences, open reading frame and secondary
structure prediction.
Keeping in view the wide applicability of bioinformatics in different
areas, it is important to prepare well-trained human resources to face the
challenges of the post-genomic era. This book is intended to give the basics of
biological concepts, biological databases and internet-based bioinformatics
tools. We are hopeful that this book will cater to the immediate needs of
students, researchers, faculty members and pharmaceutical industries.
Acknowledgements
I am thankful to many of my friends who constantly encouraged me to write
this book. I am grateful to Dr. C. Muthu for typesetting the manuscript and
getting it ready for publication. Let me also thank Mr. R. Mahimairaj for
preparing the illustrations and Mr. A. Stalin for helping in verifying the web
addresses. I am indebted to various publishers and authors for permitting me
to use some of the illustrations and the explanations from their books. Let me
congratulate the publishers for their good work.
Contents
Preface to the Second Edition
Preface to the First Edition
Acknowledgements
1. History, Scope and Importance
1.1 Important Contributions
1.2 Sequencing Development
1.3 Aims and Tasks of Bioinformatics
1.4 Application of Bioinformatics
1.5 Challenges and Opportunities
Study Questions
2. Computers, Internet, World Wide Web and NCBI
2.1 Computers and Programs
2.2 Internet
2.3 World Wide Web
2.4 Browsers and Search Engines
2.5 EMBnet and SRS
2.6 NCBI
3. DNA, RNA and Proteins
3.1 Background
3.2 DNA
3.3 RNA
3.4 Transcription and Translation
3.5 Proteins and Amino Acids
4. DNA and Protein Sequencing and Analysis
4.1 Genomics and Proteomics
4.2 Genome Mapping
4.3 DNA Sequencing Method
4.4 Open Reading Frame (ORF)
4.5 Determining Sequence of a Clone
4.6 Expressed Sequence Tags
4.7 Protein Sequencing
4.8 Gene and Protein Expression Analysis
4.9 Human Genome Project
5. Databases, Tools and their Uses
5.1 Importance of Databases
5.2 Nucleic Acid Sequence Databases
5.3 Protein Sequence Database
5.4 Structure Databases
5.5 Bibliographic Databases and Virtual Library
5.6 Specialized Analysis Packages
5.7 Use of Databases
6. Sequence Alignment
6.1 Algorithm
6.2 Goals and Types of Alignment
6.3 Study of Similarities
6.4 Scoring Mutations, Deletions and Substitutions
6.5 Sequence Alignment Methods
6.6 Pairwise Alignment
6.7 Multiple Sequence Alignment
6.8 Algorithms for Identifying Domains within a Protein Structure
6.9 Algorithms for Structural Comparison
6.10 Carrying Out a Sequence Search
7. DNA and Protein Sequences
7.1 Gene Prediction Strategies
7.2 Protein Prediction Strategies
7.3 Protein Prediction Programs
7.4 Molecular Visualization
8. Homology, Phylogeny and Evolutionary Trees
8.1 Homology and Similarity
8.2 Phylogeny and Relationships
8.3 Molecular Approaches to Phylogeny
8.4 Phylogenetic Analysis Databases
9. Drug Discovery and Pharmainformatics
9.1 Discovering a Drug
9.2 Pharmainformatics
9.3 Search Programs
Appendix: List of Important Websites and Web Addresses
Glossary
References
Index
CHAPTER 1
History, Scope and Importance
In its broadest sense, the term bioinformatics can be considered to mean
information technology applied to the management and analysis of biological
data. From 1950 onwards, large amounts of sequence data related to various
living organisms have been collected and stored in databases.
Since it is not very convenient to compare the sequences of several
hundred nucleotides and amino acids by hand, several computational
techniques were developed. Where data can be amassed faster than they can
be analyzed and utilized, there is a great need for professionals who can use
software to digest this ever-growing mass of information.
Definitions
Bioinformatics is defined in various ways. Some of the definitions are as
follows:
(i) Bioinformatics is the use of computers in solving information problems
in life sciences; mainly it involves the creation of extensive electronic
databases on genomes and protein sequences. Secondarily it involves
techniques such as the three-dimensional modeling of biomolecules
and biological systems.
(ii) Bioinformatics is the computational management of all kinds of biological
information, including genes and their products, whole organisms or
even ecological systems.
(iii) Bioinformatics is an integration of mathematical, statistical and
computational methods to analyse biological, biochemical and
biophysical data. It deals with methods of storing, retrieving and
analyzing biological data, such as nucleic acid and protein sequences,
structures, functions, pathways and genetic interactions.
(iv) Bioinformatics is the storage, manipulation and analysis of biological
information via computer science. Bioinformatics is an essential
infrastructure underpinning biological research.
(v) Bioinformatics is the application of computational techniques and
technologies to analyse and maintain biological data.
1.1 IMPORTANT CONTRIBUTIONS

The following is a chronological list of developments that contributed
to the emergence of bioinformatics.
1866 Gregor Mendel published the results on his investigations of the
inheritance of ‘factors’ in pea plants.
1869 F. Miescher discovered DNA (published in 1871); he also suggested
that the genetic information may exist in the form of molecular text
1928 Erwin Schrodinger proposed that this factor is of 1000 angstroms.
1933 Tiselius introduced a new technique known as electrophoresis for
separating proteins in solution.
1938 Astbury and Bell suggested that the bases form the long scroll of DNA
on which is written the pattern of life
1944 Avery et al. established the genetic role of DNA
1947 The first sequencing of a pentapeptide, gramicidin S, was done by
Consden et al.
1949 The A=T and G=C rule was discovered by Chargaff et al.
1951 Pauling and Corey proposed the structure for the alpha helix and
beta-sheet of polypeptide chain of protein.
• Reconstruction of partial 30 residue sequence of insulin by Sanger
and Tuppy.
1952 Rosalind Franklin and Maurice Wilkins used X-ray crystallography to
reveal the repeating structure of DNA
1953 Watson and Crick proposed the double helix model for DNA
1954 Perutz’s group developed heavy atom methods to solve the phase
problem in protein crystallography.
1955 F. Sanger announced the sequence of bovine insulin
1957 Arthur Kornberg produced DNA in a test tube
1958 The first integrated circuit was constructed by Jack Kilby at Texas
Instruments
• The Advanced Research Projects Agency (ARPA) was formed in
USA
1962 Zuckerkandl and Pauling initiated studies on the variability of
sequences and evolution
1963 Ramachandran plot or Ramachandran diagram was developed by
G.N. Ramachandran, C. Ramakrishnan and V. Sasisekharan. They
also discovered the triple helical structure of collagen.
1965 M. Dayhoff observed that many amino acids were replaced in
evolution not in a random way but with specific preferences
1968 Werner Arber, Hamilton Smith and Daniel Nathans described uses of
restriction enzymes
• Packet-switching network protocols were presented to ARPA
1969 Linking computers at Stanford and UCLA created the ARPANET
1970 The details of the Needleman-Wunsch algorithm for sequence
comparison were published.
• A.J. Gibbs and G.A. McIntyre described a new method for comparing
two amino acid and nucleotide sequences using a dot matrix
1971 Ray Tomlinson (BBN) invented the email program
1972 Gatlin offered the first information-theoretical treatment of the
sequence
• Wireframe models of biological molecules were presented by
Levinthal and Katz
• Paul Berg made the first recombinant DNA molecule using ligase
enzyme
• Stanley Cohen, Annie Chang and Herbert Boyer produced the
first recombinant DNA organism
1973 Joseph Sambrook and his team refined DNA electrophoresis technique
using agarose gel
• Stanley Cohen cloned DNA
• Brookhaven Protein Data Bank was announced
• Robert Metcalfe described Ethernet in his Ph.D. thesis
1974 Charles Goldfarb invented SGML (Standard Generalized Markup
Language)
• Vint Cerf and Robert Kahn developed the concept of connecting
networks of computers into an ‘internet’ and developed the
Transmission Control Protocol (TCP).
1975 P.H. O’Farrell announced two-dimensional SDS polyacrylamide gel
electrophoresis
• E.M. Southern published experimental details for Southern Blot
analysis.
• Bill Gates and Paul Allen founded Microsoft Corporation.
1976 Prosite database was reported by Bairoch et al.
• The Unix-To-Unix Copy Protocol (UUCP) was developed at Bell
Labs
1977 Frederick Sanger, Allan Maxam and Walter Gilbert pioneered DNA
sequencing.
• The full description of the Brookhaven PDB was published by F.C.
Bernstein et al.
1978 The first Usenet connection was established between Duke and the
University of North Carolina at Chapel Hill by Tom Truscott, Jim Ellis
and Steve Bellovin
1980 Mark Skolnick, Ray White, David Botstein and Ronald Davis created
RFLP marker map of human genome.
• The first complete genome sequence for an organism (bacteriophage
ΦX174) was published.
• Wüthrich et al. published a paper detailing the use of
multidimensional NMR for protein structure determination.
• IntelliGenetics Inc. was founded in California. Their primary
product was the IntelliGenetics Suite of programs for DNA and
protein sequence analysis.
• The Smith-Waterman algorithm for sequence alignment was
published.
• The US Supreme Court held that genetically modified bacteria are
patentable.
1981 IBM introduced its personal computer to the market
• Human mitochondrial DNA was sequenced
• D. Benson, D. Lipman and colleagues developed a menu-driven
program called GENINFO to access sequence database.
• Maizel and Lenk developed various filtering and color display
schemes that greatly increased the usefulness of the dot matrix
method.
1982 First recombinant DNA – based drug was marketed
• Genetics Computer Group (GCG) was created as a part of the
University of Wisconsin at Wisconsin Biotechnology Center.
1983 The Compact Disk (CD) was launched
• Name servers were developed at the University of Wisconsin
1984 Jon Postel’s Domain Name System (DNS) was placed on-line. Apple
Computer announced the Macintosh.
1985 Kary Mullis invented PCR
• FASTP algorithm was published
• Robert Sinsheimer made the first proposal for Human Genome
Project
1986 Thomas Roderick coined the term Genomics to describe the scientific
discipline of mapping, sequencing and analyzing genes.
• Amoco Technology Corporation acquired IntelliGenetics. The
Swiss-Prot database was created by the Department of Medical
Biochemistry of the University of Geneva and the European
Molecular Biology Laboratory (EMBL)
• Leroy Hood and Lloyd Smith automated DNA sequencing.
• Charles DeLisi convened a meeting to discuss the possibility of
determining the nucleotide sequence of human genome.
• NSFnet debuts
1987 United States Department of Energy (US DoE) officially began the
human genome project.
• The physical map of E. coli was published by Y. Kohara et al.
1988 The use of yeast artificial chromosomes (YACs) was described by David T.
Burke et al.
• Pearson and Lipman published the FASTA algorithm
• The National Center for Biotechnology Information (NCBI) was
established at the National Library of Medicine in the US.
• Perl (Practical Extraction and Report Language) was released by
Larry Wall
• United States National Institutes of Health (US NIH) took over the
genome project with James Watson at the helm.
• The Human Genome Initiative was started
• Des Higgins and Paul Sharp announced the development of
CLUSTAL
• A new program, an internet computer virus designed by a
student, infected 6000 military computers in the USA
1989 NIH established the National Center for Human Genome Research.
• The Genetics Computer Group became a private company
• Oxford Molecular Group Ltd (OMG) founded in Oxford, UK,
created products such as Anaconda, Asp, Cameleon and other
(molecular modeling, drug design, and protein design) products.
1990 The BLAST programme to align DNA sequences was developed by
Altschul et al.
• Michael Levitt and Chris Lee founded Molecular Applications
Group in California.
• InforMax was founded in Bethesda, MD
• The HTTP 1.0 specification was published. Tim Berners-Lee
published the first HTML document.
1991 CERN, Geneva announced the creation of the protocols which make
up the World Wide Web.
• Craig Venter invented expressed sequence tag (EST) technology
• Incyte Pharmaceuticals, a genomics company was formed in
California.
• Myriad Genetics Inc. was founded in Utah with a goal of
discovering major common disease genes and their related
pathways.
• Linus Torvalds announced a Unix-like operating system which
later became Linux.
1992 Human Genome Sciences, Maryland, was formed by William Haseltine
• Craig Venter established The Institute for Genomic Research (TIGR).
• Mel Simon and coworkers (Caltech) invented BACs, crucial for
clone-by-clone gene assembly.
• Wellcome Trust joined the human genome project
1993 Francis Collins took over the Human Genome Project. The Sanger Centre
was opened in the UK. Other nations joined in the effort. 2005 was projected as
the completion year.
• CuraGen Corporation was formed in New Haven, CT.
1994 Netscape Communications Corporation was founded and it released
Navigator.
• Attwood and Beck published the PRINTS database of protein
motifs.
• Gene Logic was formed in Maryland
1995 Researchers at the Institute for Genomic Research published the first
genome sequence of free-living organism: Haemophilus influenzae.
• Patrick Brown and Stanford university colleagues invented DNA
micro-array technology.
• Microsoft released version 1.0 of Internet Explorer
• Sun released version 1.0 of Java and Netscape released version 1.0
of JavaScript; version 1.07 of Apache was released
• The Mycoplasma genitalium genome was sequenced
1996 The genome of Saccharomyces cerevisiae was sequenced.
• International Human Genome project consortium established
‘Bermuda rules’ for public data release.
• Prosite database was reported by Bairoch et al.
• Affymetrix produced the first commercial DNA chips.
• The working draft for XML was released by W3C
• Structural Bioinformatics, Inc. was founded in San Diego, USA
1997 The genome for E. coli was published
• Oxford Molecular Group acquired the Genetics Computer Group.
• LION bioscience AG was founded.
• Paradigm Genetics Inc. was founded in North Carolina, USA
• deCODE genetics mapped the gene linked to pre-eclampsia
1998 The genomes for Caenorhabditis elegans and baker’s yeast were
published
• Craig Venter formed Celera in Maryland
• Inpharmatica, a new genomics and bioinformatics company, was
established by University College London.
• GeneFormatics, a company dedicated to the analysis and
prediction of protein structure and function, was formed in San
Diego.
• The Swiss Institute of Bioinformatics was established as a non-
profit foundation
• NIH began SNP project to reveal human genetic variation.
• Celera Genomics proposed to sequence human genome faster and
cheaper than consortium.
1999 Wellcome Trust formed SNP consortium
• First Human Chromosome sequence was published.
2000 The genomes of Pseudomonas aeruginosa, Arabidopsis thaliana and
Drosophila melanogaster were sequenced.
• Pharmacopeia acquired Oxford Molecular Group.
2001 Science and Nature published annotations and analysis of human
genome by mid February.
2002 More genome sequences of other organisms were published.
• Structural Bioinformatics and GeneFormatics merged
• Full genome sequence of the common house mouse was published
2004 Rat Genome sequencing project consortium completed the genome
sequence of brown Norway laboratory rat.
2005 420,000 VariantSEQr human resequencing sequences were published
on the new NCBI probe database
2007 A set of 12 closely related Drosophilidae species was sequenced
• Craig Venter published the full diploid genome sequence
2008 Leiden University Medical Center deciphered the complete DNA
sequence of a woman
• G.P.S. Raghava from IMTECH, India developed software and
databases for protein structure prediction, genome annotation
and functional annotation of proteins.
All the above mentioned developments have contributed significantly to the
growth of bioinformatics in one way or another.
1.2 SEQUENCING DEVELOPMENT

Before 1945, there was not even a single quantitative analytical method
available for any one protein. However, significant progress with
chromatographic and labeling techniques over the next decade eventually led
to the elucidation of the first complete sequence, that of the peptide hormone
insulin.
The sequence of the first enzyme, ribonuclease, was completed by 1960. By
1965, around 20 proteins with more than 100 residues had been sequenced,
and by 1980, the number was estimated to be around 1500. Today more than
400,000 sequences are available.
Initial Attempts
Initially a majority of protein sequences were obtained by the manual process
of sequential Edman degradation – dansylation. A very important step
towards the rapid increase in the number of sequenced proteins was the
development of automated sequencers which, by 1980, offered a 10⁴-fold
increase in sensitivity compared to the procedure implemented by Edman
and Begg in 1967.
The first complete protein sequence assignment using mass spectrometry
was achieved in 1979. This technique played a vital role in the discovery of the
amino acid γ-carboxyglutamic acid, and its location in the
N-terminal region of prothrombin.
During the 1960s and 1970s scientists found it difficult to develop
methods to sequence nucleic acids. When techniques finally became available,
the first to emerge were applicable only to RNA (ribonucleic acid),
especially transfer RNAs (tRNAs). tRNAs were ideal materials for this early
work, because they were short (typically 74-95 nucleotides in length), and
because it was possible to purify individual molecules.
Advanced Techniques
DNA (deoxyribonucleic acid) consists of thousands of nucleotides and
assembling the complete nucleotide sequence of an entire chromosomal DNA
molecule is a very big task. With the advent of gene cloning and PCR, it
became possible to purify defined fragments of chromosomal DNA. This
paved the way for the development of fast and efficient DNA sequencing
techniques.
By 1977, two sequencing methods had emerged, using chain
termination and chemical degradation approaches. These techniques with
some minor modifications laid the foundation for the sequence revolution of
the 1980s and 1990s and the subsequent birth of bioinformatics.
The polymerase chain reaction (PCR), due to its sensitivity, specificity
and potential for automation, is considered the front-line analytical method
for analyzing genomic DNA samples and constructing genetic maps. Over
the years, incremental improvements in basic PCR technology have enhanced
the power and practice of the technique.
Since the introduction of the first semi-automated sequencer in 1987,
coupled with the development of PCR in 1990 and fluorescent labeling of
DNA fragments generated by the Sanger dideoxy chain termination method,
there have been large-scale sequencing efforts which have contributed
greatly to the growth of sequence data. Technologies for capturing sequence
information have also advanced over time.
In the early 1980s, researchers could use digitizer pens to manually read
DNA sequences from gels. Then came image-capture devices, which were
cameras that digitized the information on gels. In 1987 Steven Krawetz helped
to develop the first DNA sequencing software for automated film readers.
In the early 1990s, J. Craig Venter and his colleagues devised a new
method to find genes. Rather than working directly with chromosomal DNA,
Venter’s group isolated messenger RNA (mRNA) molecules, copied these mRNA
molecules into DNA molecules and then sequenced a part of each DNA
molecule to create expressed sequence tags or ESTs. These ESTs could be used
as handles to isolate the entire gene.
The EST approach also has generated enormous databases of nucleotide
sequences and the development of the EST technique is considered to have
demonstrated the feasibility of high-throughput gene discovery, as well as
provided a key impetus for the growth of the genomics industry.
Sequence Deposits
By the start of 1998, more than 300,000 protein sequences had been
deposited in publicly available non-redundant databases, and the number of
partial sequences in public and proprietary Expressed Sequence Tag (EST)
databases was expected to run into millions. By contrast, the number of 3D
structures in the Protein Data Bank (PDB) was still less than 20,000.
The United States Department of Energy (DoE) initiated a number of
projects in the 1980s to construct detailed genetic and physical maps of the
human genome. Their aim was to determine the complete nucleotide
sequence of human genome and to localize the estimated 30,000 genes.
Work of such a great dimension required the development of new
computational methods for analyzing genetic map and DNA sequence data,
and demanded the design of new techniques and instrumentation for
detecting and analyzing DNA.
To benefit the public most effectively, the projects also necessitated the
use of advanced means of information dissemination in order to make the
results available as rapidly as possible to scientists and physicians. The
international effort arising from this vast initiative became known as the
Human Genome Project (HGP).
Useful Websites
A very useful guide can be found in the website:
http://www.genome.gov/Education/
Overview of the role, history and achievements of the US Department
of Energy in the HGP can be found in the website:
http://genomics.energy.gov/
Genome Annotation Consortium (GAC) provides comprehensive
sequence-based views of a variety of genomes in the form of an illustrated
guide, with progress charts, etc., and it can be found in the website:
http://www.geneontology.org/GO.refgenome.shtml
Mapping and sequencing the genomes of a variety of organisms have
been taken up and this can be found in the website:
http://www.ornl.gov/sci/techresources/Human_Genome/publicat/primer/prim2.html
1.3 AIMS AND TASKS OF BIOINFORMATICS

The underlying principle of bioinformatics is that biological polymers such as
nucleic acids and proteins can be transformed into sequences of
digital symbols. Moreover, only a limited alphabet of symbols is required to
represent the nucleotide and amino acid monomers.
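As an illustration of how small the required alphabet is, the four DNA
nucleotides can each be stored in two bits. The following is a minimal sketch
in Python; the function names and the particular bit assignment are
hypothetical choices made for this example and are not taken from any
standard tool.

```python
# Illustrative 2-bit encoding of DNA. The mapping A=00, C=01, G=10, T=11 is an
# arbitrary assumption for this sketch; any one-to-one mapping would work.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_dna(seq):
    """Pack a DNA string into a single integer, two bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_dna(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


packed = encode_dna("GATTACA")
print(packed, decode_dna(packed, 7))
```

The sequence length has to be stored alongside the packed integer, because
leading A bases are encoded as zero bits and would otherwise be lost.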
This flexibility of analyzing the biomolecules with the help of a limited
alphabet resulted in the flourishing of bioinformatics. The growth and
performance of bioinformatics rely on the developments in computer
hardware and software. The simplest tasks used in bioinformatics concern
the creation and maintenance of databases of biological information.
Essentially bioinformatics has three components: (i) the creation of
databases allowing the storage and management of large biological data sets,
(ii) the development of algorithms and statistics to determine relationships
among members of large data sets and (iii) the use of these tools for the
analysis and interpretation of various types of biological data, including DNA,
RNA and protein sequences, protein structures, gene expression profiles and
biochemical pathways.
Aims
The aims of bioinformatics are as follows:
(i) To organize data in a way that allows researchers to access existing
information and to submit new entries as they are produced.
(ii) To develop tools and resources that aid in the analysis of data.
(iii) To use these tools to analyze the data and interpret the results in a
biologically meaningful manner.
Tasks
The tasks in bioinformatics involve the analysis of sequence information. This
process involves:
• Identifying the genes in the DNA sequences from various organisms.
• Developing methods to study the structure and/or function of newly
identified sequences and corresponding structural RNA sequences.
• Identifying families of related sequences and the development of
models.
• Aligning similar sequences and generating phylogenetic trees to
examine evolutionary relationships.
Besides these, one of the important dimensions of bioinformatics is identifying
drug targets and pointing out lead compounds.
Areas
Bioinformatics deals with the following areas:
(i) Handling and management of biological data including its
organization, control, linkages, analysis and so on.
(ii) Communication among people, projects, and institutions engaged in
the biological research and applications. The communication may
include e-mail, file transfer, remote login, computer conferencing,
electronic bulletin boards, or establishment of web-based information
resources.
(iii) Organization, access, search and retrieval of biological information,
documents, and literature.
(iv) Analysis and interpretation of the biological data through the
computational approaches including visualization, mathematical
modeling, and development of algorithms for highly parallel
processing of complex biological structures.
1.4 APPLICATION OF BIOINFORMATICS

Biocomputing has found its application in many areas. Apart from providing
the theoretical background and practical tools for scientists to explore
proteins and DNA, it also helps in many other ways.
In understanding the meaning of sequences, two distinct analytical
themes have emerged: (i) in the first approach, pattern recognition techniques
are used to detect similarity between sequences and hence to infer related
structures and functions and (ii) ab initio prediction methods are used to
deduce 3D structures and ultimately to infer function directly from the linear
sequence. The direct prediction of protein three-dimensional structure from
the linear amino acid sequence is a major objective of bioinformatics.
1.4.1 Sequence Homology Analysis

One of the driving forces behind bioinformatics is the search for similarities
between different biomolecules. Apart from enabling systematic organization
of data, identification of protein homologues has some direct practical uses.
Theoretical models of proteins are usually based on experimentally solved
structures of close homologues.
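A very simple measure of similarity between two sequences is their percent
identity. The sketch below assumes the two sequences have already been
aligned to equal length, with "-" marking gaps; the function name is
hypothetical, and practical homology searches rely on scoring matrices and
statistical significance rather than on raw identity alone.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the gap-free aligned columns.

    Assumes both inputs are rows of an existing alignment (equal length,
    '-' for gaps). Columns containing a gap in either row are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT", "ACGA"))
```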
Wherever biochemical or structural data are lacking, studies could be
carried out in lower organisms such as yeast and the results can be applied to
homologues in higher organisms such as humans. It also simplifies the
problem of understanding complex genomes by analyzing simple organisms
first and then applying the same principles to more complicated ones. This
would result in identifying potential drug targets by checking homologues of
essential microbial proteins.
1.4.2 Drug Design

The adoption of a bioinformatics-based approach to drug discovery provides
an important advantage. With bioinformatics, genotypes associated with
pathophysiologic conditions could be defined, which might lead to the
identification of potential molecular targets. Given the nucleotide sequence,
the probable amino acid sequence of the encoded protein can be determined
using translation software.
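The translation step can be sketched in a few lines of Python. This is only
an illustration, assuming the standard genetic code, a coding sequence that
starts in frame, and translation that stops at the first stop codon; the
names used are hypothetical and do not refer to any particular translation
package.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a DNA or RNA coding sequence into one-letter amino acids.

    Reads codons from position 0 and stops at the first stop codon.
    Trailing bases that do not form a full codon are ignored.
    """
    dna = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Because the reading frame is rarely known in advance, real translation
software usually reports all six frames (three on each strand).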
Sequence search techniques could then be used to find homologues in
model organisms; and based on sequence similarity it is possible to model the
structure of the specific protein on experimentally characterized structures.
Finally, docking algorithms could design molecules that could bind to the
model structure, leading the way for biochemical assays to test their
biological activity on the actual protein.
1.4.3 Predictive Functions

Through large-scale screening of data, one can address a number of
evolutionary, biochemical and biophysical questions. We can identify (a)
specific protein folds associated with certain phylogenetic groups, (b)
commonality between different folds within particular organisms, (c) the
degree of folds shared between related organisms, (d) the extent of relatedness
derived from traditional evolutionary trees, and (e) the diversity of metabolic
pathways in different organisms.
One can also integrate data on protein functions, given the fact that
particular protein folds are often related to specific biochemical functions.
Combining expression information with structural and functional
classifications of proteins, one can ask whether the frequent occurrence of a
protein fold in a genome is indicative of high expression levels. In
conjunction with structural data,
one can compile a map of all protein-protein interactions in an organism.
1.4.4 Medical Areas

Applications in medical sciences have centered on gene expression analysis.
This usually involves compiling expression data for cells affected by different
diseases and comparing the measurements against normal expression levels.
Identification of genes that are expressed differently in affected cells provides
a basis for explaining the causes of illness and highlights potential drug
targets.
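The comparison described above can be sketched as a log2 fold-change calculation. The gene names, expression values and the threshold of 1.0 (a two-fold change) below are made-up assumptions for illustration; real analyses also require normalization, replicates and statistical testing.

```python
import math


def log2_fold_changes(normal, affected, threshold=1.0):
    """Return {gene: log2(affected / normal)} for genes whose absolute
    log2 fold change is at least `threshold`."""
    changed = {}
    for gene, baseline in normal.items():
        if gene not in affected or baseline <= 0 or affected[gene] <= 0:
            continue  # skip genes without usable measurements
        lfc = math.log2(affected[gene] / baseline)
        if abs(lfc) >= threshold:
            changed[gene] = lfc
    return changed


# Made-up expression levels (arbitrary units) for hypothetical genes.
normal = {"geneA": 100.0, "geneB": 50.0, "geneC": 80.0}
affected = {"geneA": 400.0, "geneB": 55.0, "geneC": 20.0}

for gene, lfc in log2_fold_changes(normal, affected).items():
    direction = "up" if lfc > 0 else "down"
    print(f"{gene}: log2 fold change {lfc:+.1f} ({direction}-regulated)")
```

Genes reported here would only be candidates; whether they explain the illness has to be established experimentally.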
With this knowledge one could design compounds that bind to the expressed
protein. Given a lead compound, microarray experiments can be used to
evaluate responses to pharmacological intervention; they can also help in
providing early tests to detect or predict the toxicity of trial drugs.
If bioinformatics is combined with experimental genomics, advances could
be made that revolutionize future healthcare programs. These include
postnatal genotyping to assess susceptibility or immunity to
specific diseases and pathogens; prescription of a unique combination of
vaccines; minimizing the healthcare costs of unnecessary treatments; and
anticipating the onslaught of diseases later in life, which could lead to
guidance on nutrition and early detection of illness.
In addition, drug-based treatments could be tailored specifically to the
patient and disease, thus providing the most effective course of medication
with minimal side effects. The Human Genome Project will benefit forensic
sciences and the pharmaceutical industry, aid the discovery of beneficial and
harmful genes, and contribute to a better understanding of human evolution,
the diagnosis of disease and disease risks, the genetics of response to
therapy and customized treatment, the identification of drug targets, and
gene therapy.
1.4.5 Intellectual Property Rights

Intellectual Property Rights (IPR) are an essential part of today’s business.
IPRs are the means to protect any intangible asset. Examples of IPR are
patents, copyrights, trademarks, geographical indications and trade secrets.
A patent is an exclusive monopoly granted by the Government to an inventor
over his or her invention for a limited period of time.
Major areas of bioinformatics which need intellectual property
protection are (a) analytical and information management tools (e.g. modeling
techniques, databases, algorithms, software, etc.), (b) genomics and proteomics
and (c) drug discovery/design.
Innovations
The majority of bioinformatics innovations involve applications of
computer-implemented protocols or software for collecting and/or processing
biological data. These inventions fall within the general category of
computer-related inventions, which covers inventions implemented in a
computer and inventions employing computer-readable media. These inventions
have two aspects: (a) software and (b) hardware.
For example, a computer-based system for identifying new nucleotide
sequence clusters from a given set of nucleotide sequences based on sequence
similarity may comprise an input device, a memory and a processor as
hardware components of the system, and a data set or a set of operating
instructions stored in the memory and operable by the processor as the
software of the system. Patent protection would be invaluable in protecting
methods that use computational power, such as sequence alignment,
homology searches and metabolic pathway modeling.
Genomics and Proteomics

Genomics involves the isolation and characterization of a gene and the
assignment of a function or use to the gene sequence, i.e., either expression
of a particular
protein or identification of the gene as a marker for a particular disease. This
work involves a great deal of laboratory experiments as well as
computational techniques. These techniques can also be protected under IPR.
Proteomics involves the purification and characterization of proteins using
technologies like 2D-electrophoresis, multidimensional chromatography and
mass spectrometry. Applying these techniques to characterize a protein and
to establish its relation to a particular disease (i.e. as a marker) is
challenging, time-consuming and needs heavy investment.
Drug design by modeling, which involves computers and computation, can
also be protected under IPR. Table 1.1 gives some examples of patents in
bioinformatics.
Table 1.1. Some examples of patents in bioinformatics

No.  Code Number    Specific title
1.   US 6,355,423   Methods and devices for measuring differential gene expression
2.   US 6,334,099   Methods for normalization of experimental data
3.   US 5,579,250   Method of rational drug design based on ab initio computer simulation of conformational features of peptides
4.   WO 98/15652    DNA sequencing and RNA sequencing using sequencing enzyme
5.   EPI 108779     Spatial structures of at least one polypeptide
6.   EPO 807687     Recombinant protease purification and computer program for use in drug design
1.5 CHALLENGES AND OPPORTUNITIES

There are numerous challenges:
(i) We must be able to deal with increasingly complex data and to
integrate data sources into a single system.
(ii) Diverse types of data must be handled simultaneously to provide a
better understanding of what genes do.
(iii) Data have to be annotated, filtered and visualized better.
(iv) Genomics and gene expression data have to be integrated more
effectively.
(v) Better methods have to be developed to predict protein structures
from sequences.
(vi) Better methods have to be designed to identify drug candidates.
There are numerous opportunities as well:
(i) Trained and skilled bioinformaticists are needed by many
bioinformatics and drug companies.
(ii) Research and academic institutions are looking for trained people.
(iii) Trained people will be useful in the identification of useful genes
leading to the development of new gene products.
(iv) Skilled bioinformaticists will contribute greatly to genomics and
proteomics research.
(v) Bioinformaticists will help in revolutionizing drug development and
gene therapy.
(vi) Bioinformaticists will be able to analyze the patterns of gene
expression with computer algorithms.
(vii) Bioinformaticists will help to understand toxic responses and to
predict toxicity.
STUDY QUESTIONS
1. What is bioinformatics?
2. What was the contribution of Rosalind Franklin and Maurice Wilkins?
3. Who produced the first recombinant DNA organism?
4. Who invented the E-mail program?
5. When was the Compact Disc (CD) launched?
6. Who developed BLAST program?
7. Who published PRINTS database?
8. In which year were the human genome annotations published?
9. Write a short history on sequencing.
10. What are the aims of bioinformatics?
11. What are the tasks in bioinformatics?
12. What are the various applications of bioinformatics?
13. What is a patent?
14. Give some examples of patents in bioinformatics.
CHAPTER 2
Computers, Internet, World Wide Web and NCBI
Computers are now an integral part of the biological world, and without
them advances in biology and medicine would undoubtedly be
hindered greatly. Computers are essential for the management of
ever-growing biological data.
The Internet is a communication revolution, and the Web has been
instrumental in making the Internet a success. It allows the user to move
freely anywhere on this information highway, the single largest source of
information. Computers handle large quantities of data and help in probing
the complex dynamics observed in nature.
The data can be organized in flat files and spreadsheets. They can also be
stored in hierarchical files and relational files.
2.1 COMPUTERS AND PROGRAMS

A computer is an electronic machine that is used to store information and
process it in binary form. It can perform mathematical operations and
symbol processing. A computer is made up of transistors, capacitors and
resistors. Bioinformatics would not be possible without advances in
computing hardware and software. Fast and high-capacity storage media are
essential to store information. Information retrieval and analysis require
programs.
Software is a collective term for the various programs that can run on
computers. Hardware refers to physical devices such as the processor, disk
drives and monitor. Software is divided into two categories: system software
and application software. System software comprises the computer’s operating
system and any other programs required to run applications, while
application software is installed by the user for specific purposes.
Computer programs are written in a variety of programming
languages: machine code, assembly languages and higher-level languages.
Programs written in assembly or higher-level programming languages must
be converted into machine code by assembly or compilation, respectively.
In Windows, files in machine code are known as executable files; in
UNIX systems they are known as executable images. These are run directly by
the computer’s processor. Scripts are files executed by another program.
Microsoft Visual Basic, JavaScript and PERL are scripting languages.
Programming Languages
There are many programming, scripting and markup languages which are
popular with bioinformaticists. HTML is a language used to specify the
appearance of a hypertext document, including the positions of hyperlinks.
HTML is not a programming language.
JavaScript is a popular scripting language that adds to the functionality
of hypertext documents, allowing web pages to include such features as pop-up
windows, animations and objects that change in appearance when the mouse
cursor moves over them.
Java is a versatile and portable programming language that is designed
to generate applications that can run on all hardware platforms. The syntax
of Java is derived largely from C and C++. Java is different from JavaScript.
Java applets can be embedded in hypertext documents. PERL (Practical
Extraction and Report Language) is a versatile scripting language which is
widely used in the analysis of sequence data. XML (Extensible Markup
Language) allows files to be described in terms of the type of data they
contain.
PERL and PYTHON are the most suitable languages for the work of
bioinformatics due to their efficiency and ability to meet diverse functional
requirements of the field. PERL was invented by Larry Wall, drawing on
languages like sed, awk, UNIX shell and C.
PERL does excellent pattern matching, has a flexible syntax (grammar) and
requires less code for a given task. It is good at string
processing, i.e. doing things like sequence analysis and database management.
It takes care of memory allocation. It integrates smoothly with UNIX-based
systems. It is available free on the Internet to copy, compile and use. PERL
can be downloaded from its home page: http://www.perl.org/.
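To illustrate the kind of pattern matching on sequence strings described above, here is a minimal sketch. It is written in Python rather than PERL so that all examples in this chapter use one language; Python's re module offers comparable regular expressions. The function name find_motif and the example sequence are assumptions made for illustration; GAATTC and GGATCC are the recognition sites of the restriction enzymes EcoRI and BamHI.

```python
import re


def find_motif(sequence: str, pattern: str):
    """Return (1-based start position, matched text) for each
    non-overlapping match of a regular expression in a sequence."""
    return [(m.start() + 1, m.group()) for m in re.finditer(pattern, sequence)]


# Made-up DNA sequence containing two EcoRI sites and one BamHI site.
dna = "TTGAATTCAGGATCCAAGAATTCG"

print(find_motif(dna, "GAATTC"))         # exact EcoRI site
print(find_motif(dna, "GAATTC|GGATCC"))  # either site, using alternation
```

Character classes such as [AG] allow degenerate motifs to be expressed just as compactly.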
PYTHON is a complete object-oriented scripting language developed by
Guido van Rossum and first released in 1991. It has tools for quick and easy
generation of graphical user interfaces, a library of functions for
structural biology and a mature library for numerical methods.
Bioinformatic Sequence Markup Language (BSML) graphically describes
genetic sequences and methods for storing and transmitting encoded sequence
and graphic information. Biopolymer Markup Language (BIOML) is a data
type definition for the annotation of molecular biopolymer sequence
information and structure data.
Operating Systems
The operating system is a master program that manages all peripheral
hardware and allows other software applications to run. BIOS (Basic Input-
Output System) is a low-level operating system which is largely or entirely in
firmware (i.e. software stored in read-only memory).
BIOS handles activities such as deciding what to do when the computer
is switched on after a cold start, reading and writing to disks, responding to
input, displaying readable characters on the monitor and producing
diagnostics. The higher-level operating system then takes over, and the
computer acquires a typical graphical user interface (GUI) such as Windows.
Files that contain instructions for the operating system are called batch files
in Windows and Shell scripts in UNIX systems.
Windows, owned by Microsoft Corporation, is the most familiar operating
system on home and office PCs. Most commercial workstations and servers
run under variations of an operating system called UNIX. GNU and LINUX
conform to the UNIX standard.
The operating system gives the user access to the available files
and programs. UNIX is a powerful operating system for multi-user
computing environments. The software that powers the web was invented on
UNIX. UNIX is rich in commands and possibilities, which include
everything from networking software to word processing software and from
e-mail to newsreaders. It also allows programs installed on UNIX systems
to be downloaded freely. UNIX has many varieties and
versions.
LINUX is regarded as an open source version of UNIX, as it can be
downloaded and installed free of cost. Under LINUX, PCs prove to be
highly flexible and useful workstations. It also comes with important
packages for computational biology. IBION is a recent, complete and
self-contained bioinformatics system. It is a ground-breaking server, an
appliance for bioinformatics that combines an Apache web server, a PostgreSQL
relational database and the R statistical language on an Intel-based hardware
system with preinstalled LINUX and a comprehensive suite of bioinformatics
tools and databases.
Usually computer software is obtained on floppy disks or compact
disks (CDs). A file is downloaded when it is copied from a remote source
onto a local computer. A file is uploaded when it is copied from a computer’s
hard drive to a remote source. Downloading from the internet is achieved in
the following three ways: (i) directly from a hypertext document, (ii) from an
FTP server or (iii) by e-mail.
2.2 INTERNET
The interplay between the Internet, the World Wide Web, and the global
network of biological information and service providers has made the
bioinformatics revolution possible. The Internet is a global network of
computers and computer networks that links government, academic and
business institutions. This allows computers to talk to each other in their own
electronic languages. Biological information is stored on many different
computers around the world. The easiest way to access this information is to
join all those computers in a network.
Computers are connected in a variety of ways, most commonly by
telephone cables and satellite links, thus allowing data to be exchanged
between remote users. In order to function effectively, the networks share a
communication protocol called Transmission Control Protocol/Internet
Protocol, better known as TCP/IP. TCP determines how data are broken into
packets and reassembled. IP determines how the packets of information are
addressed and routed over the network. Such a shared pattern of
communication means that different types of machines are able to speak to
each other in a common way.
Computers within the network are referred to as nodes, and these
communicate with each other by transferring data packets. For transfer, data
are first broken into small packets (units of information), which are sent
independently and reassembled when they arrive at their destination. But
packets do not necessarily travel directly from one machine to another; they
may pass through several computers en route to their final destination. Even
if some of the nodes on the way are down, the network protocols are designed
to find an alternative route, since many different routes are available.
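The splitting and reassembly of data can be sketched in a few lines. This is a deliberately simplified model, not an implementation of TCP: the function names, the packet size of 8 bytes and the message are assumptions, and real TCP additionally uses byte-based sequence numbers, acknowledgements, checksums and retransmission.

```python
import random


def to_packets(message: bytes, size: int):
    """Split a message into (sequence number, payload) packets."""
    return [
        (seq, message[start:start + size])
        for seq, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets):
    """Rebuild the message by ordering the packets on their sequence number."""
    return b"".join(payload for _, payload in sorted(packets))


message = b"ATGGCCAAGTGA sent across the network"
packets = to_packets(message, size=8)

# Packets travel independently and may arrive in any order.
random.shuffle(packets)

print(reassemble(packets) == message)
```

Because every packet carries its own sequence number, the receiver does not need the packets to follow the same route or to arrive in order.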
Access
The Internet provides a means to distribute software and enables researchers
to perform sophisticated analysis on remote servers. Till the late 1980s, there
were mainly three ways of accessing databases over the Internet: electronic
mail servers, File Transfer Protocol (FTP) and TELNET servers. E-mail serves
as a means of communicating text messages from one’s computer to some
other computer. FTP is a means of transferring computer files such as
programs from remote machines. TELNET is an internet protocol that allows
the user to connect to computers at remote locations and use these computers
as if they were physically operating the remote hardware.
Electronic mail services allow researchers to send an electronic mail
query to the mail server’s Internet address. The researcher’s query is then
processed by the server, and the result is sent back to the sender’s
mailbox. However, this method had its disadvantages, such as error-prone
querying and long response times. With File Transfer Protocol, the researcher
could download entire databases and search them locally. This too had a
drawback: the researcher had to download each database again after
each update.
TELNET allows a user to remotely log onto a computer and access its
facilities. This method is useful for occasional queries. This has its own
disadvantages such as extensive management of user identifications and
overloading of remote computer’s processing power.
Origin
The true origins of the Internet lie with a research project on networking at
the Advanced Research Projects Agency (ARPA) of the US Department of
Defense in 1969, named ARPAnet. The original ARPAnet connected for the
first time four nodes at different sites in the western United States, with the
immediate goal of rapid exchange of scientific data on defense-related research
between laboratories.
In 1981, BITnet (Because It’s Time) was introduced, providing point-to-
point connections between universities for the transfer of electronic mails and
files. In 1982, ARPA introduced TCP/IP, allowing different networks to
be connected to, and to communicate with, one another.
Address
Once the machines on a network have been connected to one another, there
must be an unambiguous way to specify a single computer so that messages
and files actually find their intended recipient. To facilitate communication
between nodes, each computer on the Internet is given a unique, identifying
number (its IP address). An IP address is unique, identifying only one machine.
It is encoded in a dotted decimal format. For example, one node on the
internet might have the IP address: 130.14.25.1. These numbers represent the
particular machine, the site where the machine is located, and the domain
(and sub domain) to which the site belongs. These numbers help computers
in directing data.
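The dotted decimal notation is simply a readable way of writing a 32-bit number, as the following sketch shows using Python's standard ipaddress module and the example address from the text. The helper name describe_address is an assumption made for illustration.

```python
import ipaddress


def describe_address(dotted: str):
    """Return the four decimal fields, their 8-bit binary forms and the
    single 32-bit integer that an IPv4 address represents."""
    address = ipaddress.IPv4Address(dotted)  # raises ValueError if malformed
    octets = [int(part) for part in str(address).split(".")]
    bits = ".".join(f"{octet:08b}" for octet in octets)
    return octets, bits, int(address)


octets, bits, as_integer = describe_address("130.14.25.1")
print(octets)      # the four decimal fields
print(bits)        # each field written as 8 bits
print(as_integer)  # the same address as one 32-bit number
```

Each of the four fields fits in one byte, which is why no field can exceed 255.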
An alternative, hierarchical domain-name system has also been
implemented, which makes Internet addresses easier to decipher. For
example, ncbi.nlm.nih.gov represents the above numbers and denotes the
National Center for Biotechnology Information (NCBI) at the National Library
of Medicine (NLM) of the National Institutes of Health (NIH), a government
site (gov).
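The hierarchy encoded in a domain name can be made explicit by reading its labels from right to left. The sketch below only manipulates the text of the name; it performs no network lookup, and the function name domain_hierarchy is an assumption.

```python
def domain_hierarchy(hostname: str):
    """List the domains a host name belongs to, from the most general
    (top-level domain) to the most specific (the full name)."""
    labels = hostname.lower().rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


for level, domain in enumerate(domain_hierarchy("ncbi.nlm.nih.gov")):
    print("  " * level + domain)
```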
A complete list of domain suffixes, including country codes, can be found
at http://www.christcenteredstore.com/international_domain_extensions_and_suffixes.htm
and http://iwantmyname.com/domains/domain-name-registration-list-of-extensions.
Connectivity
Normally we can get connected to the Internet through a modem, which uses
the existing twisted-pair copper cables that carry telephone signals to
transmit data. Data transfer rates using a modem are relatively slow (28.8 to
56 kilobits per second [kbps]). A number of newer technologies are available
for faster transfer of data. Integrated Services Digital Network (ISDN) is one
such technology, but it is costly.
Other cost-effective alternatives include television coaxial cables that
are not being used to transmit television signals and are hence free to
transmit data at high speed (4.0 megabits per second [Mbps]). Later, digital
subscriber line (DSL) with high speed (up to 7 Mbps) and asymmetric DSL (ADSL)
became available. Some of the newer technologies involve wireless and satellite
connections to the Internet.
Most people commonly use the Internet for electronic mail (e-mail),
newsgroups, file transfer and remote computing. E-mail deals with
communication between individuals; newsgroups are concerned with
discussion among groups of users; and file transfer and remote
computing involve the use, for example, of the File Transfer Protocol (FTP) to
transfer files between machines, and of the Telnet protocol, by which users may
connect to computers at different sites and use the machines as if physically
present at the remote location.
The most exciting use of the Internet is communication between users in
real time. This includes the UNIX talk protocol (or VMS phone), which is
analogous to holding a telephone conversation, but users speak to each other
by typing into a shared screen. An extension of this concept is conferencing,
whereby groups of people meet and ‘talk’ to each other, again by typing into
a shared interface.
2.3 WORLD WIDE WEB

The World Wide Web (www) is a way of exchanging information over the
Internet using a program called a browser. www was conceived and
developed in 1989 at the European Organization for Nuclear Research (CERN),
the European laboratory for particle physics, to allow information sharing
between internationally dispersed groups in the high-energy physics
community. This led to a medium through which text, images, sounds and
videos could be delivered to users on demand, anywhere in the world.
The concept of information sharing between remote locations, and the
ramifications for rapid data dissemination and communication, found
immediate applications in numerous other areas. As a result, the web spread
quickly and is now making a profound impact in the field of bioinformatics.
Today, the www is the most advanced information system deployed on the
Internet. The web is a hypermedia-based information system. It has become
so popular and powerful, that it has almost become synonymous with the
Internet itself. www is a collection of web pages from all over the world.
The introduction of GOPHER and WAIS (Wide Area Information Server)
in the early 1990s widened the choice of methods for accessing databases. The
World Wide Web (www), invented by Tim Berners-Lee at CERN in 1990, replaced
both these protocols. www greatly enhanced the power of cross-referencing by
providing active integration of databases over Internet, thus eliminating the
need to download and maintain local copies of databases.
With this a researcher could easily navigate across database entries
through active hypertext cross references with the guarantee to retrieve the
latest information. The first molecular biology web server to be set up was
ExPASy (Expert Protein Analysis System), established in 1993 by Geneva
University Hospital and the University of Geneva.
Web Pages and Websites

Web pages are the documents that appear in the web browser window when
we surf the www. Each document displayed on the web is called a web page,
and all of the related web pages of a particular server are collectively called a
website. Their content is similar to plain text documents, except that they are
much more flexible as they may contain links to other pages and files around
the world.
A website is a collection of related web pages stored on one computer.
Each website on the Internet has a unique address. The most important feature
of a web page is its links. A link in a web page allows one to jump to another
page anywhere in the current website or even to a page on another
website anywhere in the world.
The greatest asset of the www is its simplicity, providing access to static
pages with highlighted text that can by a click of the mouse allow users to
traverse related pages of widely dispersed information.
Object Web
Object web is designed to support highly functional and interactive systems.
It is a multi-tier architecture that contains two objects and a communication
layer. One object may represent the user interface, and the other may provide
some computation. To communicate between the two objects, it is necessary
to define the messages they might receive.
The messages between two or more objects are mediated by a special
piece of code, an Object Request Broker (ORB), on each machine, capable of
understanding the message definitions and able to translate them into the
specific language of each object. With the object web, a system can be broken
down into its constituent components, written in different languages and
running on different hardware systems.
The Common Object Request Broker Architecture (CORBA) provides the
standards that make this communication possible. It provides a language to
define the structure of the messages, the Interface Definition Language (IDL),
and the architecture for the mediators, the ORBs. ORBs transparently hide all
of the communication between distributed objects, and form the backbone
(the wiring) of the object web.
2.4 BROWSERS AND SEARCH ENGINES

The full potential of the Internet was realized only with the advent of browsers,
which for the first time allowed easy access to information at different sites.
Browsers are clients that communicate with servers, using a set of standard
protocols and conventions.
The first point of contact between a browser and a server is the home
page. Once the browser has loaded its initial page, it then provides an easy to
use interface with which to retrieve documents, access files, search databases,
and so on. Some of the most commonly used browsers are Firefox, Chrome,
Safari, Opera, Lynx, Mosaic, Netscape Navigator and Internet Explorer.
Search engines are programs that help users to launch searches. There are many
general-purpose search engines, such as Google, Yahoo and Bing (from
Microsoft), which are very useful.
Lynx and Mosaic
Lynx was developed in the Academic Computing Services at the University of
Kansas, USA as part of an effort to construct a campus-wide information
system. It runs on UNIX or VMS operating systems, providing a text-only
interface via low-cost dumb display devices, such as the ubiquitous VT 100
terminal (or emulator).
Mosaic was developed in 1993 at the National Center for
Supercomputing Applications (NCSA), University of Illinois at
Urbana-Champaign, USA. As a hypermedia system designed for X-windows, Apple
Mac and Microsoft Windows platform, it provided a single, user-friendly
interface to the diverse protocols, data formats and information servers
available throughout the Internet.
Netscape Navigator and Internet Explorer

Netscape Navigator was developed in 1994 by Netscape Communications
Corporation, Mountain View, California, USA. It was prepared as an
alternative to Mosaic and became one of the most popular packages for
browsing information on the Internet. Later versions of the software included
facilities such as Internet e-mail, frames, real-time communication,
audio-video support and technology to support the creation of visually
exciting, fully interactive pages (e.g. with Java applets).
Internet Explorer was developed in 1995 by Microsoft Corporation,
Redmond, USA. It was based on NCSA Mosaic and is designed to work with
PC-based operating systems. It offers the familiar functionality of other
hypermedia browsers, including support for frames, Java and ActiveX.
Users can navigate by clicking on specific text, buttons, or pictures. These
clickable items are collectively known as hyperlinks.
Hyperlinks
Hyperlinks are usually characterized by being highlighted in some way,
either by using a different color from the main body of the text or by being
boxed etc. Selecting a highlighted link calls up the linked document,
regardless of its location, whether on the same server, or on a server in a
different country. Communication between hyperlinks is transparent.
Each hypertext document has a unique address known as a uniform
resource locator (URL). URLs take the format http://restofaddress. The
communication protocol used by web servers is the Hypertext Transfer
Protocol, or http. The rest of the address provides a location for the
hypertext document on the Internet.
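The parts of a URL can be separated with Python's standard urllib.parse module, as in this minimal sketch. The helper name describe_url and the dictionary keys are assumptions; the URL is one that appears elsewhere in this chapter, and no network connection is made.

```python
from urllib.parse import urlsplit


def describe_url(url: str):
    """Split a URL into its protocol, host name and document path."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,  # e.g. http
        "host": parts.hostname,    # the server holding the document
        "path": parts.path,        # location of the document on that server
    }


print(describe_url("http://www.embnet.org/about/members"))
```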
HTML
Hypertext documents are written in a standard markup language known as
Hyper Text Markup Language or HTML. HTML code is strictly text-based, and
any associated graphics or sounds for that document exist as separate files in
a common format. Markup instructions permit the web author to render text in
bold type (the <B> symbol), to insert horizontal rules (<HR>), images
(<IMG>), and so on; paired modes such as bold are switched off with the
relevant </> symbol (e.g. </B>).
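How a program reads such markup can be sketched with Python's standard html.parser module. The class name TagLister, the sample page and the image file name helix.gif are hypothetical. Note that the parser reports tag names in lower case and that only the paired <B> tag produces a closing event.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record opening tags, closing tags and text in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


# Hypothetical page fragment using the three tags mentioned in the text.
page = "<B>Basic Bioinformatics</B><HR><IMG SRC='helix.gif'>"

parser = TagLister()
parser.feed(page)
parser.close()

for event in parser.events:
    print(event)
```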
Another technology to support the creation of a functional genetic data
warehouse is XML. XML stands for Extensible Markup Language. XML, like
HTML, can be used to build web pages. XML tags data in a way that any
application can use. It provides a general language for representing data in
a standard format. It allows files to be described in terms of the types of
data they contain.
XML is more flexible and robust. It provides the method for defining
the meaning or semantics of the document. It has the advantage of
controlling not only how data are displayed on a www page, but also how
the data are processed by another program or by a database management
system (DBMS).
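As a minimal sketch of how tagged data can be processed by another program, the example below reads a small XML record with Python's standard xml.etree.ElementTree module. The element and attribute names (sequence, organism, residues, id, type) are invented for illustration and do not follow BSML, BIOML or any other published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tags say what each piece of data means,
# not how it should be displayed.
record = """
<sequence id="seq1" type="dna">
  <organism>Saccharomyces cerevisiae</organism>
  <residues>ATGGCCAAGTGA</residues>
</sequence>
"""


def read_record(xml_text: str):
    """Extract the fields of one sequence record into a dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "type": root.get("type"),
        "organism": root.findtext("organism"),
        "residues": root.findtext("residues"),
    }


entry = read_record(record)
print(entry["organism"], len(entry["residues"]), "bases")
```

Because the meaning of each field is carried by its tag, the same file could equally be loaded into a database or rendered on a web page.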
2.5 EMBNET AND SRS

Computers store sequence information as simple rows of sequence characters
called strings. Each character is stored in binary code in the smallest
addressable unit of memory, called a byte. Each byte comprises 8 bits, with
each bit having a possible value of 0 or 1, producing 256 possible
combinations. A DNA sequence is usually stored and read in the computer as a
series of 8-bit words in this binary format. A protein sequence appears as a
series of 8-bit words comprising the corresponding binary form of amino acid
letters. Normally DNA and protein sequences are stored in standard ASCII
files, commonly in FASTA format.
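Both ideas can be sketched briefly: the 8-bit ASCII pattern behind each sequence letter, and a minimal reader for FASTA text, in which a header line beginning with ">" is followed by one or more lines of sequence. The function names and the two records are made-up assumptions; production code would normally rely on an established library.

```python
def to_bits(sequence: str):
    """Return the 8-bit ASCII pattern of each character in a sequence."""
    return [format(byte, "08b") for byte in sequence.encode("ascii")]


def read_fasta(text: str):
    """Parse FASTA-formatted text into {header: sequence}."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()  # join wrapped sequence lines
    return records


# Made-up records for illustration.
fasta_text = """>seq1 hypothetical DNA fragment
ATGGCC
AAGTGA
>seq2 hypothetical peptide
MAK
"""

print(to_bits("ACGT"))
print(read_fasta(fasta_text))
```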
A network was established in 1988 to link European laboratories that
used biocomputing and bioinformatics in molecular biology research. The
network, known as EMBnet, was developed to provide information, services
and training to users in dispersed European laboratories, via designated
nodes operating in their local languages. Later this establishment removed
the necessity for individual institutions to keep up-to-date copies of a range
of biological databases, to install search tools, to buy expensive commercial
software packages, etc.
Nodes and Sites

EMBnet now operates 34 nodes. Of these, 20 are designated National Nodes.
These have a mandate within their respective nations to provide databases,
software and online services (including sequence analysis, protein modeling,
genetic mapping, etc.), to offer user support and training and to undertake
research and development. Eight EMBnet nodes are Specialist Nodes. These are
academic, industrial or research centers that are considered to have
particular knowledge of specific areas of bioinformatics. They are largely
responsible for the maintenance of biological databases and software.
A further six sites have been accepted within EMBnet as Associate
Nodes. These are biocomputing centers from non-European countries that
serve their user communities with the same kinds of service, as might a
typical National Node. Most of these offer up-to-date access to sequence
databases and analysis software, together with a variety of tools for molecular
modeling, genome management, genetic mapping and so on. Table 2.1 gives a
list of EMBnet Associate Nodes.
Table 2.1: EMBnet Associate Nodes
Abbreviation         Country                        Site
MIPS/GSF             Germany                        http://mips.gsf.de/
CPGR                 South Africa                   http://www.cpgr.org.za
Other EMBnet nodes   All National, Specialist       http://www.embnet.org/about/members
                     and Associate Nodes
Sequence Retrieval System

The Sequence Retrieval System (SRS) is a network browser for databases in
molecular biology. It was developed to help EMBnet users. SRS allows any
flat-file database to be indexed to any other. Its advantage is that the derived
indices may be rapidly searched, allowing users to retrieve, link and access
entries from all the interconnected resources. SRS can be readily customized
to use any defined set of databanks.
SRS links nucleic acid, EST, protein sequence, protein pattern,
protein structure, specialist/boutique and/or bibliographic databases. SRS is
thus a very powerful tool, allowing users to formulate queries across a range
of different database types via a single interface, without having to worry
about underlying data structures, query languages and so on.
SRS is an integrated system for information retrieval from many
different sequence databases, and for feeding the sequences retrieved into
analytical tools such as sequence comparison and alignment programs. SRS
can search a total of 141 databases of protein and nucleotide sequences,
metabolic pathways, 3D structures and functions, genomes, disease and
phenotype information. These include many small databases such as the
Prosite and Blocks databases of protein structural motifs, transcription factor
databases, and databases specialized to certain pathogens.
In addition to the number and variety of databases to which it offers
access, SRS offers tight links among the databases, and fluency in launching
applications. A search in a single database component can be extended to a
search in the complete network, i.e., entries in all databases pertaining to a
given protein can be found easily. Similarity searches and alignments can be
launched directly without saving the responses in an intermediate file. The
parent URL of SRS is: http://srs.ebi.ac.uk/
2.6 NCBI
The National Center for Biotechnology Information (NCBI) was established in
1988 in the USA as a division of the National Library of Medicine and is located
on the campus of the National Institutes of Health in Bethesda, Maryland.
The role of the NCBI is to develop new information technologies to
aid our understanding of the molecular and genetic processes that
underlie health and diseases. Its specific aims include the creation of
automated systems for storing and analyzing biological information, the
development of advanced methods of computer-based information
processing, the facilitation of user access to databases and software, and the
coordination of efforts to gather biotechnology information worldwide.
NCBI also maintains GenBank, the NIH DNA sequence database.
Groups of annotators create sequence data records from the scientific
literature and from information acquired directly from authors, and data
are exchanged with the international nucleotide databases, EMBL and DDBJ.
All resources are available from the NCBI home page www.ncbi.nlm.nih.gov.
Entrez
Entrez is NCBI’s integrated, text-based search and retrieval system. Just as SRS
was developed for EMBnet, Entrez was developed at NCBI to allow retrieval of
molecular biology data and bibliographic citations from NCBI’s integrated
databases. Entrez permits related records in different databases to be linked
to each other, whether or not they are cross-referenced directly.
Entrez provides access to DNA sequence (from GenBank, EMBL and
DDBJ), protein sequence (from SWISS-PROT, PIR, PRF SEQDB, PDB and
translated protein sequence from the DNA sequence databases), genome and
chromosome mapping data, 3D protein structures from PDB, and the
PubMed bibliographic database.
Links between various databases are a strong point of NCBI’s system.
The starting point for retrieval of sequence and structure is called Entrez. It is
a www-based data retrieval system. It integrates information held in all
NCBI databases. It is the common front-end to all the databases maintained by
the NCBI and it is extremely easy to use. In total, Entrez links to 11 databases
(Table 2.2). Entrez can be accessed via the NCBI web site at the following URL:
http://www.ncbi.nlm.nih.gov/Entrez/
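Entrez records can also be requested programmatically. The sketch below only builds a request URL and makes no network call. It assumes NCBI's E-utilities web interface (the `efetch.fcgi` endpoint and its `db`, `id`, `rettype` and `retmode` parameters), which is not described in this chapter; the helper name `efetch_url` and the accession used are illustrative choices.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (an assumption of this sketch,
# not something covered in the text above).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_url(db: str, record_id: str,
               rettype: str = "fasta", retmode: str = "text") -> str:
    """Build a URL asking Entrez for one record in the given format."""
    params = {"db": db, "id": record_id,
              "rettype": rettype, "retmode": retmode}
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)


# Example: a nucleotide record requested as FASTA text.
url = efetch_url("nucleotide", "NM_000546")
print(url)
```

Opening the printed URL in a browser, or with any HTTP client, should return the record as plain FASTA text, provided the service is reachable.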
Data Model
The NCBI introduced the use of a data model for sequence-related information. This
made possible the rapid development of software and the integration of
databases that underlie the popular Entrez retrieval system and on which the
GenBank database is built. The advantages of the model are the ability to move
effortlessly from the published literature to DNA sequences to the proteins they
encode, to chromosome maps of the genes, and to the three-dimensional
structures of the protein.
Table 2.2: The databases covered by Entrez, listed by category.
Category                    Databases
1. Nucleic acid sequences   Entrez nucleotides: sequences obtained from GenBank,
                            RefSeq and PDB
2. Protein sequences        Entrez protein: sequences obtained from SWISS-PROT,
                            PIR, PRF, PDB, and translations from annotated coding
                            regions in GenBank and RefSeq
3. 3D structures            Entrez Molecular Modeling Database (MMDB)
4. Genomes                  Complete genome assemblies from many sources
5. PopSet                   From GenBank, sets of DNA sequences that have been
                            collected to analyze the evolutionary relatedness of
                            a population
6. OMIM                     Online Mendelian Inheritance in Man
7. Taxonomy                 NCBI Taxonomy Database
8. Books                    Bookshelf
9. ProbeSet                 Gene Expression Omnibus (GEO)
10. 3D domains              Domains from the Entrez Molecular Modeling Database
                            (MMDB)
11. Literature              PubMed
The NCBI data model deals directly with a DNA sequence and a
protein sequence. The translation process is represented as a link between the
two sequences rather than an annotation on one with respect to the other.
Protein related annotations, such as peptide cleavage products, are
represented as features annotated directly on the protein sequence. In this
way, it becomes very natural to analyze the protein sequences derived from
translations of CDS features by BLAST or any other sequence search tool
without losing the precise linkage back to the gene. A collection of a DNA
sequence and its translation products is called a Nuc-prot set.
The NCBI data model also defines a sequence type called a segmented sequence.
GenBank, EMBL and DDBJ represent constructed assemblies of segmented
sequences as contigs. Entrez shows this as a line connecting all its component
sequences.
Retrieval and Application

There are two main reasons for putting data on a computer: retrieval and
discovery. Retrieval is the ability to get back what was put in. Amassing
sequence information without providing a way to retrieve makes the sequence
information useless. It is more valuable to get back from the system more
knowledge than was put in. This will help in biological discoveries. Scientists
can make these kinds of discoveries by discerning connections between two
pieces of information that were not known when the pieces were entered
separately into the database or by performing computations on the data that
offer new insight into the records.
In the NCBI data model, the emphasis is on facilitating discovery; that
means the data must be defined in a way that is amenable to both linkage and
computation. NCBI uses four core data elements: bibliographic citations, DNA
sequences, protein sequences and three-dimensional structures.
In 1992, NCBI began assigning GenInfo Identifiers (gi) to all sequences
processed into Entrez, including nucleotide sequences from DDBJ/EMBL/
GenBank, the protein sequences from the translated CDS features, protein
sequences from SWISS-PROT, PIR, PRF, PDB, patents and others. The gi is
assigned in addition to the accession number provided by the source
database. The gi is simply an integer number, sometimes referred to as a GI
number. It is an identifier for a particular sequence only and it is stable and
retrievable.
Bioseq
The Bioseq, or biological sequence, is a central element in the NCBI data
model. It comprises a single, continuous molecule of nucleic acid or protein,
thereby defining a linear, integer coordinate system for the sequence. A
Seq-annot is a self-contained package of sequence annotations, or
information that refers to specific locations on specific Bioseqs. Sequence
alignments describe the relationships between biological sequences by
designating portions of sequences that correspond to each other. This
correspondence can reflect evolutionary conservation, structural similarity,
functional similarity or a random event.
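The relationship between a Bioseq, its integer coordinate system and an annotation package can be sketched with a few small classes. This is a simplified illustration, not NCBI's actual specification: the class and field names (`Bioseq`, `Feature`, `SeqAnnot`, `extract`) and the example sequence are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bioseq:
    """A single continuous molecule with integer coordinates 0..len-1."""
    seq_id: str
    molecule: str      # "dna", "rna" or "protein"
    residues: str

    def __len__(self) -> int:
        return len(self.residues)


@dataclass
class Feature:
    """An annotation that points at a location on one specific Bioseq."""
    bioseq_id: str
    start: int         # 0-based, inclusive
    stop: int          # exclusive
    label: str


@dataclass
class SeqAnnot:
    """A self-contained package of annotations."""
    features: List[Feature] = field(default_factory=list)


def extract(bioseq: Bioseq, feature: Feature) -> str:
    """Return the residues a feature refers to, checking the target id."""
    if feature.bioseq_id != bioseq.seq_id:
        raise ValueError("feature refers to a different Bioseq")
    return bioseq.residues[feature.start:feature.stop]


dna = Bioseq("demo_dna", "dna", "ATGGCCTTTTAG")   # made-up sequence
annot = SeqAnnot([Feature("demo_dna", 3, 9, "example region")])
region = extract(dna, annot.features[0])
print(region)
```

Because the annotation stores only an identifier and coordinates, it can be kept, exchanged or updated separately from the sequence it describes.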
ExPASy
The ExPASy (Expert Protein Analysis System) world wide web server
(http://www.expasy.ch) is a service provided since 1993 by a team at the Swiss
Institute of Bioinformatics (SIB). It contains databases and analytical tools
related to proteins and proteomics. The databases include SWISS-PROT,
TrEMBL, SWISS-2DPAGE, PROSITE, ENZYME and SWISS-MODEL. The
analytical tools include similarity searches, pattern and profile searches, post-
translational modification prediction, topology prediction, primary, secondary
and tertiary structure analysis and sequence alignment.
Procedure
Open the internet browser and type the URL address: http://www.expasy.ch.
Pull the drop-down menu at search option. Select Swiss-Prot/TrEMBL. Type
the name of the protein in the TEXT box. Note down the details from the query
page which will show the name of the sequence, the taxonomy classification,
description of protein, the literature regarding the sequence, etc.
Mirrors and Intranet

Different servers providing the same service are called mirrors. To access a
particular website, it is necessary to type the URL in the address bar of the
browser. Many academic institutions have an intranet, which means, a local
network that can be accessed only from computers within the institution. What
makes the web so powerful is its network. Table 2.3 gives a few gateway sites
which are comprehensive.
Table 2.3: Some basic sites for beginners of bioinformatics on the www
1. http://www.ncbi.nlm.nih.gov/
2. http://www.ebi.ac.uk/
3. http://www.expasy.ch/
4. http://www.embl.de/
5. http://www.izb.fraunhofer.de/en.html
6. http://themecraft.net/www/bmn.com
Apart from these, there are a great number of specialist sites with
biological data which can be accessed. General-purpose search engines such
as Google, Yahoo, Bing, AltaVista and Hotbot, and reference sites such as
Wikipedia, are helpful in this.
STUDY QUESTIONS
1. What is a computer?
2. What is software?
3. Name some languages used in computer programming.
4. What are the advantages of PERL?
5. What is the Internet?
6. How does the Internet work?
7. What is the World Wide Web?
8. What are browsers? Give some examples.
9. How does Netscape Navigator work?
10. Give details about EMBnet.
11. How is sequence retrieval system useful in bioinformatics?
12. What is the role of NCBI in maintaining sequence databases?
13. What is the use of Entrez?
14. Explain Bioseq and ExPASy.
CHAPTER 3
DNA, RNA and Proteins
The properties that characterize a living organism (species) are based on its
fundamental set of genetic information – its genome. A genome is composed of
one or more DNA molecules (RNA in some viruses), each organized as a
chromosome. The DNA has all the necessary information encoded in it for
the functions of the cell. DNA sequence determines the protein sequence.
Protein sequence determines the protein structure. Protein structure determines
the protein function. Hence it is important to understand the fundamental
aspects of DNA, RNA and protein and their interaction.
3.1 BACKGROUND
As early as 1866, Gregor Mendel suggested that factors of inheritance exist in
pea plants. At the beginning of the twentieth century, it became clear
that Mendel’s factors were related to parts of the cell called chromosomes.
Chromosomes are threadlike strands of chemical material located in the cell
nucleus.
Also, during this time, geneticists began using the terms ‘inheritance
unit’ and ‘genetic particle’ to describe the factors occurring on the
chromosomes of Mendel’s pea plants. By the 1920s, these terms were discarded
and the word gene was used following the suggestion of Wilhelm Johannsen.
Scientists viewed the gene as a specific and separate entity located on the cell’s
chromosome.
Initial Studies
In 1869, Friedrich Miescher isolated nucleic acid from nucleus and named this
substance nuclein. Later Phoebus Levene and his coworkers studied the
components of nuclein and gave it a more descriptive and technical name,
deoxyribonucleic acid (DNA). They also identified ribonucleic acids (RNA)
from some organisms.
Their analysis revealed that both nucleic acids contain three basic
components: (i) a five-carbon sugar, which could be either ribose (in RNA) or
deoxyribose (in DNA), (ii) a series of phosphate groups, that is, chemical
groups derived from phosphoric acid molecules, and (iii) four different
compounds containing nitrogen and having the chemical properties of bases.
In DNA the four bases include adenine, thymine, guanine and cytosine; and in
RNA, they are adenine, uracil, guanine and cytosine. Adenine and guanine
are double-ring molecules known as purines; cytosine, thymine and uracil
are single-ring molecules called pyrimidines [Fig. 3.1].
Fig. 3.1 The components of nucleic acid. The first component is a phosphate group, a
derivative of phosphoric acid composed of phosphorus, oxygen, and hydrogen atoms. The
second component is a five-carbon sugar, either deoxyribose (in DNA) or ribose (in RNA).
The third is one of the five nitrogenous bases adenine, guanine, cytosine, thymine, and
uracil. Note the presence of nitrogen. The first two bases are known as purines; the last
three are pyrimidines.
Advanced Studies
In 1949, Erwin Chargaff reported that in DNA the amount of adenine is
always equal to the amount of thymine regardless of the source of the DNA
and the amount of cytosine is consistently equal to the amount of guanine.
Chargaff’s observations played an important role in the double helix model of
DNA proposed by James D. Watson and Francis H.C. Crick, as did the
experimental data of Maurice H.F. Wilkins and Rosalind Franklin, which
suggested that the DNA molecule was a helix. (In 1962, Watson, Crick and
Wilkins were awarded the Nobel Prize in Physiology or Medicine.
Franklin had died of cancer in 1958, and because the Nobel Prize is not
awarded posthumously, she did not share in the award.)
In 1902, Archibald Garrod postulated that a genetic disease is caused by
a change in the ancestor’s genetic material. He also suggested that
alkaptonuria results from the lack of an enzyme that breaks down alkapton.
(Patients with this disease expel urine that rapidly turns black on exposure
to air. The color change takes place because the urine contains alkapton, a
substance that darkens on exposure to oxygen. In normal individuals, alkapton
[known chemically as homogentisic acid] is broken down to simpler substances
in the body, but in persons with alkaptonuria, the body cannot make this
transformation, and alkapton is excreted.)
In the 1940s, Beadle and Tatum postulated the ‘one gene – one enzyme
hypothesis’, which suggested that the genes of a cell influence the production
of cellular enzymes. (An enzyme is a protein that catalyses a chemical reaction
of metabolism while it itself remains unchanged.)
Contribution from Biochemists

In the 1940s, biochemists reported that cells undergoing protein synthesis possess
an unusually large amount of RNA. They theorized that RNA synthesis could
occur in the nucleus, then the RNA could travel to the cytoplasm, where it
would determine the amino acid sequence in the protein.
In 1961 F.H.C. Crick and his colleagues reasoned that the genetic code of
DNA probably consists of a series of blocks of chemical information, each
block corresponding to an amino acid in the protein. They further
hypothesized that within a single block a sequence of three nitrogenous bases
specifies an amino acid, and supported this with experiments. For their work on
the genetic code, Marshall Nirenberg and Har Gobind Khorana, together with
Robert W. Holley, were awarded the 1968 Nobel Prize in Physiology or Medicine.
In the ensuing years, biochemists demonstrated that the genetic code is
nearly universal: the same three-base codes specify the same amino acids
regardless of whether the organism is a bacterium, a bee or a plant. The essential
difference among species of organisms is not the nature of the nitrogenous
bases but the sequence in which they occur in the DNA molecule.
Central Dogma
Different sequences of bases in DNA specify different sequences of bases in
RNA, and the sequence of bases in RNA specifies the sequences of amino
acids in proteins (Fig. 3.2). This is the so-called central dogma of protein
synthesis. And as the nucleic acid and protein vary, so does the species of an
organism [Fig 3.3].
[Fig. 3.2(a) diagram: a DNA double helix is copied into a single-stranded RNA,
which is translated into a polypeptide chain (Asp, Ala, Ala, Phe, Ser, Lys);
for example, the codon A-A-G translates into lysine (Lys).]
DNA Triplet   RNA Triplet   Amino Acid Specified
TAC           AUG           “Start”
ATC           UAG           “Stop”
AAA           UUU           Phenylalanine
AGG           UCC           Serine
ACA           UGU           Cysteine
GGG           CCC           Proline
GAA           CUU           Leucine
GCG           CGC           Arginine
TTC           AAG           Lysine
TGC           ACG           Threonine
CCG           GGC           Glycine
CTA           GAU           Aspartic acid
Fig. 3.2 Gene expression and protein synthesis. (a) The base code in DNA is used to
formulate a base code in RNA by the process of transcription. The RNA molecule is then
used in translation to encode an amino acid sequence in a protein.
(b) Some selected triplet codes in DNA and RNA and the amino acid specified in the
protein. Note that the RNA code (known as a codon) is the complement of the DNA code
and that certain codons are "start" or "stop" signals.
Genomic DNA -> (transcription) -> mRNA -> (translation) -> Protein
Fig. 3.3 The central dogma states that DNA is transcribed into RNA, which is then
translated into protein.
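The two steps of the central dogma can be sketched in Python using a few of the triplets listed for Fig. 3.2. This is a minimal illustration under simplifying assumptions: the codon table is deliberately partial, the template strand is a made-up example, strand direction (5' and 3' ends) is ignored, and the start codon AUG is shown as the methionine it encodes.

```python
# Base pairing used when RNA is copied from a DNA template strand.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# A partial codon table (only the codons needed for this example).
CODON_TABLE = {
    "AUG": "Met",   # also the "start" signal
    "UUU": "Phe",
    "UCC": "Ser",
    "CCC": "Pro",
    "CUU": "Leu",
    "AAG": "Lys",
    "GGC": "Gly",
    "GAU": "Asp",
    "UAG": "Stop",
}


def transcribe(template_dna: str) -> str:
    """Return the RNA that is complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template_dna)


def translate(mrna: str) -> list:
    """Read codons three bases at a time until a stop codon is reached."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide


template = "TACAAATTCATC"          # hypothetical template strand
mrna = transcribe(template)
peptide = translate(mrna)
print(mrna, peptide)
```

A full implementation would carry all 64 codons and would locate the start codon before reading triplets; both are left out here to keep the sketch short.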
3.2 DNA
DNA is a linear, double-helical structure (Fig. 3.4). The double helix is
composed of two intertwined chains made up of building blocks called
nucleotides (Fig. 3.5). Each nucleotide consists of a phosphate group, a
deoxyribose sugar molecule and one of four different nitrogenous bases:
adenine, guanine, cytosine or thymine. Each of the four nucleotides is usually
designated by the first letter of the base it contains: A, G, C or T.
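[Fig. 3.4 diagram: the DNA double helix, labelled with a 2 nm diameter
(1.0 nm per strand), 0.34 nm between successive nucleotides, 3.4 nm per turn,
and the wide and narrow grooves.]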
Fig. 3.4 What the X-ray diffraction photographs revealed about DNA. Watson and Crick
postulated that DNA is composed of two ribbon-like "backbones" composed of alternating
deoxyribose and phosphate molecules. They surmised that nucleotides extend out from the
backbone chains and that the 0.34 nm distance represents the space between successive
nucleotides. The data showed a distance of 3.4 nm between turns, so they inferred that ten
nucleotides exist per turn. One strand of DNA would only encompass 1 nm width, so they
postulated that DNA is composed of two strands to conform to the 2 nm diameter observed
in the X-ray diffraction photographs.
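The arithmetic behind the "ten nucleotides per turn" inference can be checked directly. The sketch below uses the two distances labelled in the figure (0.34 nm between successive nucleotides and 3.4 nm per turn); the function name and the 1000-base-pair example are illustrative only.

```python
RISE_PER_NUCLEOTIDE_NM = 0.34   # spacing between successive nucleotides
PITCH_NM = 3.4                  # length of one complete turn of the helix

# Nucleotides per turn = pitch / rise (rounded to absorb float error).
nucleotides_per_turn = round(PITCH_NM / RISE_PER_NUCLEOTIDE_NM)


def helix_length_nm(base_pairs: int) -> float:
    """Approximate end-to-end length of a straight double helix."""
    return base_pairs * RISE_PER_NUCLEOTIDE_NM


print(nucleotides_per_turn)
print(helix_length_nm(1000))
```

By the same reasoning, a stretch of 1000 base pairs would span roughly 340 nm and about 100 helical turns.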
[Fig. 3.5 diagram: chemical structure of a four-nucleotide DNA strand (thymine,
adenine, cytosine, guanine) linked through phosphate groups between the 5' and
3' carbons, shown beside its shorthand form, with a free OH group at the 3' end.]
Fig. 3.5 The binding of nucleotides to form a nucleic acid. The phosphate group forms a
bridge between the 5' carbon atom of one nucleotide and the 3' carbon atom of the next
nucleotide. A water molecule (H2O) results from the union of the hydroxyl group (-OH) formerly at
the 3' carbon atom and a hydrogen atom (-H) formerly in the phosphate group. The linkage
between nucleotides is a "3'-5' linkage"; the bond is called a phosphodiester bond. Note that
the 3' carbon of the lowest nucleotide is available for linking to another nucleotide (this is
called the 3' end of the molecule) and that the phosphate group of the uppermost nucleotide
can link to still another nucleotide (this is the 5' end).
Each nucleotide chain is held together by bonds between the sugar and
phosphate backbone of the chain. The two intertwined chains are held
together by weak bonds between bases of opposite chains. There is a lock and
key fit between the bases of the opposite strands, such that adenine pairs only
with thymine and guanine pairs only with cytosine. The bases that form base
pairs are said to be complementary. DNA is replicated by the unwinding of the
two strands of the double helix and the building up of a new complementary
strand on each of the separated strands of original double helix (Fig. 3.6).
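Base pairing and replication can be sketched in Python as follows. The strand used is a made-up example, and the sketch pairs bases position by position, ignoring the opposite orientation of the two strands. It also shows why equal amounts of A and T, and of G and C, follow from complementary pairing.

```python
from collections import Counter

# Complementary base pairs in DNA.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the strand that pairs with the given one, base by base."""
    return "".join(PAIR[base] for base in strand)


def replicate(old_a: str, old_b: str):
    """Each old strand acts as a template for one new strand."""
    daughter_1 = (old_a, complement(old_a))   # old strand + new strand
    daughter_2 = (complement(old_b), old_b)   # new strand + old strand
    return daughter_1, daughter_2


old_a = "CCATCAC"                 # hypothetical parent strand
old_b = complement(old_a)         # its partner in the double helix

daughter_1, daughter_2 = replicate(old_a, old_b)
counts = Counter(old_a + old_b)   # base composition of the duplex
print(daughter_1, daughter_2, counts)
```

Both daughter molecules come out identical to the parent, each made of one old strand and one newly built strand.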
[Fig. 3.6 diagram: a parent DNA molecule unwinding into two daughter molecules,
each made of one old strand and one new strand.]
Fig. 3.6 The general plan of DNA replication. The double helix unwinds, and the two 'old'
strands serve as templates for the synthesis of 'new' strands having complementary bases.
An organism’s basic complement of DNA is called its genome. The
somatic cells of most plants and animals contain two copies of their genomes;
these organisms are diploid. The cells of most fungi, algae, and bacteria
contain just one copy of the genome; these organisms are haploid. The genome
itself is made up of chromosomes, which contain DNA.
Chromosome
Chromosome literally means colored body. A chromosome is a threadlike
structure of chemical material located in the cell nucleus. Genes are encoded in
the DNA molecule, which in turn is organized into chromosomes. Based on the
organization of chromosomes, living organisms are classified broadly into
Prokaryotes and Eukaryotes.
The prokaryotic chromosome is very simple in organization. It is a
single, normally circular, double helix of DNA. The nuclear
material does not have a distinct nuclear membrane. The eukaryotic
chromosome is a linear double helix of DNA. The nuclear material has a
distinct nuclear membrane and is highly coiled.
In diploid cells, each chromosome and its component genes are present
twice. For example, human somatic cells contain two sets of 23 chromosomes,
for a total of 46 chromosomes. Two chromosomes with the same gene array are
said to be homologous. In eukaryotes, chromosomes occur in pairs.
Centromere
Each chromosome has a constriction called the centromere. Depending on the
position of the centromere, four types of chromosomes are seen. If the centromere
is found in the middle of the chromosome, it is the metacentric type. If the
centromere is slightly away from the middle, it is the submetacentric type. If the
centromere is very close to the tip, it is the acrocentric type. If the
centromere is at the very tip of the chromosome, it is the telocentric type.
The centromeres are the sites of attachment of spindle fibres which are formed
during cell division.
In many species a separate pair of chromosomes is present for sex
determination and they are referred to as sex chromosomes. All the other
chromosomes are referred to as autosomes. When the chromosomes are
photographed using a cytological preparation, and then cut and arranged
according to size, the display of the complete diploid set is called a karyotype.
The presentation of the chromosomes in a diagrammatic manner is referred to
as an ideogram. The end portions of chromosomes are called telomeres, where
short multiple repeat sequences of DNA are arranged.
All living beings contain genetic information in the form of DNA within
their cells. A characteristic of all living organisms is that DNA is reproduced
and passed on to the next generation. DNA contains instructions for making
proteins.
Gene
A gene is a sequence of chromosomal DNA that is required for the production
of a functional product: a polypeptide or a functional RNA molecule. A gene
includes not only the actual coding sequences but also adjacent nucleotide
sequences required for the proper expression of genes.
3.3 RNA
RNA is the other major nucleic acid and it is usually single-stranded, unlike DNA
which is double-stranded. It contains ribose instead of deoxyribose in its
sugar-phosphate backbone, and uracil (U) instead of thymine (T).
There are three types of RNAs in the cells for use in protein synthesis:
messenger RNA (mRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA).
mRNA acts as a template for protein synthesis; the rRNA and tRNA form a
part of the protein synthesizing machinery. In eukaryotes, mRNA is produced inside
the nucleus by transcription of protein coding genes by RNA polymerase II. Unlike
in prokaryotes, the coding sequence of a eukaryotic gene is often not continuous
(Fig. 3.7). There are a number of noncoding sequences known as introns
interspersed with the coding sequences called exons, the parts of the gene
expressed as protein. Introns do not contain information for a functional gene
product such as a protein, but they may contain regulatory sequences (switches) for genes.
[Fig. 3.7 diagram: a prokaryote gene (regulatory region for transcription
initiation, coding region, transcription termination signals) and a eukaryote
gene (the same regions, with the coding region split into exons by introns).]
Fig. 3.7 Generalized gene structure in prokaryotes and eukaryotes. The coding region is
the region that contains the information for the structure of the gene product (usually a
protein). The adjacent regulatory regions (light line) contain sequences that are recognized
and bound by proteins that make the gene's RNA and by proteins that influence the amount
of RNA made. Note that in a eukaryotic gene the coding region is often split into segments
(exons) by one or more noncoding introns. (Source: A.J.F. Griffiths et al., Modern Genetic
Analysis, W.H. Freeman and Company, 2002)
Pre-mRNA
When the RNA polymerase sweeps down the DNA template with introns and
exons, a preliminary mRNA molecule (pre-mRNA) is formed. Processing of the
pre-mRNA is therefore required to remove the non-coding introns from it. The introns are
removed biochemically; the exons are spliced together to form the functional
mRNA molecule. Splicing makes the coding sequence continuous and the
mRNA emerges as an accurate template for building up of the protein (Fig. 3.8).
Fig. 3.8 The formation of mRNA. A gene consists of exons, the parts of the gene expressed
as protein, and introns, the intervening sequences between the exons. In the formation of
mRNA, the gene is transcribed to a preliminary mRNA molecule. Then the introns are removed
biochemically and the exons are spliced together. This activity results in the functional
mRNA molecule, which is then ready for translation. This type of processing does not occur
in mRNA production in prokaryotic cells such as bacterial cells; it occurs only in eukaryotic
cells such as plant and animal (including human) cells.
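The removal of introns and joining of exons can be sketched as simple string slicing. The sequences and coordinates below are made-up examples, and the function name `splice` is a hypothetical helper; real splicing is a biochemical process and not a coordinate lookup.

```python
def splice(pre_mrna: str, exons: list) -> str:
    """Join the exon segments, given as (start, stop) pairs, in order."""
    return "".join(pre_mrna[start:stop] for start, stop in exons)


# A hypothetical pre-mRNA with two exons separated by one intron.
exon_1 = "AUGGCC"
intron = "GUAAGUCUAACAG"
exon_2 = "UUUUAG"
pre_mrna = exon_1 + intron + exon_2

# Exon coordinates on the pre-mRNA (0-based start, exclusive stop).
exons = [
    (0, len(exon_1)),
    (len(exon_1) + len(intron), len(pre_mrna)),
]

mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)
```

The mature molecule is shorter than the precursor by exactly the length of the removed intron, and its coding sequence is now continuous.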
The processing of pre-mRNA also includes modification of the 5’ end
nucleotide, which is called capping. The 3’ end is modified by the addition of a
long stretch of about 250 adenine nucleotides. This process is called
polyadenylation and the long tail is called the poly(A) tail (Fig. 3.9). Inside
the nucleus, the action of RNA polymerase II produces a number of species of
mRNA. The mRNA populations inside the nucleus vary in length and in stages of
processing. Such an mRNA population is called heterogeneous nuclear RNA (hnRNA).
[Fig. 3.9 diagram: a eukaryotic gene showing the promoter, transcription start
site, 5' UTR, translation initiation site, exons 1 to 3 separated by introns 1
and 2 (each intron marked GU ... A ... AG), translation termination site,
3' UTR, polyadenylation signal (AAUAAA) and transcription termination site.
The primary RNA transcript receives a cap, undergoes 3' cleavage and addition
of a poly(A) tail, and is spliced into mature mRNA containing exons 1, 2 and 3.]
Fig. 3.9 Transcriptional and translational landmarks in a eukaryotic gene with two introns
(top line), and the processing of its transcript to make mRNA. Note that since the landmarks
shown are relevant to RNA, U is given in the gene sequence instead of T. (Source: A.J.F.
Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002)
Splicing
Splicing is carried out inside the nucleus by a group of molecules which have
a catalytic function similar to enzymes. These are small RNA
molecules rich in uracil, called U-RNAs or small nuclear RNAs (snRNAs), which
combine with proteins to form small nuclear ribonucleoproteins (snRNPs). Several
snRNPs, such as U1, U2, U4, U5 and U6, are involved in splicing reactions.
The exon-intron junction has a specific nucleotide sequence, which is called
the signature sequence. This signature sequence is identified by the snRNPs. The
RNA portion of the snRNP interacts with the splice junction nucleotides by
base pairing.
In vertebrate animals a branch point sequence is present within the intron. The
U1 snRNP binds to the 5’ splice site and the U2 snRNP binds to the branch point
sequence. The remaining snRNPs, U5 and U4/U6, form a complex with U1
and U2, causing the intron to loop so that the exons come together.
The combination of the intron and snRNPs is called the spliceosome. The
spliceosome loops the intron, brings the exon ends together and joins them
(Fig. 3.10). In some unicellular organisms, instead of snRNPs, the RNA
itself catalyses the splicing reaction, acting as a ribozyme.
[Fig. 3.10 diagram: a pre-mRNA with exon 1, an intron (GU ... A ... AG) and
exon 2; the spliceosome, composed of five different snRNPs, attaches to the
pre-mRNA; splicing reactions 1 and 2 yield the spliced exons and the excised
intron as a lariat.]
Fig. 3.10 The structure and function of a spliceosome. The spliceosome is composed of
several snRNPs that attach sequentially to the RNA, taking up positions roughly as shown.
Alignment of the snRNPs results from hydrogen bonding of their snRNA molecules to the
complementary sequences of the intron. In this way the reactants are properly aligned and
the splicing reactions (1 and 2) can occur. The P-shaped loop, or lariat structure, formed by
the excised intron is joined through the central adenine nucleotide. (Source: A.J.F. Griffiths
et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
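The GU and AG marks at the two ends of the intron in the figure suggest a simple check that can be sketched in code. This is a toy illustration on made-up sequences: the helper names are hypothetical, and real splice-site recognition depends on longer sequence signals and on the snRNPs themselves.

```python
def looks_like_intron(segment: str) -> bool:
    """True if the RNA segment starts with GU and ends with AG."""
    return (len(segment) >= 4
            and segment.startswith("GU")
            and segment.endswith("AG"))


def excise(pre_mrna: str, intron: str) -> str:
    """Remove the first occurrence of the intron and join the exons."""
    start = pre_mrna.index(intron)
    return pre_mrna[:start] + pre_mrna[start + len(intron):]


exon_1, intron, exon_2 = "AUGGCC", "GUAAGUCUAACAG", "UUUUAG"
pre_mrna = exon_1 + intron + exon_2

if looks_like_intron(intron):
    spliced = excise(pre_mrna, intron)
    print(spliced)
```

The check only confirms the two boundary dinucleotides; it says nothing about the branch point adenine through which the lariat is joined.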
Capping
Capping is a process by which the 5’ end of mRNA is protected from
exonuclease enzymes. Typically a prokaryotic mRNA remains stable only for a
few minutes. In eukaryotes the half-life of mRNA is around 6 h. A nucleotide
may also be deleted, added or substituted by RNA editing.
tRNA
tRNAs are adapter-like, small, linking molecules. The function of tRNA is to
bring the correct amino acid to the mRNA molecule and add it to the growing
polypeptide chain during protein synthesis. Every amino acid has its own
tRNA. tRNA has two ends. One end has the anticodon, which base pairs
with the codon of mRNA. The other end acts as a socket to attach the amino
acid.
According to the sequence of codons in mRNA, the amino acids are
brought in by tRNAs and a specific polypeptide sequence is thus built. tRNA
molecules have between 74 and 95 nucleotides. tRNAs are produced in a
precursor form called pre-tRNAs. Several tRNA genes are transcribed together
without interruption by the RNA polymerase III enzyme. Ribonuclease enzymes
then cleave the precursor into individual tRNAs.
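The pairing between a codon and its anticodon can be sketched as follows. The sketch assumes strict complementary pairing in antiparallel orientation and ignores the looser pairing that some tRNAs allow at the third codon position; the function name is a hypothetical helper.

```python
# Complementary base pairs in RNA.
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon_for(codon: str) -> str:
    """Return the anticodon (written 5' to 3') that pairs with a codon.

    The two strands pair in opposite directions, so the codon is read
    backwards while each base is replaced by its partner.
    """
    return "".join(RNA_PAIR[base] for base in reversed(codon))


lysine_anticodon = anticodon_for("AAG")   # codon for lysine
start_anticodon = anticodon_for("AUG")    # start codon
print(lysine_anticodon, start_anticodon)
```

Applying the function to an anticodon returns the original codon, which reflects the symmetry of base pairing.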
Ribosomes
Ribosomes are macromolecular complexes composed of both RNA and several
polypeptides. Ribosomes provide a firm platform for protein synthesis. Each
ribosome is composed of a large and a small subunit (Fig. 3.11).
Fig. 3.11. Ribosomes contain a large and a small subunit. Each subunit contains rRNA of
varying lengths and a set of proteins. There are two principal rRNA molecules in all ribosomes
(shown in the column on the left). Ribosomes from prokaryotes also contain one 120-base-
long rRNA that sediments at 5S, whereas eukaryotic ribosomes have two small rRNAs: a 5S
RNA molecule similar to the prokaryotic 5S, and a 5.8S molecule 160 bases long. The
proteins of the large subunit are named L1, L2, etc., and those of the small subunit
S1, S2, etc. (Source: Lodish et al., Molecular Cell Biology, Scientific American Books, Inc.,
1995).
rRNA
The prokaryotic ribosomes are of the 70S type. The subunits have 50S and 30S
values (S stands for the Svedberg unit, a measure of sedimentation rate). The
50S subunit has two rRNAs and 31 polypeptides. The 30S subunit has a single
rRNA and 21 polypeptides. In eukaryotes the ribosomes are of the 80S type. The
subunits have 60S and 40S values. The 60S subunit has 3 rRNAs and about 49
polypeptides. The 40S subunit has one rRNA and about 33 polypeptides. RNA
polymerase I transcribes the rRNA genes.
In prokaryotes such as E. coli, there are 7 copies of rRNA genes scattered
throughout the genome. Each gene contains one copy each of the 16S, 23S and 5S
rRNA sequences arranged consecutively. The gene is transcribed as a single
pre-rRNA (30S) molecule, which is processed to produce individual rRNAs.
The pre-rRNA folds into a number of stem-loop structures to which
ribosomal proteins bind. During this time some of the nucleotides of the rRNA
are methylated. Finally, the ribonuclease RNase III cleaves the precursor and
releases the 5S, 23S and 16S RNAs. Mature rRNAs are formed by further trimming
at the 5’ and 3’ ends by ribonucleases M5, M16 and M23.
In eukaryotes, the sequences of the 28S, 18S and 5.8S rRNAs are present
in a single gene. This gene exists in multiple copies separated by short non-
transcribed regions. In humans, there are about 200 gene copies occurring in
5 clusters on separate chromosomes. RNA polymerase I transcribes these
genes. Transcription takes place in the nucleolus inside the nucleus. In humans
the pre-rRNA is 45S in size. It is processed in a manner similar to that in
prokaryotes: ribonucleases cleave the pre-rRNA to yield the mature 28S, 18S
and 5.8S rRNAs. Small
cytoplasmic RNAs (scRNAs) direct protein traffic within the eukaryotic cell.
3.4 TRANSCRIPTION AND TRANSLATION

The biological role of most genes is to carry information specifying the
chemical composition of proteins and the regulatory signals that will govern
their production by the cell. Modern biochemists agree that the process of
protein synthesis is initiated by an uncoiling of the DNA double helix and an
uncoupling of the two strands of DNA. A functional region of DNA, the gene,
is thereby exposed.
Transcription
The first step taken by the cell to make a protein is to copy or transcribe the
nucleotide sequence in one strand of the gene into a complementary single-
stranded molecule called ribonucleic acid (RNA) (Fig. 3.12). Free
nucleotides available in the region are used for the synthesis, and an enzyme
called RNA polymerase links the nucleotides together to form the RNA
molecule.
DNA, nontemplate strand  5’ CTGCCATTGTCAGACATGTATACCCCGTACGTCTTCCCGAGCGAAAACGATCTGCGCTGC 3’
DNA, template strand     3’ GACGGTAACAGTCTGTACATATGGGGCATGCAGAAGGGCTCGCTTTTGCTAGACGCGACG 5’
mRNA                     5’ CUGCCAUUGUCAGACAUGUAUACCCCGUACGUCUUCCCGAGCGAAAACGAUCUGCGCUGC 3’
Fig. 3.12. The mRNA sequence is complementary to the DNA template strand from which
it is synthesized and therefore matches the sequence of the nontemplate strand (except that
RNA has U where DNA has T). The sequence shown here is from the gene for the enzyme
β-galactosidase, which is involved in lactose metabolism. (Source: A.J.F. Griffiths et al.,
Modern Genetic Analysis, W.H. Freeman and Company, 2002).
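The base-pairing relationship described in the caption of Fig. 3.12 can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the sequence is the nontemplate strand copied from the figure.

```python
# Hypothetical helper names; the sequence is copied from Fig. 3.12.
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}

def complement(dna):
    """Complement each base; the result reads in the opposite 5'/3' direction."""
    return "".join(DNA_PAIR[base] for base in dna)

def transcribe(template_3_to_5):
    """Pair each template base (written 3'->5') with its RNA partner (mRNA 5'->3')."""
    return "".join(RNA_PAIR[base] for base in template_3_to_5)

nontemplate = ("CTGCCATTGTCAGACATGTATACCCCGTAC"
               "GTCTTCCCGAGCGAAAACGATCTGCGCTGC")   # written 5'->3'
template = complement(nontemplate)                  # written 3'->5'
mrna = transcribe(template)                         # written 5'->3'

print(template)
print(mrna)
# The mRNA equals the nontemplate strand with U in place of T.
print(mrna == nontemplate.replace("T", "U"))
```

Transcribing the template strand and replacing T with U in the nontemplate strand give the same string, which is the point the figure makes.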
The production of RNA is called transcription, a word coined by Crick in
1956. The fragments so constructed are known as RNA transcripts. These
RNA molecules, together with ribosomal proteins and enzymes, constitute a
system that carries out the task of reading the genetic message and producing
the protein that the genetic message specifies.
The transcription process, which in eukaryotes occurs in the cell nucleus, is
very similar to the process for replication of DNA because the DNA strand
serves as the template for making the RNA copy, which is called a transcript.
The RNA transcript (which in many species undergoes some structural
modifications) becomes a working copy of the information in the gene, a kind
of message molecule called messenger RNA (mRNA). The mRNA then enters the
cytoplasm, where it is used by the cellular machinery to direct the manufacture
of a protein.
Translation
The process of producing a chain of amino acids based on the sequence of
nucleotides in the mRNA is called translation. The nucleotide sequence of an
mRNA molecule is read from one end of the mRNA to the other, in groups of
three successive bases. These groups of three are called codons (e.g. AUU, CCG,
UAC). Because there are four different nucleotides, there are 4 × 4 × 4 = 64
different possible codons, each one coding either for an amino acid or for a
signal to terminate translation (Table 3.1).
Table 3.1: The genetic code. Notice that an amino acid can be coded by several
different codons. A stop codon does not code for an amino acid, but instead signals
to the ribosome that this is the end of the protein and that translation should cease.
                         Second letter
First                                                     Third
letter   U           C           A           G            letter
U        UUU Phe     UCU Ser     UAU Tyr     UGU Cys      U
         UUC Phe     UCC Ser     UAC Tyr     UGC Cys      C
         UUA Leu     UCA Ser     UAA Stop    UGA Stop     A
         UUG Leu     UCG Ser     UAG Stop    UGG Trp      G
C        CUU Leu     CCU Pro     CAU His     CGU Arg      U
         CUC Leu     CCC Pro     CAC His     CGC Arg      C
         CUA Leu     CCA Pro     CAA Gln     CGA Arg      A
         CUG Leu     CCG Pro     CAG Gln     CGG Arg      G
A        AUU Ile     ACU Thr     AAU Asn     AGU Ser      U
         AUC Ile     ACC Thr     AAC Asn     AGC Ser      C
         AUA Ile     ACA Thr     AAA Lys     AGA Arg      A
         AUG Met     ACG Thr     AAG Lys     AGG Arg      G
G        GUU Val     GCU Ala     GAU Asp     GGU Gly      U
         GUC Val     GCC Ala     GAC Asp     GGC Gly      C
         GUA Val     GCA Ala     GAA Glu     GGA Gly      A
         GUG Val     GCG Ala     GAG Glu     GGG Gly      G
Because only 20 kinds of amino acids are used in the polypeptides that
make up proteins, more than one codon may correspond to the same amino
acid. For example, the three codons AUU, AUC and AUA all code for
isoleucine, and UUU and UUC code for phenylalanine. The mRNA molecule
consists of a series of codons formed as RNA polymerase sweeps down the
DNA template.
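The counting argument (64 codons for 20 amino acids plus stop signals) can be made concrete by building the standard genetic code as a lookup table. This is a sketch only: the names `BASES`, `AMINO_ACIDS` and `CODON_TABLE` are illustrative, amino acids are written with their one-letter codes, and `*` marks a stop codon.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes ('*' = stop) for the codons in the order
# UUU, UUC, UUA, UUG, UCU, ... (first base varies slowest, third fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons specify each amino acid (the degeneracy of the code).
degeneracy = Counter(CODON_TABLE.values())

print(len(CODON_TABLE))                     # number of possible codons
print(len(degeneracy) - 1)                  # distinct amino acids (stop excluded)
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "I"))  # isoleucine codons
print(degeneracy["F"], degeneracy["L"], degeneracy["*"])
```

The table has 64 entries but only 20 distinct amino acids; isoleucine is reached by three codons and phenylalanine by two, as stated above.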
In a eukaryotic cell, the mRNA molecule now moves through a pore in
the nuclear membrane into the cell cytoplasm. Here it combines with one or
more ribosomes. During this time different amino acids join with their specific
tRNA molecules in the cytoplasm. Once bound together, the different tRNA
molecules attach to the ribosome where the mRNA is stationed. One portion of
the mRNA molecule attaches to the small subunit (40S in eukaryotes, 30S in
prokaryotes) and a tRNA molecule with its amino acid attaches to the large
subunit (60S in eukaryotes, 50S in prokaryotes).
During this step, the codon of the mRNA attracts a complementary
anticodon on the tRNA. The codon-anticodon matching brings a specified
amino acid into position. The matching thus denotes the amino acid’s location
in the protein chain. At this precise moment, the genetic code of DNA is
expressed as the location of an amino acid in a protein chain.
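Codon-anticodon matching follows the same base-pairing rules used in transcription. The sketch below assumes strict Watson-Crick pairing (it ignores wobble pairing at the third position), and the function name `anticodon` is illustrative.

```python
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon(codon):
    """Return the anticodon, written 5'->3', for a codon written 5'->3'.

    The two strands pair antiparallel, so the codon is read backwards
    while each base is replaced by its partner.
    """
    return "".join(RNA_PAIR[base] for base in reversed(codon))

for codon in ("AUG", "UUC"):
    print(codon, "pairs with anticodon", anticodon(codon))
```

For the codon AUG (methionine) this gives the anticodon CAU, and for UUC (phenylalanine) it gives GAA.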
After pairing with mRNA, the tRNA-amino acid is held in a viselike grip
on the ribosome’s larger subunit. The ribosome then moves along the mRNA to
a new location. Here a second tRNA with its amino acid approaches the
ribosome and pairs its anticodon with the second codon on the mRNA
molecule. Thus, two tRNA molecules and their amino acids stand next to one
another on the mRNA. Within milliseconds, an enzyme from the large subunit of
the ribosome joins the amino acids together to form a dipeptide (two amino
acids in a chain).
The first tRNA is now free of its amino acid, and it moves back to the
cytoplasm, leaving its amino acid behind and joined to the second amino acid.
Now the ribosome moves to a third location at the third codon of the mRNA. A
new tRNA with its amino acid enters the picture and the process continues,
forming a long chain of amino acids called a polypeptide (Figs. 3.13a and 3.13b).
The peptide bond is formed by the removal of water between amino
acids (Fig. 3.14). The final one or two codons of the mRNA are chain
terminator or ‘stop’ signals. As these codons are reached (UAA, UAG or UGA),
no complementary tRNA molecules exist and no amino acids are added to the
chain. Instead, the stop signals activate release factors to discharge the
polypeptide chain from the ribosome. The polypeptide then folds to yield
the functional protein.
The Nature of Chemical Bonds

By definition, elements are things that cannot be further reduced by chemical
reaction. Elements are made of individual atoms, which in turn are made of
smaller subatomic particles. These can be separated only by physical
reactions. Only three subatomic particles (neutron, proton and electron) are
stable. The number of protons in the nucleus of an atom determines what
element it is. Generally, for every proton in an atomic nucleus there is an
electron in orbit around it to balance the electrical charges.
Fig. 3.13a The addition of a single amino acid (aa6), carried by the tRNA at the A site, to
the growing polypeptide chain, tethered by the tRNA at the P site, during translation of
mRNA. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and
Company, 2002).
[Fig. 3.13b diagram: two ribosomes on one mRNA (codons 1 to 11), each carrying a growing polypeptide (aa1, aa2, ...) while tRNAs deliver the next amino acids.]
Fig. 3.13b The addition of an amino acid (aa) to a growing polypeptide chain in the
translation of mRNA. Multiple copies of the polypeptide are produced by a train of
ribosomes following each other along the mRNA; two such ribosomes are shown. (Source:
A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
[Fig. 3.14 diagram: (a) three amino acids aa1, aa2 and aa3, with side chains R1, R2 and R3, condense into a tripeptide running from the amino end to the carboxyl end and joined by two peptide bonds, with release of 2 H2O; (b) the planar peptide group with bond lengths C=O 1.24, C-N 1.32, N-Cα 1.46 and Cα-C 1.51.]
Fig. 3.14 The peptide bond (a) A polypeptide is formed by the removal of water between
amino acids to form peptide bonds. Each aa indicates an amino acid. R1, R2 and R3
represent R groups (side chains) that differentiate the amino acids. R can be anything from
a hydrogen atom (as in glycine) to a complex ring (as in tryptophan). (Source: A.J.F. Griffiths
et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002). (b) The peptide group
is a rigid planar unit with the R groups projecting out from the C-N backbone. Standard bond
distances are shown in angstroms (Source: Stryer, L., Biochemistry, W.H. Freeman and
Company, 1995).
The higher an atom’s affinity for electrons, the higher its
electronegativity. The slight separation of charges within a molecule
contributes to hydrogen bonding. Chemicals can be placed in two categories
based on their affinity or non-affinity for water: hydrophilic (literally
‘water friendly’) or hydrophobic (literally ‘afraid of water’).
3.5 PROTEINS AND AMINO ACIDS
Proteins are the molecular machinery that regulates and executes nearly every
biological function. Proteins are made up of amino acids. Each amino acid has
a backbone consisting of an amino (-NH2) group, an alpha carbon, and a
carboxylic acid or carboxylate (-COOH) group. To the alpha carbon, a side
chain is attached. The side chains vary with each amino acid. These side
chains confer unique stereochemical properties on each amino acid.
The amino acids are often grouped into three categories. (i) The
hydrophobic amino acids, which have side chains composed mostly or
entirely of carbon and hydrogen, are unlikely to form hydrogen bonds with
water molecules. (ii) The polar amino acids, which often contain oxygen and/
or nitrogen in their side chains, form hydrogen bonds with water much more
readily. (iii) The charged amino acids carry a positive or negative charge at
biological pH.
The order of the amino acids in a protein’s primary sequence plays an
important role in determining its secondary structure, its tertiary
structure, its physical and chemical properties and ultimately its
biological function.
A chain of several amino acids is referred to as a peptide. Longer chains
are polypeptides. When two amino acids are covalently joined, one of the
amino acids loses a hydrogen (H+) from its amino group, while the other loses
an oxygen and a hydrogen (OH-) from its carboxyl group, forming a carbonyl
(C=O) group (and water, H2O). The result is a dipeptide (two amino acids
joined by a peptide bond) and a single water molecule. In a polypeptide, the
amino acids are sometimes referred to as amino acid residues, because some
atoms of the original amino acid are lost as water in the formation of the
peptide bonds.
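Because one water molecule is lost per peptide bond, a chain of n amino acids contains n - 1 peptide bonds and weighs less than the sum of its free amino acids. The sketch below illustrates the bookkeeping for a tripeptide. The molecular masses are approximate average values supplied for this example (they are not taken from the text), and the function name is hypothetical.

```python
# Approximate average molecular masses in daltons (illustrative values).
FREE_AMINO_ACID_MASS = {"Gly": 75.07, "Ala": 89.09, "Ser": 105.09}
WATER_MASS = 18.015

def peptide_summary(residues):
    """Return (peptide bonds, waters released, approximate peptide mass)."""
    bonds = max(len(residues) - 1, 0)            # one bond between neighbours
    free_mass = sum(FREE_AMINO_ACID_MASS[r] for r in residues)
    return bonds, bonds, free_mass - bonds * WATER_MASS

bonds, waters, mass = peptide_summary(["Gly", "Ala", "Ser"])
print(bonds, "peptide bonds,", waters, "water molecules released")
print(round(mass, 2), "Da for the tripeptide")
```

Three amino acids give two peptide bonds and two water molecules, the same count shown in Fig. 3.14(a).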
Polypeptides have specific directionality. The amino terminus (or N
terminus) of the polypeptide has an unbonded amino group, while the
carboxy terminus (or C terminus) ends in a carboxylic acid group instead of a
carbonyl. Protein sequences are usually considered to start at the N terminus
and progress towards the C terminus.
The non-side chain atoms of the amino acids (the constant region of each
amino acid) in a polypeptide chain form the protein backbone. The chemistry
of a protein backbone forces most of the backbone to remain planar. The only
movable segments of the protein backbone are the bond from the nitrogen to
the alpha carbon (the carbon atom to which the side chain is attached) and the
bond between the alpha carbon and the carbonyl carbon (the carbon with a
double bond to an oxygen atom). These two chemical bonds allow for circular
or dihedral rotation, and their rotation angles are called phi (Φ) and psi
(Ψ), respectively. Thus a protein consisting of 300 amino acids will have 300
phi and 300 psi angles, often numbered Φ1, Ψ1 up to Φ300, Ψ300. All of the
various conformations attainable by the protein come from the rotations of
these 300 pairs of bonds.
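A dihedral angle such as phi or psi is defined by four consecutive backbone atoms: it is the rotation about the bond joining the two middle atoms. The following sketch computes such an angle from Cartesian coordinates using only the standard library. The function names and the test coordinates are illustrative, not taken from a real structure; for phi the four atoms would be C(i-1), N(i), Cα(i), C(i), and for psi N(i), Cα(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    axis = tuple(x / length for x in b1)          # unit vector along the bond
    # Remove the components along the bond, leaving the two "arms"
    # as seen when looking down the p1-p2 axis.
    v = _sub(b0, tuple(_dot(b0, axis) * x for x in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * x for x in axis))
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))

# Toy coordinates: the first three points are fixed, the fourth is rotated.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))    # eclipsed (cis)
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))    # quarter turn
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))   # opposite (trans)
```

Applying this function to every residue of a structure yields the list of (Φ, Ψ) pairs that describes the backbone conformation.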
Only twenty different amino acids are used to produce the countless
combinations found in the proteins of cells (Table 3.2). The polypeptide chain
folds into a curve in space; the course of this curve is its folding pattern.
Proteins show a great variety of folding patterns. Folding may be thought of as
a kind of intramolecular condensation or crystallization.
Table 3.2: The four naturally occurring nucleotides in DNA and RNA and the 20
naturally occurring amino acids in proteins
The four naturally occurring nucleotides in DNA and RNA
a - adenine    g - guanine    c - cytosine    t - thymine (u - uracil)
The twenty naturally occurring amino acids in proteins
Non-polar amino acids
G - glycine       A - alanine    P - proline          V - valine
I - isoleucine    L - leucine    F - phenylalanine    M - methionine
Polar amino acids
S - serine       C - cysteine     T - threonine    N - asparagine
Q - glutamine    H - histidine    Y - tyrosine     W - tryptophan
Charged amino acids
D - aspartic acid    E - glutamic acid    K - lysine    R - arginine
Other classifications of amino acids can also be useful. For instance, histidine, phenylalanine, tyrosine, and
tryptophan are aromatic, and are observed to play special structural roles in membrane proteins. Amino acid
names are frequently abbreviated to their first three letters, for instance Gly for glycine, except for isoleucine,
asparagine, glutamine and tryptophan, which are abbreviated to Ile, Asn, Gln and Trp, respectively. The rare
amino acid selenocysteine has the three-letter abbreviation Sec and the one-letter code U. It is conventional to
write nucleotides in lower case and amino acids in upper case. Thus atg = adenine-thymine-guanine and
ATG = alanine-threonine-glycine.
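The case convention at the end of Table 3.2 can be shown with two lookup tables built from the one-letter codes listed there. This is a small sketch; the function name `spell_out` is hypothetical, and it assumes the input is written entirely in one case.

```python
NUCLEOTIDES = {
    "a": "adenine", "g": "guanine", "c": "cytosine",
    "t": "thymine", "u": "uracil",
}
AMINO_ACIDS = {
    "G": "glycine", "A": "alanine", "P": "proline", "V": "valine",
    "I": "isoleucine", "L": "leucine", "F": "phenylalanine", "M": "methionine",
    "S": "serine", "C": "cysteine", "T": "threonine", "N": "asparagine",
    "Q": "glutamine", "H": "histidine", "Y": "tyrosine", "W": "tryptophan",
    "D": "aspartic acid", "E": "glutamic acid", "K": "lysine", "R": "arginine",
}

def spell_out(sequence):
    """Lower case is read as nucleotides, upper case as amino acids."""
    table = NUCLEOTIDES if sequence.islower() else AMINO_ACIDS
    return "-".join(table[symbol] for symbol in sequence)

print(spell_out("atg"))   # nucleotide reading
print(spell_out("ATG"))   # amino acid reading
```

The same three letters therefore name either a codon-sized stretch of DNA or a tripeptide, depending only on the case in which they are written.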
Structure
The linear sequence of amino acids in a protein molecule refers to primary
structure. Regions of local regularity within a protein fold (e.g. α-helices,
β-turns, β-strands) refer to secondary structure. Proteins show recurrent
patterns of interaction between helices and sheets close together in the
sequence. These arrangements of α-helices and/or β-strands into discrete
folding units (e.g. β-barrels, β α β-units, Greek keys, etc.) refer to super-
secondary structures (Fig. 3.15).
The overall fold of a protein sequence, formed by the packing of its
secondary and/or super-secondary structure elements refers to tertiary
structure. The arrangement of separate protein chains in a protein molecule
with more than one subunit refers to quaternary structure. The arrangement of
separate molecules such as in protein-protein or protein-nucleic acid
interactions refers to quinternary structure.
Fig. 3.15 Common supersecondary structures (a) α–helix hairpin, (b) β–hairpin, (c) β-α-β
unit. The chevrons indicate the direction of the chain. (Source: Lesk, A.M., Introduction to
Bioinformatics, Oxford University Press).
Domains
Many proteins contain compact units within the folding pattern of a single
chain that look as if they should have independent stability. These are called
domains. In the hierarchy, domains fall between super-secondary structures
and the tertiary structure of a complete monomer. Modular proteins are
multidomain proteins which often contain many copies of closely related domains.
The most general classification of families of protein structures is based
on the secondary and tertiary structures of proteins (Table 3.3).
Motif
The active site of an enzyme, which takes part in catalytic function, occupies
only a small portion of the protein molecule. If the protein is stretched into a
polypeptide chain, the active site region may be found distributed as discrete
patches on the primary structure. Such small conserved regions, which confer
a characteristic local shape on the protein, are called motifs. In nucleic
acids, motifs are short strings of base pairs characteristic of sites
regulating particular events in gene expression or chromosome replication,
such as 5’ splice sites or origins of replication.
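Because a motif is a short conserved pattern in the primary structure, it can be searched for with a regular expression. The sketch below uses the N-glycosylation site pattern from the PROSITE database (N, then any residue except P, then S or T, then any residue except P) purely as an illustration; this pattern and the example sequence are not taken from the text, and the function name is hypothetical.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a regular expression.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (0-based position, matched residues) for each occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein)]

example_protein = "MKNASTPNGTLA"   # made-up sequence
for position, residues in find_motif(example_protein):
    print(position, residues)
```

The same approach works for nucleotide motifs, with a pattern written over the letters a, c, g and t.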
Table 3.3: Class and characteristics of protein structures

Class          Characteristics
α-helical      Secondary structure exclusively or almost exclusively α-helical
β-sheet        Secondary structure exclusively or almost exclusively β-sheet
α+β            α-helices and β-sheets separated in different parts of the
               molecule; absence of β-α-β super-secondary structure
α/β            Helices and sheets assembled from β-α-β units
α-β-linear     Line through centers of strands of sheet roughly linear
α-β-barrels    Line through centers of strands of sheet roughly circular
Little or no secondary structure
Folding Patterns
Within these broad categories, protein structures show a variety of folding
patterns. Among proteins with similar folding patterns, there are families that
share enough features of structure, sequence and function to suggest
evolutionary relationship. Classification of protein structures occupies a key
position in bioinformatics – as a bridge between sequence and function.
The amino acid sequence of a protein dictates its three dimensional
structure. When placed in a medium of suitable solvent and temperature
conditions, like the one provided by a cell interior, proteins fold spontaneously
to their native active states. If amino acid sequences contain sufficient
information to specify three-dimensional structures of proteins, it should be
possible to devise an algorithm to predict protein structure from amino acid
sequence. But this has been difficult. Hence scientists have turned to more
tractable approaches: secondary structure prediction, fold recognition and
homology modeling.
Biochemical Nature
Biochemically, proteins play a variety of roles in life processes: there are
structural proteins (e.g. viral coat proteins, the horny outer layer of human and
animal skin, and proteins of the cytoskeleton); proteins that catalyse chemical
reactions (the enzymes); transport and storage proteins (hemoglobin);
regulatory proteins, including hormones and receptors; signal transduction
proteins; proteins that control genetic transcription; and proteins involved in
recognition, including cell adhesion molecules, and antibodies and other
proteins of the immune system. Proteins are large molecules. In many cases
only a small part of the structure (an active site) is functional, the rest
existing only to create and fix the spatial relationship among the active site
residues.
Chemical Nature
Chemically, protein molecules are long polymers typically containing several
thousand atoms composed of a uniform repetitive backbone (or main chain)
with a particular side chain attached to each residue. The polypeptide chains
of proteins have a main chain of constant structure and side chains that vary
in sequence. The side chains may be chosen, independently, from the set of
20 standard amino acids. It is the sequence of the side chains that gives each
protein its individual structural and functional characteristics.
Chaperones
Some proteins require chaperones to fold, but these catalyze the process, rather
than directing it. Molecular chaperones are helper proteins that ensure that
growing protein chains fold correctly. Chaperones are thought to block
incorrect folding pathways that would lead to inactive products, by preventing
incorrect aggregation and precipitation of unassembled subunits. They
probably bind temporarily to interactive surfaces that are exposed only during
the early stages of protein assembly.
Functions
Proteins serve several vital functions: (i) catalyzing various biochemical
reactions (e.g. enzymes), (ii) acting as messengers (e.g. neurotransmitters),
(iii) acting as control elements that regulate cell reproduction, (iv) directing
growth and development of various tissues (e.g. trophic factors), (v) transporting
oxygen in the blood (e.g. hemoglobin), (vi) defending against diseases (e.g.
antibodies), etc. The function of a protein is determined by its shape.
STUDY QUESTIONS
1. Who coined the word gene?
2. Who isolated nucleic acid first?
3. Who gave the name DNA?
4. What is the contribution of Erwin Chargaff?
5. Who proposed the DNA double helix model?
6. Who proposed one gene-one enzyme hypothesis?
7. What is a chromosome?
8. What is a centromere? Name the different types of centromere.
9. What are the different kinds of RNAs?
10. What is polyadenylation?
11. What is transcription?
12. What is translation?
13. What are the different structures of protein?
14. What is the function of chaperones?
CHAPTER 4
DNA and Protein Sequencing and Analysis
Contributions from the field of biology and chemistry have facilitated an
increase in the speed of sequencing genes and proteins. With the advent of
cloning technology it has become easier to insert foreign DNA sequences into
many systems. Rapid mass production of particular DNA sequences, a
necessary prelude to sequence determination, has also become possible
through this technology.
Oligonucleotide synthesis technology has given researchers the ability to
construct short fragments of DNA with defined sequences. These
oligonucleotides can then be used in probing vast libraries of cDNA to
extract genes containing that sequence. Alternatively, these DNA fragments
can be used in polymerase chain reactions (PCR) to amplify existing
DNA sequences or to modify these sequences.
Two common goals in sequence analysis are to identify sequences that
encode proteins, which determine all cellular metabolism, and to discover
sequences that regulate the expression of genes or other cellular processes.
Some important laboratory techniques which are useful to decipher the
information content of genomes are given below:
Restriction enzymes isolated from bacteria digest double-stranded DNA
molecules at specific base sequences. They shed light on the organization
and sequence of a DNA molecule. Digestion of DNA yields many DNA
fragments. Gel electrophoresis separates these fragments from each other by
size: the sample is loaded at the upper end of a matrix of agarose or
acrylamide (the gel), and an electrical current carries the fragments through
it. The gel is then blotted onto nitrocellulose paper, and the DNA fragments
on the paper are hybridized with specific probes to find the gene fragment of
interest. To generate a sufficient quantity and quality of a specific gene,
the gene is cloned by inserting it into chromosome-like carriers called
vectors that allow its replication in living cells. The cloned copies can then
be purified and used for analysis. Polymerase chain reaction
(PCR) can be used to get large quantities of particular gene regions from very
small quantities. This is a powerful alternative to cloning.
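A restriction digest followed by gel electrophoresis can be imitated on a sequence string. The sketch below uses EcoRI, which recognizes GAATTC and cuts after the first G, as an example enzyme; the choice of enzyme, the input sequence and the function name are assumptions made for illustration, and only one strand is modeled.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut `dna` at every occurrence of `site`, `cut_offset` bases into the site."""
    fragments = []
    start = 0
    position = dna.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[start:])          # whatever follows the last cut
    return fragments

example_dna = "ATGAATTCGGCTAGAATTCTTAA"   # made-up sequence with two EcoRI sites
fragments = digest(example_dna)
print(fragments)

# On a gel the fragments separate by length; shorter ones migrate farther.
band_sizes = sorted((len(f) for f in fragments), reverse=True)
print(band_sizes)
```

Two recognition sites give three fragments, and the list of sizes corresponds to the band pattern read from the gel, from the loading well downwards.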
4.1 GENOMICS AND PROTEOMICS
Genomics is the development and application of molecular mapping,
sequencing, characterization, computation and analysis of entire genomes of
organisms and whole set of gene products. Genome refers to the entire
complement of genetic material in a chromosome set. The analysis of whole
genome gives us new insights into global organization, expression, regulation
and evolution of the hereditary materials (Fig. 4.1).
Structural, Functional and Comparative Genomics

Genomics has three distinct subfields: structural genomics, functional
genomics and comparative genomics. Structural genomics is the genetic
mapping, physical mapping and sequencing of most genomes. Genetic maps
provide molecular landmarks for building the higher-resolution physical and
sequence maps and also provide molecular entry points for researchers
interested in cloning genes.
Physical maps provide a view of how the clones from genomic clone
libraries are distributed throughout the genome. They provide clone resource
for positional cloning. Genome DNA sequences are helpful in describing the
functions of all genes including gene expression and control.
Functional genomics is the global study of the structure, expression
patterns, interactions, and regulation of the RNAs and proteins encoded by the
genome. It is the comprehensive analysis of the functions of genes and
nongene sequences in entire genomes.
Comparative genomics allows the comparison of entire genomes of
different species with the goal of enhancing our understanding of the
functions of each genome, including evolutionary relationships.
[Fig. 4.1 diagram: Genomics divides into whole genomic mapping, comparative genomics and functional genomics; the boxes beneath cover high-resolution genetic maps, physical maps, sequence maps, transcript maps, polypeptide maps, gene conservation and evolution, chromosome evolution, transcript expression, protein expression and interaction maps.]
Fig. 4.1 Genomic analysis: A hierarchical view of genomic analysis (Source: A.J.F. Griffiths
et al., Modern Genetic Analysis: Integrating Genes and Genomes, W.H. Freeman &
Company, New York, 2002).
Approaches to Genome Sequencing
Determination of the complete genomic DNA sequence of an organism allows
attempts to be made to identify all of an organism’s genes and therefore define
its genotype. Special experimental techniques have been devised to carry out
the difficult task of manipulating and characterizing large numbers of genes
and large amounts of DNA.
One approach to genome sequencing is first to generate high resolution
genetic and physical maps of the genome to define segments of increasing
resolution and then to sequence the segments in an orderly manner. Another
approach, the direct shotgun approach, is to break up the genome into random,
overlapping fragments, then to sequence the fragments and assemble the
sequences using computer algorithms.
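The assembly step of the shotgun approach can be illustrated with a toy greedy algorithm that repeatedly merges the two fragments sharing the longest overlap. This is a deliberately simplified sketch: real assemblers must cope with sequencing errors, repeats and both strands. The function names, the minimum overlap of 3 bases and the reads are all made up for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break                                   # nothing left to join
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Three overlapping reads sampled from one short, made-up sequence.
shotgun_reads = ["ACGTTAGC", "ATGGCGTAC", "GCGTACGTT"]
contigs = greedy_assemble(shotgun_reads)
print(contigs)
```

The three random fragments are joined back into a single contiguous sequence (a contig) purely from their overlaps, without any prior map of where each fragment came from.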
Analysis of genomic sequences reveals that each organism has an array
of genes required for basic metabolic processes and genes whose products
determine the specialized function of the organism. Complete genome
sequencing therefore provides a knowledge base on which to build information
about gene and protein expression, but is not sufficient on its own to define the
entire protein component of the organism.
Proteomics
Proteomics is the cataloging and analysis of proteins to determine when a
protein is expressed, how much is made, and with what other proteins it can
interact. The term proteomics indicates proteins expressed by a genome. It is
the systematic analysis of protein profiles of tissues. The word proteome refers
to all proteins produced by a species at a particular time. Proteome varies with
time and is defined as “the proteins present in one sample (tissue, organism,
cell culture) at a certain point in time”.
Proteomics represents the genome at work and it is a dynamic process.
Proteomics can be divided into expression proteomics (the study of global
changes in protein expression) and cell-map proteomics (the systematic study
of protein-protein interactions through the isolation of protein complexes).
There is an increasing interest in proteomics because DNA sequence
information provides only a static snapshot of the various ways in which the
cell might use its proteins whereas the life of the cell is a dynamic process.
Proteins expressed by an organism change during growth, disease and the
death of cells and tissues. Proteomics attempts to catalog and characterize these
proteins, compare variations in their expression levels in healthy and diseased
tissues, study their interactions and identify their functional roles using
leading-edge technology. Proteomics begins with the functionally modified
protein and works back to the gene responsible for its production.
Goals
The goals of proteomics are: (i) to identify every protein in the proteome, (ii) to
determine the sequence of each protein and enter the data into databases
and (iii) to analyse protein levels globally in different cell types and at
different stages in development.
Structural and Functional Proteomics

Proteomics research can be categorized as structural proteomics and
functional proteomics. Structural proteomics or protein expression measures
the number and types of proteins present in normal and diseased cells. This
approach is useful in defining the structure of proteins in a cell. Some of these
proteins may be targets for drug discovery. Functional proteomics is the study
of proteins’ biological activities. An important function of proteins is the
transmission of signals using intricate pathways populated by proteins,
which interact with one another.
There are three main steps in proteome research:
(i) Separation of individual proteins by 2D PAGE
(ii) Identification by mass spectrometry or N-terminal sequencing of
individual proteins recovered from the gel
(iii) Storage, manipulation and comparison of the data using bioinformatics
tools.
Uses
Proteomics will contribute greatly to our understanding of gene function in the
post-genomic era. Differential display proteomics for comparison of protein
levels has potential application in a wide range of diseases. Because it is often
difficult to predict the function of a protein based on homology to other proteins
or even their three-dimensional structure, determination of components of a
protein complex or of a cellular structure is central in functional analysis.
Proteomics will also play an important role for drug discovery and
development by characterizing the disease process directly by finding sets of
proteins (pathways or clusters) that together participate in causing the disease.
Proteomics can be seen as a mass-screening approach to molecular
biology, which aims to document the overall distribution of proteins in cells,
identify and characterize individual proteins of interest, and ultimately
elucidate their relationships and functional roles.
Such direct protein-level analysis has become necessary because the
study of genes, by genomics, cannot adequately predict the structure or
dynamics of proteins, since it is at the protein level that most regulatory
processes take place, where disease processes primarily occur and where most
drug targets are to be found.
4.2 GENOME MAPPING

Before the advent of genomic analysis, genetic knowledge of an
organism usually consisted of relatively low-resolution chromosomal maps and
physical maps of genes producing known mutant phenotypes. Starting with
these genetic linkage maps, whole genome molecular mapping generally
proceeds through several steps of increasing resolution (Fig. 4.2). A genetic
map is a representation of the genetic distance separating genes, derived from
the frequency of genetic recombination between the genes.
Genetic mapping is the process of locating genes to chromosomes and
assigning their relative genetic distances from other known genes. Genetic
maps of genomes are constructed using genetic crosses and for humans,
pedigree analysis. Genetic crosses are used to establish the location of genetic
markers (any allele that can be used to mark a location on a chromosome or a
gene) on chromosomes and to determine the genetic distance between them.
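The link between recombination frequency and genetic distance can be sketched in a few lines. The example below is only an illustration: it assumes the common convention that 1% recombination corresponds to about one map unit (centimorgan), which holds reasonably well only for closely linked markers, and the offspring counts are hypothetical.

```python
def recombination_frequency(recombinants, total):
    """Fraction of scored offspring that are recombinant for two markers."""
    if total <= 0:
        raise ValueError("total offspring must be positive")
    if not 0 <= recombinants <= total:
        raise ValueError("recombinants must lie between 0 and total")
    return recombinants / total


def map_distance_cm(recombinants, total):
    """Approximate map distance, assuming 1% recombination = 1 centimorgan."""
    return 100.0 * recombination_frequency(recombinants, total)


# Hypothetical test cross: 1000 offspring scored, 120 of them recombinant.
distance_ab = map_distance_cm(recombinants=120, total=1000)
print(f"Estimated distance between markers A and B: {distance_ab:.1f} cM")
```

For widely separated markers, double crossovers make this simple proportion underestimate the true distance, so the figure should be read as a first approximation.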
Historically, genes have been used as markers in genetic mapping
experiments. Now, another type of genetic marker, the DNA marker, is used to
develop the genetic map. DNA markers are genetic markers that are detected
using molecular tools that focus on the DNA itself rather than on the gene
product or associated phenotype.
Four types of DNA markers are used in human genomic mapping:
(i) Restriction fragment length polymorphism (RFLP), (ii) Variable number of
tandem repeats (VNTR) (also called minisatellites), (iii) Short tandem repeats
(STR) (also called microsatellite sequences) and (iv) Single nucleotide
polymorphisms (SNP). Hundreds of SNPs can be typed simultaneously
using DNA microarrays.
[Fig. 4.2 schematic, in order of increasing resolution: cytogenetic mapping; genetic high-resolution mapping of the gene relative to molecular markers 1, 2 and 3; physical mapping of cloned fragments; DNA sequencing (example sequence TTAGCTTAACGTACTGGTACCGTACCGTGGCTTAT).]
Fig. 4.2 Overview of the general approaches of whole genome mapping. General scheme
for making a genome map by using analyses at increasing levels of resolution (Source: A.J.F.
Griffiths et al., Modern Genetic Analysis: Integrating Genes and Genomes, W.H. Freeman &
Company, New York, 2002).
4.3 DNA SEQUENCING METHOD
Methods are available to determine the order of nucleotides in DNA. One of the
methods is called chain termination sequencing or dideoxy sequencing or the
Sanger method after its inventor. The basic sequencing reaction consists of a
single-stranded DNA template, a primer to initiate the nascent chain, four
deoxyribonucleoside triphosphates (dATP, dCTP, dGTP and dTTP) and the
enzyme DNA polymerase, which inserts the complementary nucleotides in the
nascent DNA strand using the template as a guide.
Normally four DNA polymerase reactions are set up, each containing a
small amount of one of four dideoxyribonucleoside triphosphates (ddATP,
ddCTP, ddGTP and ddTTP). These act as chain terminating competitive
inhibitors of the reaction. Each of the four reaction mixtures generates a nested
set of DNA fragments, each terminating at a specific base (Fig. 4.3).
[Fig. 4.3 schematic.
(a) Template: 5' GGATTCTGCTACGGA 3', with a primer annealed. Terminated fragments from each reaction, written with the dideoxy (3') end on the left:
Reaction including ddATP: ddATGCCT, ddACGATGCCT, ddAGACGATGCCT, ddAAGACGATGCCT
Reaction including ddCTP: ddCT, ddCCT, ddCGATGCCT, ddCTAAGACGATGCCT, ddCCTAAGACGATGCCT
Reaction including ddGTP: ddGCCT, ddGATGCCT, ddGACGATGCCT
Reaction including ddTTP: ddT, ddTGCCT, ddTAAGACGATGCCT
(b) Gel with lanes A, C, G and T; the bands, read from top to bottom, give CCTAAGACGATGCCT.]
Fig. 4.3 Principle of DNA sequencing (a) Four sequencing reactions are set up, each
containing a limiting amount of one of the four dideoxynucleotides. Each reaction generates
a nested set of fragments terminating with a specific base as shown. (b) A polyacrylamide
gel is shown with each reaction running in a separate lane for clarity. In a typical automated
reaction, all reactions would be pooled prior to electrophoresis and the terminal nucleotide
determined by scanning for a specific fluorescent tag. (Source: Twyman, R.M., Advanced
Molecular Biology, BIOS Scientific Publishers Ltd., 1998).
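The logic of Fig. 4.3 can be reproduced with a short simulation. This is a minimal sketch, not a model of the real chemistry: it ignores the primer length (as the figure does), assumes that termination is possible at every occurrence of a base, and uses function names chosen only for this illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def nascent_strand(template_5to3):
    """Return the strand synthesised on the template, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(template_5to3))


def terminated_lengths(template_5to3):
    """For each ddNTP reaction, list the chain lengths at which termination can occur."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(nascent_strand(template_5to3), start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the bands from the shortest fragment to the longest to recover the sequence."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "GGATTCTGCTACGGA"          # template of Fig. 4.3, 5' to 3'
lanes = terminated_lengths(template)
print("ddATP lane fragment lengths:", lanes["A"])
print("Sequence read from the gel: ", read_gel(lanes))
```

Because the shortest fragments run furthest, reading the gel from the bottom upwards gives the newly synthesised strand in the 5' to 3' direction; the template is its reverse complement.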
Automated Methods
Most DNA sequencing reactions are automated these days. Each reaction
mixture is labeled with a different fluorescent tag (on either the primer or on
one of the nucleotide substrates), which allows the terminal base of each
fragment to be identified by a scanner. All four reaction mixtures are then
pooled and the DNA fragments are separated by polyacrylamide gel
electrophoresis (PAGE). Smaller DNA fragments travel faster than the larger
ones.
Thus the nested DNA fragments are separated according to size. The
resolution of PAGE allows polynucleotides differing in length by only one
residue to be separated. Near the bottom of the gel, the scanner scans the
fluorescent tag as each DNA fragment moves past, and this is converted into
trace data, displayed as a graph comprising colored peaks corresponding to
each base (Fig. 4.4).
[Fig. 4.4 trace; called bases: A C C A G C G G C T C T]
Fig. 4.4 A sample of a high quality sequence trace, where all peaks are easily called.
Peaks are typically printed in different colors (shown here as different line styles) to aid visual
interpretation. Software such as Phred is used to read the peaks and assign quality values (A
= dark line; C = lighter; G = dotted line; T = dark line with breaks). (Source: Westhead, D.R.
et al., Instant Notes: Bioinformatics, Bios Scientific Publishers Ltd., 2003)
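The quality values mentioned in the caption are usually expressed on the Phred scale, in which a quality Q is related to the estimated probability P that a base call is wrong by Q = -10 log10(P). The sketch below simply converts between the two; the function names are our own and are not part of the Phred program.

```python
import math


def phred_quality(error_probability):
    """Convert an estimated base-calling error probability into a Phred-scale quality."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(quality):
    """Convert a Phred-scale quality back into an error probability."""
    return 10.0 ** (-quality / 10.0)


for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 wrong call in {round(1 / error_probability(q))} bases")
```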
DNA sequences are stored in databases. Genomic DNA sequences, copy
DNA (cDNA) sequences and recombinant DNA sequences are available in
databases. Genome sequencing is done using shotgun sequencing or clone
contig strategies. Many different programs, such as Phred, Vector-clip,
CrossMatch, RepeatMasker, Phrap and Staden Gap4, have been used in quality
control of sequences.
The arrival of high-throughput automated fluorescent DNA sequencing
technology has led to the rapid accumulation of sequence information; it
provides the basis for abundant computationally derived protein sequence
data. Analysis of DNA sequence underpins a number of aspects of research;
these include, for example, detection of phylogenetic relationships; genetic
engineering using restriction site mapping; determination of gene structure
through intron/exon prediction; inference of protein coding sequence
through open reading frame (ORF) analysis, etc.
Exons, Introns and CDS

The central dogma states that DNA is transcribed into RNA, which is then
translated into protein. In eukaryotic systems, exons form a part of the final
coding sequence (CDS), whereas introns, though transcribed, are edited out by
the cellular machinery before the mRNA assumes its final form (Fig. 4.5). DNA
sequence databases typically contain genomic sequence, which includes
information at the level of the untranslated sequence, introns and exons,
mRNA, cDNA and translations.
Untranslated regions (UTRs) occur both in DNA and RNA; they are
portions of the sequence flanking CDS that are not translated into protein.
Untranslated sequence, particularly at the 3' end, is highly specific both to the
gene and to the species from which the sequence is derived.
[Fig. 4.5 schematic: sense-strand genomic DNA, 5' to 3' (5' UTR, exon, intron, exon, intron, exon, 3' UTR); transcription gives the mRNA (5' UTR, CDS, 3' UTR); translation gives the protein.]
Fig. 4.5 In eukaryotic systems exons form a part of the final coding sequence (CDS),
whereas introns are transcribed, but are then edited out by the cellular machinery before the
mRNA assumes its final form. Here, the gene is made up of three exons and two introns.
Exons, unlike coding sequences, are not simply terminated by stop codons, but rather by
intron-exon boundaries; the untranslated regions (UTRs) occur at either end of the gene; if
transcription begins at the 5' end of the sequence, then the 5' UTR contains promoter sites
(such as the TATA box), and the 3' UTR follows the stop codon. (Source: Attwood, T.K. and
Parry-Smith, D.J., Introduction to Bioinformatics, Pearson Education Ltd., 2001)
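The relationship between genomic sequence, exons and the CDS can be illustrated with a toy gene of three exons and two introns, as in Fig. 4.5. The sequences below are invented for the example, and for simplicity the UTRs are kept separate from the exons, as they are drawn in the figure.

```python
# Toy gene, invented for illustration: (region type, sequence) in genomic order.
segments = [
    ("5'UTR", "GGCACC"),
    ("exon", "ATGGCT"),
    ("intron", "GTAAGTTTTCAG"),
    ("exon", "GAAACC"),
    ("intron", "GTATGTCCCTAG"),
    ("exon", "TGGTAA"),
    ("3'UTR", "AATAAA"),
]

genomic = "".join(sequence for _, sequence in segments)

# Record exon coordinates (0-based, end-exclusive), as a database entry might.
exons = []
position = 0
for region, sequence in segments:
    if region == "exon":
        exons.append((position, position + len(sequence)))
    position += len(sequence)


def splice(genomic_sequence, exon_coordinates):
    """Join the exon intervals in order, discarding introns and flanking sequence."""
    return "".join(genomic_sequence[start:end] for start, end in exon_coordinates)


coding_sequence = splice(genomic, exons)
print(exons)
print(coding_sequence)
```

Note that the stop codon TAA appears only at the end of the joined CDS; the individual exons end at intron-exon boundaries, as the caption points out.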
Primer Design
The location of the primers on a DNA source will be determined relative to the
start and stop codons of the gene. The default option will find the ‘forward’
primer of a given length that resides within the first 35 basepairs upstream of
the coding sequence. The default option will also find the ‘reverse’ primer that
resides within 35 basepairs immediately following the coding sequence. We
can alter the endpoints of either of these by changing the number in the
‘Distance from the Start’ and ‘Distance from the Stop’ fields. We can also define
the exact 5' endpoints of the primers by selecting the button marked ‘YES’ on
the line which asks about the exact endpoints.
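The default behaviour described above can be sketched as follows. This is a simplified illustration and not the algorithm of any particular primer design program: it assumes that the forward primer ends immediately before the start codon and that the reverse primer is the reverse complement of the bases immediately after the stop codon, and it ignores melting temperature, GC content and other criteria that real tools apply. All names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string containing only A, C, G and T."""
    return seq.translate(COMPLEMENT)[::-1]


def flanking_primers(dna, cds_start, cds_end, primer_len=20, window=35):
    """Pick primers that flank a CDS.

    cds_start is the index of the first base of the start codon and cds_end the
    index just past the stop codon (0-based, end-exclusive).
    """
    if primer_len > window:
        raise ValueError("primer must fit inside the flanking window")
    if cds_start < primer_len or cds_end + primer_len > len(dna):
        raise ValueError("not enough flanking sequence for the primers")
    forward = dna[cds_start - primer_len:cds_start]
    reverse = reverse_complement(dna[cds_end:cds_end + primer_len])
    return forward, reverse


# Invented example: 20 bp upstream, a tiny CDS, 20 bp downstream.
upstream = "AAAACCCCGGGGTTTTACGT"
cds = "ATGAAATAA"
downstream = "ACGTACGTACAAAACCCCGG"
dna = upstream + cds + downstream

forward, reverse = flanking_primers(dna, len(upstream), len(upstream) + len(cds))
print("Forward primer:", forward)
print("Reverse primer:", reverse)
```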
Procedure
Open the Internet browser and type the URL address:
http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi. Paste the sequence in
the text box. Choose the primer. Click the left and right primer. Press the ‘Pick
Primer’ button and the result will be displayed in a new page.
4.4 OPEN READING FRAME (ORF)

ORFs are stretches of DNA sequence uninterrupted by codons which would
cause protein synthesis to fail, and which are bounded by appropriate start
and stop signals. An ORF is a nucleotide sequence without a ‘stop’ signal that
encodes some minimal number of amino acids (about 100). In prokaryotes,
identifying ORFs is fairly straightforward. In eukaryotes, because of introns
and exons, assignment of ORFs is complicated.
Which is the correct reading frame for translation? The longest frame
uninterrupted by a stop codon (TGA, TAA or TAG) is normally supposed to be
the correct reading frame. Such a frame is known as an open reading frame
(ORF). Finding the end of an ORF is easier than finding its beginning.
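A simple ORF search along these lines can be sketched as below. The sketch makes several assumptions that a real gene finder would not: it scans only the three forward reading frames of the given strand, requires an ATG start codon, and uses a toy sequence, so the minimum length is set to one codon rather than the 100 or so amino acids mentioned above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (frame, start, end) for each ATG...stop stretch on the given strand.

    Coordinates are 0-based and end-exclusive; end includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


def longest_orf(dna, min_codons=1):
    """Return the longest ORF found, or None if there is none."""
    orfs = find_orfs(dna, min_codons)
    return max(orfs, key=lambda orf: orf[2] - orf[1]) if orfs else None


sequence = "CCATGGCTGAAACCTGGTAAGC"   # invented example
best = longest_orf(sequence)
print(best, sequence[best[1]:best[2]])
```

A fuller search would also scan the reverse complement, giving six frames in all, and would raise `min_codons` to filter out short chance ORFs.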
We may use several features as indicators of potential protein coding
regions in DNA. One of these is sufficient ORF length. Recognition of flanking
Kozak sequences may also be helpful in pinpointing the start of the CDS
(Fig. 4.6). Patterns of codon usage differ in coding and non-coding regions.
[Fig. 4.6 schematic: a cDNA drawn 5' to 3', with the EST, the CDS and the UTR marked.]
Fig. 4.6 When constructing a library, complementary DNA (cDNA) is run off from the
mRNA stage, using reverse transcriptase. ESTs are then generated using a single read of
each clone on an automated sequencing system. In the mRNA, the start codon may be
flanked by a Kozak sequence, which gives additional confidence to the prediction of the
start of the CDS. (Source: Attwood, T.K. and Parry-Smith, D.J., Introduction to
Bioinformatics, Pearson Education Ltd., 2001)
Specifically, the use of codons for particular amino acids varies
according to species, and codon-use rules break down in regions of sequence
that are not destined to be translated. Thus, codon-usage statistics can be used
to infer both 5' and 3' untranslated regions and to assist the detection of
mistranslations, because there is an uncharacteristically high representation of
rarely used codons in these regions.
Table 4.1 illustrates the considerable variability in the selection of codons
that different organisms employ for a particular amino acid. In addition to
their characteristic pattern of codon usage, many organisms show a general
preference for G or C over A or T in the third base (wobble) position of a codon.
The consequent bias towards G/C in this base can further contribute to
diagnosis of ORFs.
Table 4.1: Percentage use of codons for serine in a variety of model organisms.
There are six possible codons for serine, which in principle could be used with
equal frequency whenever serine is specified in a CDS. In practice, however,
organisms are highly selective in the particular codons they use. The characteristic
differences in usage reflected here can be used to help diagnose regions of DNA
that may code for protein.
Codon   E. coli   D. melanogaster   H. sapiens   Z. mays   S. cerevisiae
AGT        3            1               10          4            5
AGC       20           23               34         30            4
TCG        4           17                9         22            1
TCA        2            2                5          4            6
TCT       34            9               13          4           52
TCC       37           42               28         37           33
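Statistics of the kind shown in Table 4.1, and the G/C preference in the third codon position, can be computed directly from a coding sequence. The following is a minimal sketch on an invented CDS; it assumes the sequence is already in the correct reading frame, and the function names are ours.

```python
from collections import Counter

SERINE_CODONS = ("AGT", "AGC", "TCG", "TCA", "TCT", "TCC")


def split_codons(cds):
    """Split an in-frame CDS into complete codons, dropping any trailing bases."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def serine_codon_usage(cds):
    """Percentage use of each of the six serine codons in the CDS."""
    counts = Counter(codon for codon in split_codons(cds) if codon in SERINE_CODONS)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in SERINE_CODONS}
    return {codon: 100.0 * counts[codon] / total for codon in SERINE_CODONS}


def gc3_fraction(cds):
    """Fraction of codons with G or C in the third (wobble) position."""
    codons = split_codons(cds)
    if not codons:
        return 0.0
    return sum(codon[2] in "GC" for codon in codons) / len(codons)


toy_cds = "ATGTCTTCCAGCTCTTCCTAA"   # invented: Met, five serines, stop
usage = serine_codon_usage(toy_cds)
print(usage)
print(f"GC3 fraction: {gc3_fraction(toy_cds):.2f}")
```

Comparing such figures for a candidate region against the known profile of the organism is one way of judging whether the region is likely to be coding.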
In the region upstream of the start codon of prokaryotic genes, detection
of ribosome binding sites, which help to direct ribosomes to the correct
translation start positions, is considered to be a powerful ORF indicator.
One consequence of the presence of exons and introns in eukaryotic
genes is that potential gene products can be of different lengths, because not all
exons may be represented in the final transcribed mRNA (although the order
of exons that are included is preserved).
When the mRNA editing process results in different translated
polypeptides, the resulting proteins are known as splice variants or
alternatively spliced forms. Thus, results of database searches with cDNA or
mRNA (transcription level information) that appear to indicate substantial
deletions in matches to the query sequence could, in fact, be the result of
alternative splicing.
4.5 DETERMINING SEQUENCE OF A CLONE

A clone is a copied fragment of DNA maintained in circular form identical to
the template from which it is derived. The process of determining the
nucleotide sequence of a clone also helps in the analysis of DNA sequences. In
an experiment to clone a specific gene whose sequence is already known, it is
necessary to check that the cloned sequence is indeed identical to the
published one.
A cDNA clone is synthesized using mRNA as a template. The clone is
then sequenced by designing primers to known oligonucleotides present in the
cloning vector flanking the inserted DNA. When the primers hybridize to the
corresponding sequences, they are extended in a chain synthesis reaction
using the inserted sequence as template (Fig. 4.7).
The reaction is terminated by the incorporation of a dideoxynucleotide
(ddATP, ddTTP, ddGTP, or ddCTP). Not all the chains terminate at the same
base, since normal bases (dATP, dTTP, dGTP or dCTP) are also present in the
reaction mixture.
The result is a series of fragments for each primer, all of different lengths
because they have been terminated at different base positions. The generated
fragments are run on standard radioactive sequencing gels, or fluorescent
sequencing machines, as appropriate, to determine the order of bases in a
sequence. The assembler program builds a consensus sequence for the clone,
according to a weighting given to each nucleotide position in the sequence.
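The idea of a weighted consensus can be sketched as a vote at each position. The example assumes that the reads have already been aligned to the same length and that each base carries a numerical weight (for instance a quality value); how a particular assembler actually derives its weights is not specified here, so the scheme below is only an illustration.

```python
from collections import defaultdict


def weighted_consensus(aligned_reads):
    """Build a consensus from (sequence, weights) pairs of equal, aligned length.

    At each position the base with the largest summed weight wins; positions
    with no callable base in any read are reported as 'N'.
    """
    length = len(aligned_reads[0][0])
    for sequence, weights in aligned_reads:
        if len(sequence) != length or len(weights) != length:
            raise ValueError("all reads and weight lists must have the same length")
    consensus = []
    for position in range(length):
        votes = defaultdict(float)
        for sequence, weights in aligned_reads:
            base = sequence[position]
            if base in "ACGT":
                votes[base] += weights[position]
        consensus.append(max(votes, key=votes.get) if votes else "N")
    return "".join(consensus)


# Invented reads: the second read has a low-weight miscall at the third base.
reads = [
    ("ACGTN", [30, 30, 30, 30, 0]),
    ("ACCTA", [30, 30, 10, 30, 20]),
    ("ACGTA", [30, 30, 25, 30, 20]),
]
print(weighted_consensus(reads))
```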
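[Fig. 4.7 schematic: (a) a chain synthesised on the template DNA and terminated by ddGTP; (b) a family of chains, each terminated by ddGTP at a different position, opposite the C residues of the template.]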
Fig. 4.7 Template DNA sequencing: (a) chain synthesis and termination by incorporation of
ddGTP; (b) the family of chains terminated at different positions by ddGTP. Since G pairs
with C, the template sequence contains C at each of these positions.
Whole genome shotgun sequencing assembly (Fig. 4.8) is also used to
sequence clones from the physical map of a genome. In whole genome shotgun
sequencing, the portions of the inserts adjacent to the junction points with
vector sequences are sequenced from a great many random clones throughout
the genome, and the overlapping sequence information is used to assemble the
sequence of the entire genome and to reconstruct the physical map of the
clones.
The rapid accumulation of DNA sequence data has been expedited by
the introduction of fluorescent sequencing technology. A larger number of
sequencing reactions can be carried out and the protocols are more readily
adapted to automation. When the reactions are run on a fluorescent
sequencing gel, computers are used to interpret the laser-activated fluorescence
and convert it into a digital form suitable for further analysis.
[Fig. 4.8 schematic: overlapping sequence reads are assembled into contigs; paired-end reads link the contigs into a scaffold: sequenced contig 1, gap, sequenced contig 2, gap, sequenced contig 3.]
Fig. 4.8 Whole genome shotgun sequencing assembly. First, the unique sequence overlaps
between sequence reads are used to build contigs. Paired-end reads are then used to span
gaps and to order and orient the contigs into larger units called scaffolds. (Source: A.J.F.
Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002)
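The first step in the caption, building contigs from overlapping reads, can be illustrated with a greedy merge. This is a teaching sketch only: it assumes error-free reads from one strand and exact overlaps, and it does not attempt the scaffolding step with paired-end reads. Real assemblers such as Phrap use far more elaborate methods.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_contigs(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest exact overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Invented reads sampled from the sequence TTAGCTTAACGTACTGGTACC.
reads = ["TTAGCTTAAC", "CTTAACGTACT", "GTACTGGTACC"]
print(greedy_contigs(reads))
```

Reads that share no overlap remain as separate contigs, which is the situation in which paired-end information is needed to order them into a scaffold.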
Typically 36 lanes are run on a gel at once. The output consists of a series
of (colour-coded) peaks, beneath which is a string of base symbols. Sometimes
the software that interprets the chromatogram is unable to determine which base
should be called at a specific position, so a ‘-’ appears. Such ambiguous positions
are replaced by ‘N’ in the resulting sequence file.
4.6 EXPRESSED SEQUENCE TAGS

An expressed sequence tag (EST) is a partial sequence of a clone, randomly selected
from a cDNA library and used to identify genes expressed in a particular
tissue. We do not always have the full-length DNA sequences; a large part of
currently available DNA data is made up of partial sequences, the majority of
which are ESTs.
In analyzing ESTs some points should be kept in mind: (i) The EST
alphabet is five characters, ACGTN. (ii) There may be phantom INDELS
resulting in translation frame shifts. (iii) The EST will often be a sub-sequence
of any other sequence in the databases. (iv) The EST may not represent part of
the CDS of any gene.
How an EST is sequenced is shown in Fig. 4.9. A cDNA library is
constructed from a tissue or cell line of interest. The mRNA is isolated from
tissue or cell. The mRNA is then reverse-transcribed into cDNA, usually with
an oligo (dT) primer, so that one end of the cDNA insert derives from the polyA
tail at the end of the mRNA. The other end of the cDNA is normally within the
coding sequence but may be in the 5’ untranslated region if the coding
sequence is short. The resulting cDNA is cloned into a vector.
Individual clones are picked from the library, and one sequence is
generated from each end of the cDNA insert. Thus, each clone normally has a
5’ and 3’ EST associated with it. Because ESTs are short, they generally
represent only fragments of genes and not complete coding sequences. A
typical EST will be between 200 and 500 bases in length.
The EST production process is normally highly automated and typically
involves use of a fluorescent laser system that reads the sequencing gels. The
resulting sequences are downloaded to a computer system for further analysis.
Does this EST represent a new gene? To answer this question, a DNA
database search is usually performed. If the result shows a significant
similarity to a database sequence, the normal procedure for classifying the hit
will determine whether a novel gene has been found. If however, the result
shows no significant similarity, we cannot immediately assume that a new
gene has been discovered; it may be that the EST represents non-coding
sequence from a known gene that simply is not in the database.
Many mRNAs (especially in humans) have long untranslated regions at the
5’ and 3’ ends of the CDS. It is possible for an EST to be entirely from one of
these noncoding regions. If we are lucky, the section of untranslated
(noncoding) sequence will already be in the database. If it is, a direct match
will be found, as untranslated regions are highly conserved and specific to
their coding gene.
[Fig. 4.9 flowchart: cell or tissue; isolate mRNA and reverse transcribe into cDNA; clone cDNA into a vector to make a cDNA library; pick individual clones; sequence the 5' and 3' ends of the cDNA insert (5' EST and 3' EST); deposit the EST sequences in dbEST.]
Fig. 4.9 Overview of how ESTs are constructed. (Source: Wolfsberg, T.G. and Landsman, D.,
Expressed Sequence Tags (ESTs), in Bioinformatics: a practical guide to the analysis of
genes and proteins (eds) Baxevanis, A.D. and Francis Ouellette, B.F., John Wiley & Sons,
Inc, 2002)
If we are unlucky, no match will be found, indicating one of two
possibilities: either (i) the EST represents a CDS for which there is no similar
sequence in the database (still a distinct possibility), or (ii) it represents a
non-coding sequence that is not in the database. It is critical to the interpretation of
EST analysis that a distinction is made between these two situations
(Fig. 4.10).
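[Fig. 4.10 schematic: genomic DNA with exons 1 to 4; the cDNA (5' to 3') aligned to the exons, with exon boundaries at cDNA bases 240/241, 528/529 and 696/697 and the last base at 816; the 5' EST and 3' EST aligned to the two ends of the cDNA.]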
Fig. 4.10 The alignment of fully sequenced cDNAs and ESTs with genomic DNA. The solid
lines indicate regions of alignment; for the cDNA, these are the exons of the gene. The dots
between segments of cDNA or ESTs indicate regions in the genomic DNA that do not align
with cDNA or EST sequences; these are the locations of the introns. The numbers above the
cDNA line indicate the base coordinates of the cDNA sequence, where base 1 is the 5'-most
base and base 816 is the 3'-most base of the cDNA. For the ESTs, only a short
sequence read from either the 5' or 3' end of the corresponding cDNA is obtained. This
establishes the boundaries of the transcription unit, but it is not informative about the
internal structure of the transcript unless the EST sequences cross an intron (as is true for
the 3' EST depicted here). (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H.
Freeman and Company, 2002).
4.7 PROTEIN SEQUENCING

Direct RNA sequencing involves the chemical characterization of modified
nucleotides. The most sensitive comparisons between sequences are done at
the protein level; detection of distantly related sequences is easier in protein
translation, because the redundancy of the genetic code of 64 codons is
reduced to 20 distinct amino acids, the functional building blocks of
proteins. Because proteins are a functional abstraction of genetic events that
occur in DNA, the loss of degeneracy at this level is accompanied by a loss of
information that relates more directly to the evolutionary process.
In the past, direct protein sequencing was carried out using a process called
Edman degradation, in which the terminal residue of a protein is labeled,
removed and then identified using a series of chemical tests. Current methods
of protein sequencing rely on mass spectrometry (MS), a technique in which
the mass/charge ratio (m/e or m/z) of ions in a vacuum is accurately
determined, allowing molecular masses to be calculated.
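How a molecular mass follows from a measured m/z value can be shown with a small calculation. The sketch assumes positive ions formed by the addition of protons, as is common in electrospray ionization; this ionization mode is our assumption for the example and is not stated in the text above. Under that assumption, m/z = (M + z x 1.007276) / z, where M is the neutral mass in daltons and z the charge.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def mz_from_mass(neutral_mass, charge):
    """m/z of a molecule of the given neutral mass carrying `charge` extra protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def mass_from_mz(mz, charge):
    """Neutral mass recovered from a measured m/z value and an assumed charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)


# Hypothetical peptide of neutral mass 1000.0 Da observed as a doubly charged ion.
observed_mz = mz_from_mass(1000.0, 2)
print(f"m/z = {observed_mz:.6f}")
print(f"recovered mass = {mass_from_mz(observed_mz, 2):.4f} Da")
```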
Determination of Structure
Protein structures can be determined using X-ray crystallography and nuclear
magnetic resonance spectroscopy (NMR). X-ray crystallography involves the
reconstruction of atomic positions based on the diffraction pattern of X-rays
through a precisely orientated protein crystal. Scattered X-rays cause positive
and negative interference, generating an ordered pattern of signals called
reflections.
Structural determination depends on three variables: the amplitude and
phase of the scattering (which depend on the number of electrons in each
atom), and the wavelength of the incident X-rays. The basis of NMR
spectroscopy is that some atoms, including natural isotopes of nitrogen,
phosphorus and hydrogen, behave as tiny magnets and can switch between
magnetic spin states in an applied magnetic field. This is achieved by the
absorbance of low-frequency (radio-frequency) electromagnetic radiation, generating NMR
spectra. Other methods such as magic angle spinning NMR and circular
dichroism spectroscopy are also used.
Prediction
There are three main approaches to secondary structure prediction:
(i) empirical statistical methods that use parameters derived from known 3D
structures; (ii) methods based on physicochemical criteria (e.g., fold
compactness, hydrophobicity, charge, hydrogen bonding potential, etc.) and
(iii) prediction algorithms that use known structures of homologous proteins
to assign secondary structure.
One of the standard empirical statistical methods is that of Chou and
Fasman, which is based on observed amino acid conformational preferences in
non-homologous proteins. But in spite of being a ‘standard’ approach, like all
other methods, its reliability to derive the conformational potentials of the
amino acids has been inadequate. By contrast, for prediction algorithms, the
use of multiple sequence data can improve matters and may yield
enhancements of several percent. Tertiary structure prediction (especially
methods that build on secondary predictions) is still further beyond reach.
4.8 GENE AND PROTEIN EXPRESSION ANALYSIS

The activity of a gene is called gene expression in which the gene is used as a
blueprint to produce a specific protein. Patterns in which a gene is expressed
provide clues to its biological role. All functions of cells, tissues and organs are
controlled by differential gene expression.
Gene expression is used for studying gene function. Knowledge of which
genes are expressed in healthy and diseased tissues would allow us to identify
both the protein required for normal function and the abnormalities causing
disease. This information will help in the development of new diagnostic tests
for various illnesses as well as new drugs to alter the activity of the affected
genes or proteins.
Usually gene expression has been studied at either RNA or protein level
on a gene-by-gene basis using Northern and Western blot techniques. Now
global expression analysis methods are available which study all genes
simultaneously. A simple but expensive technique to analyse at the RNA level
is direct sequence sampling from RNA populations or cDNA libraries or even
from sequence databases.
In a more sophisticated technique called serial analysis of gene expression
(SAGE), very short sequence tags (usually 8-15 nt) are generated from each
cDNA and hundreds of these are joined together to form a concatemer prior to
sequencing. In one sequencing reaction, information on the abundance of
hundreds of mRNAs can be gathered. Each SAGE tag uniquely identifies a
particular gene, and by counting the tags, the relative expression levels of each
gene can be determined (Fig. 4.11).
[Fig. 4.11 flowchart: poly(A) RNA plus biotinylated oligo(dT); cDNA synthesis; NlaIII digestion and capture on streptavidin-coated magnetic beads; division into ‘Pool A’ and ‘Pool B’ (labels: 13 bp, FokI); ligate and PCR amplify; restriction digest with NlaIII, purify ditags and concatenate; clone and sequence. Example concatemer: CATGCCTAGTCAGGCGACTTCACATGCCAAAGTGCTTTCGAGACATGGAAGTCCTACGATCATGGCATG, containing tags 1 to 6 arranged as ditags A, B and C.]
Fig. 4.11 Simplified outline method for serial analysis of gene expression. NlaIII is a frequent-
cutting restriction enzyme used initially to generate the 3' cDNA fragments and provide the
overhang for linker ligation, and later to remove the linkers prior to concatemerization of the
ditags. FokI is a type IIS restriction enzyme with a recognition site in the linker that generates the
SAGE tags by cutting the DNA a few bases downstream. (Source: D.R. Westhead et al., Instant
Notes: Bioinformatics, Bios Scientific Publishers Ltd. 2003)
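Once the tags have been read from the concatemers, estimating relative expression is a matter of counting. The sketch below assumes the tags have already been extracted and that a lookup table from tag to gene is available; both the tags and the gene names are invented for the example.

```python
from collections import Counter


def relative_expression(tags, tag_to_gene):
    """Fraction of all observed tags assigned to each gene.

    Tags missing from the lookup table are pooled under 'unassigned'.
    """
    genes = [tag_to_gene.get(tag, "unassigned") for tag in tags]
    if not genes:
        return {}
    counts = Counter(genes)
    return {gene: count / len(genes) for gene, count in counts.items()}


# Hypothetical tag-to-gene table and tag list.
tag_to_gene = {"CCTAGTCAG": "geneA", "GCGACTTCA": "geneB"}
tags = ["CCTAGTCAG"] * 6 + ["GCGACTTCA"] * 3 + ["CCAAAGTGC"]

expression = relative_expression(tags, tag_to_gene)
for gene, fraction in sorted(expression.items()):
    print(f"{gene}: {fraction:.0%} of tags")
```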
4.8.1 DNA Microarrays
Presently, DNA arrays (DNA chips) are used widely. A DNA microarray or
DNA chip is a dense grid of DNA elements (often called features or cells)
arranged on a miniature support, such as a nylon filter or glass slide. Each
feature represents a different gene. (The specificity of nucleic acid hybridization
is such that a particular DNA or RNA molecule can be labeled (with a
radioactive or fluorescent tag) to generate a probe, and can be used to isolate a
complementary molecule from a very complex mixture, such as whole DNA or
whole cellular RNA.)
The array is usually hybridized with a complex RNA probe, i.e. a probe
generated by labeling a complex mixture of RNA molecules derived from a
particular cell type. The composition of such a probe reflects the levels of
individual RNA molecules in its source. If non-saturating hybridization is
carried out, the intensity of the signal for each feature on the microarray
represents the level of the corresponding RNA in the probe, thus allowing the
relative expression levels of thousands of genes to be visualized simultaneously.
The most widely used method involves the robotic spotting of individual
DNA clones onto a coated glass slide. Such spotted DNA arrays can have a
density of up to 5000 features per square cm. The features comprise double-
stranded DNA molecules (genomic clones or cDNAs) up to 400 bp in length
and must be denatured prior to hybridization (Fig. 4.12).
[Fig. 4.12 flowchart. Array preparation: DNA clones; PCR amplification and purification; robotic printing. Sample preparation: test and reference samples; reverse transcription; label with fluor dyes; hybridize target to microarray. Detection: excitation with laser 1 and laser 2; emission; quantify emission in red and green wavelength bands; analyze relative expression levels by computer.]
Fig. 4.12 The process of differential expression measurement using a DNA microarray.
DNA clones are first amplified and printed out to form a microarray. Test and reference RNA
samples are then reverse transcribed and labeled with different fluor dyes (Cy5 and Cy3),
which fluoresce in different (red, green) wavelength bands. These are hybridized to the
microarray. Fluorescence of each dye is then measured for each sample. (Source: Duggan
D.J. et al., Expression profiling using cDNA microarrays. Nature Genet. 21 (suppl 2): pp
10-14, 1999).
Genechips
Another method is on-chip photolithographic synthesis, in which short
oligonucleotides are synthesized in situ during chip manufacture. These arrays
are known as Genechips. They have a density of up to 1,000,000 features per
square cm, each feature comprising up to 10^9 single-stranded oligonucleotides
25 nt in length. Each gene on a Genechip is represented by 20 features (20
overlapping oligos), and 20 mismatching controls are included to normalize
for nonspecific hybridization.
Fluorescent probes are used for spotted DNA arrays, since different
fluorophores can be used to label different RNA populations. These can be
simultaneously hybridized to the same array, allowing differential gene
expression to be monitored directly. In Genechips, hybridization is carried out
with separate probes on two identical chips and the signal intensities are
measured and compared by the accompanying analysis software.
Data Analysis
The raw data from microarray experiments consists of images from hybridized
arrays. The exact nature of the image depends on the array platform (the type
of array used). DNA arrays may contain many thousands of features.
Therefore, data acquisition and analysis must be automated. The software for
initial image processing is normally provided with the scanner. This allows
the boundaries of individual spots to be determined and the total signal
intensity to be measured over the whole spot (signal volume). The signal
intensity should be corrected for background and control measures should be
included to measure nonspecific hybridization and variable hybridization
across arrays.
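For a two-colour spotted array, the background correction described above is often followed by taking the logarithm of the ratio of the two channels. The sketch below shows one simple way of doing this; the floor value used to avoid taking the logarithm of zero or of a negative number is an arbitrary choice for the example, and real analysis packages apply additional normalization steps.

```python
import math


def log2_ratio(test_signal, test_background, ref_signal, ref_background, floor=1.0):
    """Background-corrected log2(test / reference) for one feature.

    Corrected intensities below `floor` are raised to `floor` so that the
    logarithm is always defined.
    """
    test = max(test_signal - test_background, floor)
    reference = max(ref_signal - ref_background, floor)
    return math.log2(test / reference)


# Hypothetical spot intensities: (test, test background, reference, reference background).
spots = {
    "gene1": (2100.0, 100.0, 600.0, 100.0),
    "gene2": (350.0, 100.0, 1100.0, 100.0),
    "gene3": (900.0, 100.0, 900.0, 100.0),
}
for gene, values in spots.items():
    print(f"{gene}: log2 ratio = {log2_ratio(*values):+.2f}")
```

A value of +2 means a fourfold higher signal in the test sample, and a value of -2 a fourfold lower signal.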
The aim of data processing is to convert the hybridization signals into
numbers which can be used to build a gene expression matrix. The
interpretation of a microarray experiment is carried out by grouping the data
according to similar expression profiles. Clustering is a way of simplifying
large data sets by partitioning similar data into specific groups. Many software
applications are available for implementing microarray data analysis methods
(Table 4.2).
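Grouping genes by similar expression profiles can be illustrated with a very simple procedure. The sketch below treats each row of a small, invented expression matrix as a profile and places a gene in the first cluster whose founding member it correlates with strongly; this "leader" scheme and the threshold of 0.9 are our own choices for illustration, and the programs listed in Table 4.2 use more robust methods such as hierarchical clustering.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat profile carries no pattern to correlate
    return covariance / (spread_x * spread_y)


def cluster_by_correlation(matrix, threshold=0.9):
    """Assign each gene to the first cluster whose first member it correlates with."""
    clusters = []
    for gene, profile in matrix.items():
        for cluster in clusters:
            if pearson(profile, matrix[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


# Invented gene expression matrix: rows are genes, columns are four conditions.
expression_matrix = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.0, 2.2],
}
print(cluster_by_correlation(expression_matrix))
```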
Applications
DNA microarray has the following applications:
(i) Investigating cellular states and processes: Patterns of expression that
change with cellular state can give clues to the mechanisms of the
processes such as sporulation, or the change from aerobic to anaerobic
metabolism.
(ii) Diagnosis of disease: Testing for the presence of mutations can confirm
the diagnosis of a suspected genetic disease, including detection of
late-onset conditions such as Huntington disease, and can determine whether
prospective parents are carriers of a gene that could threaten their
children.
Table 4.2: Internet resources for microarray expression analysis. The first two sites
are very comprehensive and contain hundreds of links to databases, software and
other resources. Two web-based suites of analysis programs are also listed, as well as
some databases that store microarray and other gene expression data.

Sites with extensive links to microarray analysis software and resources

URL: http://smd.stanford.edu
Product(s): Cluster, Xcluster, SAM, Scanalyze, many others
Comments: Extensive list of software resources from Stanford University and other sources, both downloadable and www-based.

URL: http://smd.stanford.edu/resources/databases.shtml
Product(s): Cluster, Cleaver, GeneSpring, Genesis, many others
Comments: Comprehensive list of downloadable and www-based software for microarray analysis and data mining, plus links to gene expression databases.

www-based microarray data analysis

URL: http://www.ebi.ac.uk/expressionprofiler/
Product(s): Expression profiler
Comments: Very powerful suite of programs from the EBI for analysis and clustering of expression data.

URL: http://bioinfo.cnio.es/dnarray/analysis/; http://www.cbs.dtu.dk/biotools/DNAarraytools.php
Product(s): DNA arrays analysis tools
Comments: A suite of programs from the National Spanish Cancer Centre (CNIO) including two sample correlation plot, hierarchical clustering, SOM, neural network and tree viewers.

Microarray databases

URL: http://www.ncbi.nlm.nih.gov/geo/
Product(s): National Centre for Biotechnology Information (NCBI) GEO (Gene Expression Omnibus)
Comments: GEO is a gene expression and hybridization array database, which can be searched by accession number, through the contents page or through the Entrez ProbeSet search interface.

URL: http://www.ebi.ac.uk/arrayexpress/
Product(s): ArrayExpress
Comments: EBI microarray gene expression database. Developed by MGED and supports MIAME.

URL: http://www.ncgr.org/genex/; http://genex.gene-quantification.info/; http://www.informatics.jax.org/mgihome/GXD/aboutGXD.shtml
Product(s): GeneX
Comments: The GeneX gene expression database is an integrated tool set for the analysis and comparison of microarray data.

(iii) Genetic warning signs: Some diseases are not determined entirely and
irrevocably by genotype, but the probability of their development is
correlated with genes or their expression patterns. A person aware of an
enhanced risk of developing a condition can in some cases improve his
or her prospects by adjustments in lifestyle.
(iv) Drug selection: Genetic factors, and in some cases patterns of gene
expression, govern responses to drugs. Detecting them is important in
selecting the optimal treatment.
(v) Classification of disease: Different types of leukemia can be identified by
different patterns of gene expression. Knowing the exact type of disease
is important in selecting optimal treatments.
(vi) Target selection for drug design: Proteins showing enhanced transcription
in particular disease states might be candidates for attempts at
pharmacological intervention (provided that it can be demonstrated, by
other evidence, that enhanced transcription contributes to or is essential
for the maintenance of the disease state).
(vii) Pathogen resistance: Comparisons of genotypes or expression patterns,
between bacterial strains susceptible and resistant to an antibiotic,
point to the protein involved in the mechanism of resistance.
4.8.2 Protein Expression Analysis

Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) is a
well-established biochemical technique in which proteins are separated on the
basis of two separate properties: their isoelectric point (pI), which reflects
their charge, and their molecular mass. Separation in the first dimension is
carried out by isoelectric focusing in an immobilized pH gradient. The pH
gradient is generated by a series of buffers, and an immobilized pH gradient
is produced by covalently linking the buffering groups to the gel, thus
preventing migration of the buffer itself during electrophoresis.
Isoelectric focusing
Isoelectric focusing means allowing proteins to migrate in an electric field until
the pH of the buffer is the same as the pI of the protein. The pI of the protein is
the pH at which it carries no net charge and therefore does not move in the
applied electric field. Next the gel is equilibrated in the detergent sodium
dodecylsulphate (SDS), which binds uniformly to all proteins and confers a
net negative charge. Therefore, separation in the second dimension can be
carried out on the basis of molecular mass.
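The idea of the pI as the pH of zero net charge can be illustrated with a short calculation. The sketch below is only an approximation: it assumes one common textbook set of pKa values, treats every ionizable group independently using the Henderson-Hasselbalch relation, and ignores the effects of protein folding. The function names are illustrative.

```python
# Approximate pKa values (an assumed textbook set; published tables differ).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 9.0, 2.0

def net_charge(sequence, ph):
    """Net charge of a peptide at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # free amino terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # free carboxyl terminus
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge

def isoelectric_point(sequence, tolerance=0.001):
    """Bisection search for the pH at which the net charge is zero.
    The net charge falls steadily as the pH rises, so bisection converges."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

basic_pi = isoelectric_point("KKKK")    # lysine-rich peptide, basic pI
acidic_pi = isoelectric_point("DDDD")   # aspartate-rich peptide, acidic pI
```

A protein placed in the pH gradient stops migrating at the position where the local pH equals the value returned by `isoelectric_point`.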
After the second dimension separation, the protein gel is stained with a
universal dye to reveal the position of all protein spots. Reproducible
separations can then be carried out with similar samples to allow comparison
of protein expression levels. It provides a diagnostic protein fingerprint of any
particular sample (Fig. 4.13).
The stained protein gel is scanned to obtain a digital image. Individual
protein spots are then detected and quantified, and the intensity of the signal
for each spot is corrected for local background. Several algorithms are
available based on Gaussian fitting or Laplacian of Gaussian spot detection.
Spots whose morphology deviates from a single Gaussian shape can be
interpreted using a model of overlapping shapes.
Fig. 4.13 A section from a 2D protein gel. The sample has been separated on the basis of
isoelectric pH (horizontal dimension) and molecular mass (vertical dimension). Each spot
should correspond to a single protein.
Other Methods
A simpler approach is line and chain analysis, in which columns of pixels
from the digital image are scanned for peaks in signal density. This process is
repeated for adjacent pixel columns allowing the algorithm to identify the
centers of spots and their overall signal intensity. Another method is known as
watershed transformation. In this method, pixel intensities are viewed as a
topographical map so that hills and valleys can be identified. This is useful for
separating clusters, chains and small spots overlapping with larger ones
(shouldered spots) and also for merging regions of a single spot.
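A much simplified sketch of the column-scanning idea is given below. The small image, the intensity threshold and the function names are invented for illustration; real spot detection software works on large scanned images and handles noise, background and overlapping spots far more carefully.

```python
# Hypothetical 5 x 6 grey-scale image of a gel (higher value = darker stain).
image = [
    [0, 1, 0, 0, 0, 0],
    [1, 5, 1, 0, 2, 0],
    [2, 9, 2, 0, 7, 1],
    [1, 5, 1, 0, 2, 0],
    [0, 1, 0, 0, 0, 0],
]

def column_peaks(image, min_intensity=3):
    """Scan each pixel column and return (row, column, intensity) for every
    pixel that is stronger than the pixels directly above and below it."""
    peaks = []
    n_rows, n_cols = len(image), len(image[0])
    for col in range(n_cols):
        column = [image[row][col] for row in range(n_rows)]
        for row in range(1, n_rows - 1):
            value = column[row]
            if (value >= min_intensity
                    and value > column[row - 1]
                    and value > column[row + 1]):
                peaks.append((row, col, value))
    return peaks

def spot_centres(image, min_intensity=3):
    """Keep a column peak only if it is also stronger than its neighbours
    in the adjacent pixel columns, i.e. it is the centre of a spot."""
    centres = []
    n_cols = len(image[0])
    for row, col, value in column_peaks(image, min_intensity):
        left = image[row][col - 1] if col > 0 else 0
        right = image[row][col + 1] if col < n_cols - 1 else 0
        if value > left and value > right:
            centres.append((row, col, value))
    return centres

centres = spot_centres(image)
print(centres)
```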
The output of each method is a spot list. Differential protein expression
can also be analysed using 2D-PAGE. This can be used to look for proteins
that are induced or repressed by particular treatments or drugs, to look for
proteins associated with disease states, or to look at changes in protein
expression during development. Once protein expression data have been
recorded, they are built into a protein expression matrix. The results from 2D-
PAGE experiments are generally stored in 2D-PAGE databases. They can be
found at:
http://www.ucl.ac.uk/ich/services/labservices/mass_spectrometry/proteomics/technologies/2d_page
http://world-2dpage.expasy.org/swiss-2dpage/
4.8.3 Gene Discovery

Lately, substantial financial resources have been spent on the search for
genes that may be linked to particular types of diseases. The objective is to
develop new therapies with which to combat a wide variety of prevalent
disorders, such as cancer, tuberculosis, asthma, etc. There are two main
strategies for discovering proteins that may represent suitable molecular
targets, whether for small-molecule drug discovery or for gene therapy.
Approaches
One approach for discovering disease-related genes is the technique of
positional cloning. Here the chromosome linked to the disease in question is
identified by analyzing a population of people, some of whom exhibit the
disease. Once a link to a chromosomal region is established, a large part of the
chromosome in the vicinity of the region (locus) is sequenced, yielding several
megabases of DNA. Such a locus can contain many genes, only one of which is
likely to be involved in some way in the disease process.
Sequence searching and gene prediction techniques can be used to
increase the efficiency of gene identification in the locus, but ultimately several
genes will need to be expressed, and further experimentation (or validation)
will be required to confirm which gene is actually involved in the disease.
Although genes discovered in this way can be very illuminating from an
academic point of view, they do not necessarily represent good drug targets (or
points of therapeutic intervention).
Another approach to gene discovery, requiring much less sequencing
effort and relying more heavily on the powerful search capabilities of current
computer systems, examines the genes that are actually expressed in healthy
and diseased tissues. This allows a comparison to be performed between the
two states, and a process of reasoning applied to arrive at a potential drug
target in a more direct way. This process analyses the mRNAs, which are used
by the cellular machinery as a template for the construction of the proteins
themselves.
Gene Finding
Gene finding generally involves detecting elements such as splice sites,
start and stop codons, branch points, promoters and terminators of
transcription, polyadenylation sites, ribosome binding sites, topoisomerase II
binding sites, topoisomerase I cleavage sites and various transcription factor
binding sites. Local sites like these are called ‘signals’ and are detected by
‘signal sensors’. In contrast, extended and variable-length sequences such as
exons and introns are called ‘contents’ and are detected by ‘content sensors’.
The most sophisticated signal sensors in use are neural nets. The most
commonly used content sensor is one that predicts coding regions.
Several systems that combine signal and content sensors have been
developed in an attempt to identify complete gene structure. Such systems are
capable of handling more complex interdependencies between gene features.
Genelaug is one of the earliest integrated gene finders. It uses dynamic
programming to combine scored regions and sites into a complete gene
prediction with a maximal total score.
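The principle of combining scored regions into a prediction with a maximal total score can be sketched with a small dynamic programming routine. The candidate exons and their scores below are invented, and the only rule enforced is that chosen exons must not overlap; a real gene finder would also check reading frame, splice site compatibility and other constraints. This is not the algorithm of any particular program.

```python
from bisect import bisect_right

def best_exon_chain(candidates):
    """candidates: list of (start, end, score) with inclusive coordinates.
    Returns (total_score, chain) for the non-overlapping set of candidate
    exons whose scores add up to the largest total."""
    exons = sorted(candidates, key=lambda exon: exon[1])
    ends = [exon[1] for exon in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end before this exon starts.
        j = bisect_right(ends, start - 1, 0, i)
        with_exon = (best[j][0] + score, best[j][1] + [(start, end, score)])
        without_exon = best[i]
        best.append(with_exon if with_exon[0] > without_exon[0] else without_exon)
    return best[-1]

# Hypothetical scored candidate exons from signal and content sensors.
candidates = [(10, 50, 4.0), (40, 90, 6.0), (60, 120, 5.0), (130, 200, 3.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)
```

Although the exon at 40 to 90 has the highest individual score, it overlaps two other candidates, so the best overall prediction leaves it out.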
A further development of these dynamic programming methods is a class of
models that include a latent or hidden variable associated with each
nucleotide, representing the functional role or position of that nucleotide.
These models are called hidden Markov models (HMMs). The most popular
statistical methods used for gene finding are Markov models, as used in the
GeneMark program. Some of the important gene finding HMMs include
Ecoparse, Expound, etc. A list of computational gene finding datasets and
gene finders is given in Table 4.3.
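How a hidden state is assigned to each nucleotide can be shown with a toy two-state HMM decoded by the Viterbi algorithm. All probabilities below are invented for illustration (coding regions are simply assumed to be GC-rich); real gene finding HMMs have many more states and parameters estimated from annotated sequences.

```python
from math import log

STATES = ("noncoding", "coding")
# Assumed toy parameters, not taken from any real gene finder.
START = {"noncoding": 0.8, "coding": 0.2}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

def viterbi(sequence):
    """Return the most probable hidden state for each nucleotide."""
    scores = {s: log(START[s]) + log(EMIT[s][sequence[0]]) for s in STATES}
    backpointers = []
    for base in sequence[1:]:
        new_scores, pointers = {}, {}
        for state in STATES:
            # Best previous state from which to enter the current state.
            prev = max(STATES, key=lambda p: scores[p] + log(TRANS[p][state]))
            new_scores[state] = (scores[prev] + log(TRANS[prev][state])
                                 + log(EMIT[state][base]))
            pointers[state] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATATGCGCGCGCATATAT")
print(path)
```

The hidden variable here takes only two values; the decoded path labels the GC-rich stretch in the middle of the sequence as coding.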
In prokaryotes, it is still common to locate genes by simply looking for
open reading frames (ORFs). This is certainly not adequate for higher
eukaryotes. To distinguish between coding and noncoding regions in higher
eukaryotes, exon content sensors are used. These rely on statistical models
of the nucleotide frequencies and dependencies that are present in codon
structure.
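A simple ORF scan of the kind used for prokaryotic sequences can be sketched as below. The function name, the minimum length and the example sequence are illustrative assumptions; the sketch scans only the three forward reading frames and uses the standard start codon ATG and stop codons TAA, TAG and TGA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ORF on the forward strand.
    start is the index of the A of ATG, end is the index just past the
    stop codon, and min_codons counts codons before the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                     # close the reading frame
    return orfs

example = "CCATGAAACCCGGGTAGTT"   # hypothetical sequence
orfs = find_orfs(example)
print(orfs, example[2:17])
```

A complete scan would repeat the search on the reverse complement to cover the remaining three reading frames.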
Table 4.3: Computational gene finding databases and genefinders

Datasets and genefinders    Accession sites

1. Genefinding datasets
a) Single genes             http://www.cbcb.umd.edu/research/genefinding.shtml
b) Annotated contigs        ftp://www-hgc.ilb.gov/pub/genesets/
                            http://igs-server.cors-mrs.fr/banbury/index/hyml
c) HMM-based gene finders
   Genie                    http://www.fruitfly.org/seq_tools/genie.html
   Genscan                  http://genes.mit.edu/GENSCANinfo.html,
                            http://genes.mit.edu/GENSCAN.html
   HMMgene                  http://www.cbs.dtu.dk/services/HMMgene/
   GenMark                  http://opal.biology.gatech.edu/GeneMark/
   Pirate                   http://www.cbcb.umd.edu/software/pirate/
d) Other gene finders
   AAT                      http://aatpackage.sourceforge.net/
   FGENEH                   http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind
   GENEID                   http://genome.crg.es/geneid.html
   GeneParser               http://beagle.colorado.edu/~eesnyder/geneparser.html
   Glimmer                  http://www.cbcb.umd.edu/software/glimmer/
   Grail                    http://grail.lsd.ornl.gov/grailexp/
   Procrusters              http://www-hto.usc.edu/software/procrusters
   GENE FINDING             http://www.molquest.com/molquest.phtml?group=index&topic=gfind
                            http://www.biologie.uni-hamburg.de/b-online/library/genomeweb/GenomeWeb/nuc-geneid.html
Levels of Gene Expression
The human genome is complex, consisting of about 3 billion base pairs (bp) of
DNA. Yet only 3% of the DNA is coding sequence (i.e. that part of the genome
that is transcribed and translated into protein). The rest of the genome consists
of areas necessary for compact storage of the chromosomes, replication at cell
division, the control of transcription, and so on. A large part of the work of
sequence analysis is centered on analyzing the products of the transcription/
translation machinery of the cell, i.e. protein sequences and structures.
Recently much industrial emphasis has been placed on the study of
mRNA; this is partly because a conceptual translation into protein sequence
can be generated readily, but the main reason is that mRNA molecules
represent the part of the genome that is expressed in a particular cell type at a
specific stage in its development.
Thus, in simple terms, we have three levels of genomic information: (i) the
chromosomal genome (genome) – the genetic information common to every cell
in the organism, (ii) the expressed genome (transcriptome) – the part of the
genome that is expressed in a cell at a specific stage in its development and (iii)
the proteome – the protein molecules that interact to give the cell its individual
character.
For each level, different analytical tools and interpretative skills are
required. Cells express a different range of genes at various stages during their
development and functioning. This characteristic range of gene expression is
the expression profile of the cell.
By capturing the cell’s expression profiles we can build up a picture of
what levels of gene expression may be normal or abnormal and what the
relative expression levels are between different genes within the same cell. This
process also provides a rapid approach to gene discovery that complements
full-blown genome sequencing projects.
Capturing Expression Profile

The procedure for capturing an expression profile is as follows: First a sample
of cells is obtained; then RNA is extracted from the cells and is stabilized by
using reverse transcriptase to run off cDNA from the RNA template. The
cDNA is transformed into a library (a cDNA library) suitable for use in rapid
sequencing experiments.
A sample of clones is selected from the library at random, e.g. 10,000
from a library with a complexity of 2 million clones. A substantial automated
sequencing operation is required to produce 10,000 sequencing reactions, and
then to run these on automated sequencers. The resulting data are
downloaded to computers for further analysis.
The ideal result is a set of 10,000 sequences each between 200 and 400
bases in length, representing part of the sequence of each of the 10,000 clones.
In reality, some sequencing runs will fail altogether, some will fail to produce
sufficient sequence data and some will fail to produce data of appropriate
quality. The sequences that emerge successfully from this process are called
Expressed Sequence Tags (ESTs).
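The sampling and filtering steps can be sketched in a few lines. The function names, the read names and the limit on ambiguous base calls are assumptions made for illustration; only the sample size, the library complexity and the 200-base lower length come from the description above.

```python
import random

def sample_clones(library_size, sample_size, seed=0):
    """Pick clone indices at random, without replacement, from the library."""
    rng = random.Random(seed)
    return rng.sample(range(library_size), sample_size)

def select_ests(reads, min_length=200, max_ambiguous=0.05):
    """Keep reads that are long enough and have few ambiguous (N) base calls.
    reads is a list of (clone_name, sequence) pairs."""
    ests = []
    for name, sequence in reads:
        if len(sequence) < min_length:
            continue      # run produced insufficient sequence data
        if sequence.upper().count("N") / len(sequence) > max_ambiguous:
            continue      # run produced data of inappropriate quality
        ests.append((name, sequence))
    return ests

clones = sample_clones(2_000_000, 10_000)

# Hypothetical sequencing reads from three of the sampled clones.
reads = [
    ("clone1", "ACGT" * 60),              # 240 bases, clean
    ("clone2", "ACGT" * 10),              # too short
    ("clone3", "N" * 50 + "ACGT" * 50),   # too many ambiguous calls
]
ests = select_ests(reads)
print([name for name, _ in ests])
```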
ESTs are submitted to GenBank, EMBL and DDBJ. ESTs can be accessed
through all these databases. The same ESTs are available from NCBI’s dbEST.
4.9 HUMAN GENOME PROJECT

A genome is the entire DNA in an organism. Robert Sinsheimer, a molecular
biologist by training, made the first proposal for the Human Genome Project
(HGP) in 1985. While he was the chancellor of the University of California, he
organized a scientific meeting to discuss the possibility of the project.
Charles DeLisi, Head, Division of Health and Environmental Research,
Department of Energy (DOE) came to know about the HGP proposal and
became an avid supporter of the project. In 1986, DeLisi convened a meeting of
scientists who were in DNA research from laboratories at Livermore and Los
Alamos in USA and suggested to them to carry out the project with a primary
goal of determining the nucleotide sequence of human genome.
Owing to legal problems, the National Academy of Sciences appointed a
committee and the committee suggested that both DOE and the National
Institutes of Health (NIH) should be involved with a common advisory board.
In 1987, under the leadership of James Wyngaarden, NIH secured $17.4 million
for the project. James D. Watson became the first director of the new ‘Office
of Human Genome Research’ (OHGR).
The OHGR appointed Norton Zinder as chairman of the Program Advisory
Committee on the Human Genome. In 1990, the office became a ‘center’ and
was called The National Center for Human Genome Research (NCHGR). In
1998, the NCHGR programme became a national project. It has been the largest
and most complex international collaboration, with funding from the
governments of the participating countries and many charitable societies
across the world.
The project goals are to:
• identify all the approximately 30,000 genes in human DNA
• determine the sequence of the 3 billion nucleotide base pairs of human DNA
• store information in databases
• develop tools for data analysis
• transfer related technologies to the private sector, and
• address the ethical, legal and social issues that may arise from the
project.
The first working draft of the entire human nuclear genome was
published in the February 2001 issues of the journals Nature and Science.
Owing to rapid technological advancement, the project was completed in April
2003 (even though 2005 was the projected year of completion) and the
complete high-quality reference sequence was made available to researchers
worldwide for practical applications.
Salient Features
A number of genes and their association with human diseases have also been
established. The content and some of the salient genetic features of the human
genome (Figure 4.14) are highlighted below:
• The human genome contains 3.2 billion nucleotide bases (A, C, T and
G)
• The sizes of the genes vary greatly. The average gene consists of 3000
bases. The largest known human gene is dystrophin (2.4 million bases)
• The functions are unknown for more than 50% of discovered genes.
• About 99.9% of the human genome sequence is the same in all people.
• About 2% of the genome encodes instructions for the synthesis of
proteins.
Human nuclear genome: 3200 million bases (Mb)
  Genes and related sequences: 1200 Mb
    Genes: 48 Mb
    Related sequences: 1152 Mb (pseudogenes, gene fragments, introns,
    untranslated regions)
  Intergenic DNA (junk DNA): 2000 Mb
    Interspersed repeats: 1400 Mb
      Long interspersed nuclear elements: 640 Mb
      Short interspersed nuclear elements: 420 Mb
      Long terminal repeats: 250 Mb
      Mobile DNA: 90 Mb
    Other intergenic regions: 600 Mb
      Short tandem repeats: 90 Mb
      Other repeats: 510 Mb
Fig. 4.14 Content of the human genome (Based on IHGSC April 2003)
• Repeat sequences (those which do not code for proteins) make up about
50% of the genome. (Repeat sequences are thought to maintain
chromosome structure and dynamics. By rearrangement they can create
entirely new genes or modify and reshuffle existing genes.)
• About 40% of the human proteins showed similarity with fruit-fly or
worm proteins.
• Genes appear to be spread randomly throughout the genome with vast
expanses of noncoding DNA in between
• Chromosome 1 (the largest human chromosome) has 2968 genes and
the Y chromosome (smallest human chromosome) has 231 genes.
• Candidate genes were identified for numerous diseases and disorders
including breast cancer, muscle disease, deafness and blindness.
• Single nucleotide polymorphism can occur in 3 million locations.
• Every 2kb contains a microsatellite (short tandem repeat)
(Anderson et al. have decoded the entire sequence of the human
mitochondrial genome. The circular, double-stranded genome contains 16569
base pairs and 37 genes. Among them, thirteen genes code for respiratory
complex proteins and the other 24 genes specify RNA molecules needed for the
expression of the mitochondrial genome.)
The ‘Periodic Table of Life’ developed from HGP will be beneficial to
everyone in many ways. James Watson and the joint NIH-DOE genome
advisory panel were against patenting the genes. They were of the view that
the public was paying for deciphering the genome and that the public must
therefore decide what to do with the information.
Scientists should also have access to all available gene data for the
advancement of genome research programs. In 1997, NIH established GenBank
and enabled everyone to access the information through the Internet. This
encouraged many to refrain from taking out patents on raw sequence data.
Benefits of Genome Research

The findings through various genome research programs will be beneficial in
the following areas:
Molecular Medicine
• to develop better disease diagnosis
• to detect genetic predispositions to diseases
• to design drugs based on molecular information and individual genetic
profiles
• useful for better gene therapy
Microbial Genomics
• to detect and treat pathogens speedily
• to develop new biofuels
• to protect citizens from biological and chemical warfare
• to clean up toxic waste safely and efficiently
Risk Assessment
• to evaluate the level of health risk in individuals who are exposed to
radiation or mutagens
• to detect pollutants and monitor environments
Anthropology and Evolution

• to study evolution due to germline mutations
• to study migration of different population groups
• to study mutations on the Y chromosome to trace lineage and migration
of males
DNA Identification
• to identify criminals whose DNA may match evidence left at crime
scenes
• to exonerate persons wrongly accused of crimes
• to establish paternity and other family relationships
• to identify endangered and protected species
• to detect bacteria and other organisms that may pollute environment
• to match donors with recipients in organ transplant programs
• to determine pedigree for seed or livestock breeds
Agriculture and Animal Science

• to grow disease-resistant and drought-resistant crops
• to achieve high productivity
• to breed farm animals
• to develop biopesticides
STUDY QUESTIONS
1. What does the basic DNA sequencing reaction consist of?
2. Describe how DNA sequencing is done.
3. What is the role of open reading frame?
4. How do you determine the sequence of a clone?
5. What are expressed sequence tags?
6. How is an expressed sequence tag sequenced?
7. What are the methods of protein sequencing?
8. What is DNA microarray?
9. How does a DNA microarray work?
10. Name some of the URLs that are used as internet resources for
microarray expression analysis.
11. How is protein expression analysis performed?
12. What are the approaches used for gene discovery?
13. Name several organisms whose genomes have been successfully
sequenced.
14. How will human genome project be of benefit to various researchers
and human beings?
15. What are the contents of the human nuclear genome?
16. What are the outcomes of human genome project?
17. Mention the various goals of the human genome project.
18. Why is it important to know about the human genome?
CHAPTER 5
Databases, Tools and their Uses
Today biological data are gathered and stored all over the world. In order to
interpret these data in a biologically meaningful way, we need special tools
and techniques. Databases and programs allow us to access the existing
information and to compare these data to find similarities and differences.
The various Internet based molecular biology databases have their own
unique navigation tools and data storage formats.
Given a sequence, or fragment of a sequence, how do we find sequences in
the database that are similar to it? Given a protein structure, or fragment,
how do we find protein structures in the database that are similar to it?
Given the sequence of a protein of unknown structure, how do we find
structures in the database that adopt similar 3D structures? Given a protein
structure, how do we find sequences in the database that correspond to
similar structures? Different data retrieval tools help to solve these problems.
5.1 IMPORTANCE OF DATABASES

A database is a logically coherent collection of related data with inherent
meaning built for certain application. It is composed of entries – discrete
coherent parcels of information. It is a general repository of information and
contains records to be processed by a program. Its contents can easily be
accessed, managed, and updated.
Databases can be searched or cross-referenced either over the Internet
or using downloaded versions on local computers or computer networks by
multiple users. The databases are electronic filing cabinets, a convenient and
efficient method of storing vast amount of information. They are assemblages
of analyzed biological information into central and shareable resources.
Databases are needed to collect and preserve data, to make data easy to
find and search, to standardize data representation and to organize data into
knowledge. The primary goals of databases are, (i) minimizing data
redundancy and (ii) achieving data independence.
Information available in these databases can be searched, compared,
retrieved and analyzed. Databases are essential for managing similar kinds of
data and developing a network to access them across the globe. A large
amount of biological information is available all over the world through the
World Wide Web, but the data are widely distributed and it is therefore
necessary for scientists to have efficient mechanisms for data retrieval.
If we are to derive maximum benefit from the deluge of sequence
information that is available today, we must establish, maintain and
disseminate databases, provide easy-to-use software to access the
information they contain, and design state-of-the-art analysis tools to
visualize and interpret the structural and functional clues hidden in the data.
Databases of nucleic acid and protein sequences maintain facilities for a
very wide variety of information retrieval and analysis operations such as
retrieval of sequences from the database, sequence comparison, translation of
DNA sequences to protein sequences, simple types of structure analysis and
prediction, pattern recognition and molecular graphics. Some examples of
such databases are Entrez (http://www.ncbi.nlm.nih.gov/Entrez/) and
OMIM. ExPASy is an information retrieval and analysis system
(http://www.expasy.ch).
Types of Databases
There are many different database types, depending both on the nature of the
information being stored and on the manner of data storage. Databases are
broadly classified into two types, namely, generalized databases and
specialized databases. Examples of generalized databases are DNA, protein,
carbohydrate or similar databases. Examples of specialized databases are
expressed sequence tags (EST), genome survey sequences (GSS), single
nucleotide polymorphism (SNP), sequence tagged sites (STS), or similar
databases. Other specialized databases include Kabat for immunology
proteins and Ligand for enzyme reaction ligands.
Generalized databases are again broadly classified into sequence
databases and structure databases. Sequence databases contain the
individual sequence records of either nucleotides or amino acids or proteins.
Structure databases contain the individual records of experimentally
solved structures of macromolecules (e.g. protein 3D structures).
Two principal types of databases are: (i) relational and (ii)
object-oriented. The relational database organizes the data into tables made
up of rows giving specific items in the database and columns giving the
features or attributes of those items. The object-oriented database includes
objects such as genetic maps, genes, or proteins which have an associated set
of utilities for analysis which help in identifying the relationships among
these objects.
Classification
More specifically databases can be classified into three types based on the
complexity of the data stored: (i) Primary database, (ii) secondary database and
(iii) composite database.
A primary database contains data in its original form, taken as such from
the source, e.g. GenBank for nucleotide sequences and SWISS-PROT for protein
sequences. Primary databases are also known as archival databanks. A
secondary database is a value-added database which contains some specific
annotated and derived information from the primary database, e.g. SCOP,
CATH, PROSITE. These are the derived databanks that contain information
collected from the archival databanks after analysis of their contents. A
composite database amalgamates a variety of different primary database
structures into one.
A redundant database is a database where more than one copy of each
sequence may be found. Databases constructed by using subsets of the
original database for reducing sampling bias are often referred to as non-
redundant databases.
Some databases that form specialized resources are called boutique
databases. They either have a species specific sequence data or contain
sequences obtained through a particular technique (e.g. Saccharomyces
genome database (SGD), Drosophila genome database, etc). In addition to
these, Bibliographic Databanks and the databanks of websites are also
available on the net.
Database Entries
Database entries comprise new experimental results, and supplementary
information or annotations. Annotations include information about the
source of data and the methods used to determine them. They identify the
investigators responsible for the discovery and cite relevant publications.
They provide links to connected information in other databanks. Curators in
databanks base their annotations on the analysis of the sequence by computer
programs.
To make sure that all the fundamental data related to DNA and RNA are
freely available, scientific journals require deposition of new nucleotide
sequences in the database as a condition for publication of an article. Similar
conditions apply to amino acid sequences, and to nucleic acid and protein
structures. EMBL (European Molecular Biology Laboratory) nucleotide
sequence database submission procedures are available at
http://www.ebi.ac.uk/embl/submission.
Sequence Formats
Many databases and software applications are designed to work with
sequence data, and this requires a standard format for inputting nucleic acid
and protein sequence information. Three of the most common sequence
formats are NBRF/PIR (National Biomedical Research Foundation/ Protein
Information Resource), FASTA and GDE. Each of these formats has facilities
not only for representing the sequence itself, but also for inserting a unique
code to identify the sequence and for making comments which may include
for example the name of the sequence, the species from which it was derived,
and an accession number for GenBank or another appropriate database.
NBRF/PIR format begins with either >P1; for protein or >N1; for nucleic
acid. FASTA format begins with only ‘>’, and the GDE format begins with ‘%’.
A feature table (lines beginning FT) is a component of the annotation of an
entry that reports properties of specific regions, for instance coding sequences
(CDS). The feature table may indicate regions that perform or affect function,
that interact with other molecules, that affect replication, that are involved in
recombination, that are a repeated unit, that have secondary or tertiary
structure and that are revised or corrected.
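The opening characters described above are enough to tell the three formats apart, as the following sketch shows. The function names and the example records are illustrative assumptions, and the FASTA reader handles only the simplest form of the format (a header line starting with ‘>’ followed by sequence lines).

```python
def detect_format(text):
    """Guess the sequence format from the first characters of the record."""
    lines = text.strip().splitlines()
    if not lines:
        return "unknown"
    first = lines[0]
    # Test NBRF/PIR first, because its marker also begins with '>'.
    if first.startswith(">P1;") or first.startswith(">N1;"):
        return "NBRF/PIR"
    if first.startswith(">"):
        return "FASTA"
    if first.startswith("%"):
        return "GDE"
    return "unknown"

def parse_fasta(text):
    """Return a dictionary mapping each sequence identifier to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else "unnamed"   # identifier code
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Hypothetical FASTA text with two records.
fasta_text = ">seq1 example sequence\nACGT\nTTGA\n>seq2\nGGCC\n"
sequences = parse_fasta(fasta_text)
print(detect_format(fasta_text), sequences)
```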
Database Record
A typical database record contains three sections:
(i) The header includes description of the sequence, its organism of origin,
allied literature references and cross links to related sequences in other
databases. Locus field contains a unique identifier summarizing the
function of the sequence in abbreviation and is followed by an
accession number in the Accession field. The organism field contains
the binomial of the organism and its full taxonomic classification.
(ii) The feature table contains a description of the features in the record like
coding sequences, exons, repeats, promoters, etc., for the nucleotide
sequences and domains, structure elements, binding sites, etc., for
protein sequences. If the feature table includes a coding DNA sequence
(CDS), links to the translated protein sequences are also mentioned in
the feature description.
(iii) The sequence (per se) is often more easily analyzed by the computer.
Database Management System

A database management system (DBMS) is software that allows databases to
be defined, constructed and manipulated. It is a set of programs that manages
any number of databases. The DBMS consists of a user interface for on-line
users and application developers, a database engine to manage the storage
and access of physical data on disk, and a data dictionary to record all
information about the database, such as schemas, index details and access
rights.
DBMS is responsible for (i) accessing data, (ii) inserting, updating and
deleting data, (iii) security, (iv) integrity, (v) logging, (vi) locking, (vii)
supporting batch and online programs, (viii) facilitating backups and
recoveries, (ix) optimizing performance, (x) maximizing availability, (xi)
maintaining the catalog and directory of database objects, (xii) managing the
buffer pools, and (xiii) acting as an interface to other systems’ programs.
DBMS provides data independence, data sharing, non-redundancy,
consistency, security and integrity.
Types
There are three traditional types of database management systems:
hierarchical, relational and network. Hierarchical and network models are
based on traversing data links to process a database. The data are represented
by a hierarchical structure and connections are defined and implemented by
physical address pointers within the records. They are typically used for large
mainframe systems.
Relational Database Management System

The relational database management system (RDBMS) has become popular largely
because of its simple data model. Data are presented as a collection of
relations. Each relation is depicted as a table. A row corresponds to a record
and a column corresponds to a field. Each table contains only one type of
record. Each record in a table has the same number of fields. The order of the
records within a table has no significance. Columns of the tables are attributes.
Each row of a table is uniquely identified by the data values (entities) from one
or more columns. The column that uniquely identifies each row is the primary
key.
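The table, row and primary key ideas above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, column names and the accession values are hypothetical, and SQLite stands in for any RDBMS.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sequence_entry (
        accession TEXT PRIMARY KEY,   -- uniquely identifies each row
        organism  TEXT NOT NULL,
        length    INTEGER NOT NULL
    )
    """
)

# Each row is one record; every record has the same fields (columns).
conn.execute("INSERT INTO sequence_entry VALUES ('ACC0001', 'Homo sapiens', 1200)")
conn.execute("INSERT INTO sequence_entry VALUES ('ACC0002', 'Mus musculus', 950)")

# A second row with the same primary key value is rejected by the RDBMS.
try:
    conn.execute("INSERT INTO sequence_entry VALUES ('ACC0001', 'Rattus norvegicus', 700)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM sequence_entry").fetchone()[0]
print(duplicate_rejected, row_count)
```

The rejected insert shows how the primary key keeps every row uniquely identifiable.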
Microsoft Access and Oracle are well-known RDBMSs. Microsoft
Access provides a graphical user interface that makes it very easy to define
and manipulate databases. Access allows one to work with various tab
options like tables, queries, forms and reports separately. Another RDBMS
called PostgreSQL is widely used on Linux systems. The RDBMS is based
on a mathematical notion, i.e. database operations are based on set theory.
The relational algebra provides a collection of operations to manipulate
relations. It supports the notion of a query or request to retrieve information
from a database in a set theoretic fashion. The relational calculus is a formal
query language. Instead of having to write a sequence of relational algebra
operations, we simply write a single declarative expression, describing the
results we want. The expressive power is similar to using relational algebra.
Many commercial languages that come these days are based on the relational
calculus; the famous one is the structured query language (SQL).
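The contrast between a sequence of relational algebra operations and a single declarative expression can be sketched as follows. The relation, its attribute order and the helper functions select and project are assumptions made for illustration; the declarative form is shown through SQL run in SQLite.

```python
import sqlite3

# A small relation: (name, organism, length). Values are made up.
rows = [("g1", "yeast", 900), ("g2", "yeast", 1500), ("g3", "human", 2100)]

# Relational algebra style: an explicit sequence of operations.
def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [r for r in relation if predicate(r)]

def project(relation, indices):
    """Projection: keep chosen attributes; the result is a set, so duplicates collapse."""
    return {tuple(r[i] for i in indices) for r in relation}

step1 = select(rows, lambda r: r[1] == "yeast")
step2 = select(step1, lambda r: r[2] > 1000)
algebra_result = project(step2, [0])

# Declarative style: one expression describing the wanted result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (name TEXT, organism TEXT, length INTEGER)")
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", rows)
sql_result = set(
    conn.execute(
        "SELECT name FROM gene WHERE organism = 'yeast' AND length > 1000"
    ).fetchall()
)

print(algebra_result == sql_result)
```

Both routes return the same set of tuples, which reflects the similar expressive power of the two formalisms.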
Structured Query Language

Structured Query Language (SQL) is a set of commands that gives access to a
database. SQL is a tool for organizing and retrieving data stored by a
computer database. SQL is a non-procedural language. This means that when
using SQL we have to specify what is to be done and not how to do it. It is a
high level language where one can get, modify, and manipulate information
from the database using common English words and phrases like select,
create, drop, update, insert, etc.
There are different types of commands:
(i) Data definition language (DDL): These commands create, delete and
modify database objects such as tables, views and index.
(ii) Data manipulation language (DML): These commands are used to
insert, delete and modify data.
(iii) Data query language (DQL): These are SELECT statements used for
retrieving data, which can be nested within DML commands.
(iv) Transaction control language (TCL): These commands are used to
maintain data integrity while modifying data.
(v) Data control language (DCL): These commands are used for creating
and maintaining databases, partitions and assigning users to tables
and other database objects.
(vi) Data retrieval language (DRL): These commands are used to retrieve
data from a table or more than one table.
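A minimal sketch of several of these command types, run through Python's sqlite3 module. The table and its contents are hypothetical apart from the 51 residues of insulin. SQLite has no user accounts, so data control commands such as GRANT and REVOKE are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create a database object (a table).
cur.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT, residues INTEGER)")

# DML: insert and modify data.
cur.execute("INSERT INTO protein VALUES ('P1', 'insulin', 51)")
cur.execute("INSERT INTO protein VALUES ('P2', 'example protein', 300)")
cur.execute("UPDATE protein SET residues = 310 WHERE id = 'P2'")

# TCL: commit makes the changes above permanent.
conn.commit()

# TCL: rollback undoes the uncommitted delete, protecting data integrity.
cur.execute("DELETE FROM protein")
conn.rollback()

# DQL: retrieve data with a SELECT statement.
names = [row[0] for row in cur.execute("SELECT name FROM protein ORDER BY id")]
residues_p2 = cur.execute("SELECT residues FROM protein WHERE id = 'P2'").fetchone()[0]
print(names, residues_p2)
```

In every statement the user states what is wanted and leaves the choice of how to find or change the rows to the DBMS.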
Data Mining and Knowledge Discovery

Biological databases continue to grow rapidly. A huge volume of data is
available for the extraction of high level information including the
development of new concepts, concept interrelationships and interesting
patterns hidden in the databases.
Data mining is the application of specific tools for pattern discovery and
extraction. Knowledge discovery is concerned with the theoretical and
practical issues of extracting high level information (knowledge) from
volumes of low level data. It combines techniques from databases, statistics
and artificial intelligence. Knowledge discovery comprises several data pre-
processing steps as well as data mining and knowledge interpretation steps.
The goals of knowledge discovery are verification, prediction and description
(explanation).
5.2 NUCLEIC ACID SEQUENCE DATABASES

The nucleic acid sequence databases are collections of entries. Each entry has
the form of a text file. A text file contains text that can be read by human
beings as well as by a computer. Each text file contains data and annotations
for a single contiguous sequence. Many entries are assembled from several
published papers reporting overlapping fragments of a complete sequence.
Each entry is divided into fields. Fields are used to create indices for
relational databases. Each field is essentially a table and the field values are
indices. Unique accession numbers are allotted.
The first nucleic acid sequence, that of a yeast tRNA with 77 bases, was
announced around 1964. There are three premier institutes in the world,
which constitute the International Nucleotide Sequence Database
Collaboration. These are (i) the National Center for Biotechnology Information
(NCBI), (ii) the European Molecular Biology Laboratory (EMBL), and (iii) the DNA
Data Bank of Japan (DDBJ). Data are stored and exchanged daily. The
databases contain not only sequences but also extensive annotations.
EMBL
The EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl) is
available at the EMBL European Bioinformatics Institute, UK. It contains a
large and freely accessible collection of nucleotide sequences and
accompanying annotations. Webin is the preferred tool for submission.
EMBL contains sequences from direct author submissions and genome
sequencing groups, and from the scientific literature and patent applications.
The database is produced in collaboration with DDBJ and GenBank; each of
the participating groups collects a portion of the total sequence data reported
worldwide, and all new and updated entries are then exchanged between the
groups. The rate of growth of DNA database has been following an
exponential trend, with a doubling time now estimated to be about 9-12
months.
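The effect of such a doubling time can be made concrete with a short calculation. This sketch assumes an idealized exponential model, size(t) = size(0) x 2^(t / T), where T is the doubling time; the function name and the three-year horizon are chosen only for illustration.

```python
def projected_size(initial_size, months, doubling_time_months):
    """Size after `months` under idealized exponential growth."""
    return initial_size * 2 ** (months / doubling_time_months)

# Relative growth over three years for the two ends of the quoted range.
growth_9 = projected_size(1.0, 36, 9)    # four doublings
growth_12 = projected_size(1.0, 36, 12)  # three doublings
print(growth_9, growth_12)
```

Under this model a database grows 8-fold to 16-fold in three years, depending on where in the 9-12 month range the doubling time falls.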
The format of EMBL entries is consistent with SWISS-PROT format.
Information can be retrieved from EMBL using the SRS (Sequence Retrieval
System); this links the principal DNA and protein sequence databases with
motif, structure, mapping and other specialist databases and includes links to
the MEDLINE facility. EMBL may be searched with query sequences via
EMBL’s web interfaces to the BLAST and FASTA programs.
DDBJ
The DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp) contains
expressed sequence tags (EST) and genome sequence data.
Procedure
Open the internet browser and type the URL: www.ddbj.nig.ac.jp. Pull down the
drop-down menu at the search option. Select protein or nucleotide. Type the
query in the text box. Note down the details from the query page which will
show the accession
number, description of the query, total number of base pairs, etc.
The DDBJ database is produced, maintained and distributed at the National
Institute of Genetics; sequences may be submitted to it from all corners of the
world by means of a web-based data-submission tool. The web is also used to
provide standard search tools such as FASTA and BLAST.
GenBank
GenBank from NCBI incorporates sequences from publicly available sources,
primarily from direct author submissions and large-scale sequencing projects.
Information can be retrieved from GenBank using the Entrez integrated
retrieval system. GenBank may be searched with user query sequences by
means of the NCBI’s Web interface to the BLAST suite of programs.
The increasing size of the database, coupled with the diversity of the data
sources available, has necessitated splitting the GenBank database into 17
smaller discrete divisions with a 3 letter code each (Table 5.1).
Table 5.1: The 17 subdivisions of GenBank database

Number  Subdivision  Sequence subset
1.      BCT          Bacterial
2.      PLN          Plant, fungal, algal
3.      INV          Invertebrate
4.      PRI          Primate
5.      ROD          Rodent
6.      MAM          Other mammalian
7.      VRT          Other vertebrate
8.      PHG          Bacteriophage
9.      VRL          Virus
10.     RNA          Structural RNA
11.     SYN          Synthetic
12.     UNA          Unannotated
13.     EST          Expressed Sequence Tags
14.     STS          Sequence Tagged Sites
15.     GSS          Genome Survey Sequences
16.     HTG          High-throughput Genomic Sequences
17.     PAT          Patent

A GenBank entry consists of a number of keywords, relevant associated
subkeywords, and an optional Feature Table; its end is indicated by a //
terminator. The positioning of these elements on any given line is important:
keywords begin in column 1; sub-keywords begin in column 3; a code defining
part of the Feature Table begins in column 5. Any line beginning with a blank
character is considered a continuation from the keyword or sub-keyword
above. Keywords include LOCUS, DEFINITION, NID, SOURCE, REFERENCE,
FEATURES, BASE COUNT and ORIGIN. Most submissions are made using the
web-based BankIt or standalone Sequin programs.
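The column rules above can be sketched as a small parser. This is a simplified illustration, not a full GenBank reader: the sample entry is hypothetical, the Feature Table is left out, and two-word keywords such as BASE COUNT are not handled.

```python
# Hypothetical entry laid out according to the column rules in the text.
SAMPLE_ENTRY = """\
LOCUS       EXAMPLE01     24 bp    DNA
DEFINITION  Hypothetical example sequence used only
            for illustration.
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 atgcatgcat gcatgcatgc atgc
//
"""

def parse_entry(text):
    """Collect keyword and sub-keyword values from one flat-file entry."""
    fields = {}
    keyword = None   # most recent keyword starting in column 1
    current = None   # field that continuation lines are appended to
    for line in text.splitlines():
        if line.startswith("//"):          # entry terminator
            break
        if not line.strip():               # skip empty lines
            continue
        if not line[0].isspace():          # keyword begins in column 1
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif line.startswith("  ") and not line[2].isspace():
            # sub-keyword begins in column 3; stored as KEYWORD/SUBKEYWORD
            sub, _, value = line.strip().partition(" ")
            current = f"{keyword}/{sub}"
            fields[current] = value.strip()
        elif current is not None:          # continuation of the line above
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields

entry = parse_entry(SAMPLE_ENTRY)
print(entry["DEFINITION"])
print(entry["SOURCE/ORGANISM"])
```

Because the meaning of a line depends only on the column in which its text starts, the parser needs no lookahead.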
The main purpose of the GenBank database is to provide the scientific
community with access to the most up-to-date and comprehensive DNA sequence
information, and to encourage its use.
GSDB
The Genome Sequence Data Base (GSDB) is produced by the National Center
for Genome Resources at Santa Fe, New Mexico. GSDB creates, maintains,
and distributes a complete collection of DNA sequences and related
information to meet the needs of major genome sequencing laboratories. The
format of GSDB entries is consistent with that of GenBank. The database is
accessible either via the web, or using relational database client-server
facilities.
The main sequence databases have a number of subsidiaries for the
storage of particular types of sequence data. dbEST is a division of GenBank
which is used to store expressed sequence tags (ESTs). dbGSS is used to store
single-pass genomic survey sequences (GSSs); dbSTS is used to store sequence
tagged sites (STSs) and HTG (high-throughput genomic) is used to store
unfinished genomic sequence data. OMIM (Online Mendelian Inheritance in
Man) is a comprehensive database of human genes and genetic disorders
maintained by NCBI.
Ensembl
Ensembl (http://asia.ensembl.org/index.html) is intended to be the universal
information source for the human genome. The goals are to collect and
annotate all available information about human DNA sequences, link it to the
master genome sequence and make it accessible to many scientists who will
approach the data with many different points of view and requirements. To
achieve this, in addition to collecting and organizing the information, very
serious effort has gone into developing computational infrastructure. The
program used to generate this resource, eMOTIF, is based on the generation
of consensus expressions from conserved regions of sequence alignments.
Ensembl is a joint project of the European Bioinformatics Institute and
the Sanger Centre. It is organized as an open project; it encourages outside
contributions. Data collected in Ensembl include genes, SNPs, repeats and
homologies. Genes may either be known experimentally, or deduced from the
sequence. Because the experimental support for annotation of the human
genome is so variable, Ensembl presents the supporting evidence for
identification of every gene. Very extensive linking to other databases
containing related information such as OMIM or expression databases is also
possible.
Specialized Genomic Resources

In addition to the comprehensive DNA sequence databases, a variety of more
specialized genomic resources also exists. The purpose of these specialized
resources is to bring a focus (a) to species-specific genomics, and (b) to
particular sequencing techniques. The Saccharomyces Genome Database
(SGD), the TDB (TIGR) database and the AceDB database are some examples. Here
is a list of web addresses for nucleotide sequence databases.
EMBL : http://www.ebi.ac.uk/embl/index.html
DDBJ : http://www.ddbj.nig.ac.jp/
GenBank : http://www.ncbi.nlm.nih.gov/genbank/
dbEST : http://www.ncbi.nlm.nih.gov/dbEST/
GSDB : http://www.ncgr.org/quick-jump/sequencing
SGD : http://www.yeastgenome.org/
UniGene : http://www.ncbi.nlm.nih.gov/unigene/
AceDB : http://www.sanger.ac.uk/software/Acedb/
Webace : http://www.acedb.org/Databases/
OMIM : http://www.ncbi.nlm.nih.gov/omim
5.3 PROTEIN SEQUENCE DATABASE

Most amino acid sequence data arise by translation of nucleic acid sequence.
The primary structure of a protein is its amino acid sequence; these are stored
in primary databases as linear alphabets that denote the constituent residues.
The secondary structure of a protein corresponds to regions of local regularity,
which in sequence alignments are often apparent as well-conserved motifs;
these are stored in secondary databases as patterns (e.g. regular expressions,
fingerprints, blocks, profiles, etc.). The tertiary structure of a protein arising
from the packing of its secondary structure elements may form discrete
domains within a fold, or may give rise to autonomous folding units or
modules stored in structure databases as sets of atomic coordinates.
The first protein to be sequenced was insulin in 1956 and its sequence
consisted of 51 residues. From the beginning of 1980, sequence information
started to become more abundant in scientific literature. Hence, several
laboratories started to harvest and store these sequences in central repositories.
Many primary database centers evolved in different parts of the world.
The Protein Sequence Database was developed at the National Biomedical
Research Foundation (NBRF) at Georgetown University in the early 1960s by
Margaret Dayhoff as a collection of sequences for investigating evolutionary
relationships among proteins. Since 1988, the Protein Sequence Database has
been maintained collaboratively by PIR-International, an association of
macromolecular sequence data collection centers consisting of the Protein
Information Resource (PIR) at the NBRF, the Japan International Protein
Information Database (JIPID), and the Martinsried Institute of Protein
Sequences (MIPS). The MIPS collects and processes sequence data for
PIR-International.
PIR Databases
The PIR is an effective combination of a carefully curated database, information
retrieval access software and a workbench for investigations of sequences. The
PIR also produces the Integrated Environment for Sequence Analysis (IESA).
Its functionality includes browsing, searching and similarity analysis and
links to other databases.
The PIR maintains several databases about proteins:
(a) PIR-PSD: The main protein sequence database
(b) iProClass: classification of proteins according to structure and function.
(c) ASDB: annotation and similarity database; each entry is linked to a list
of similar sequences.
(d) PIR-NREF: a comprehensive non-redundant collection of over 800,000
protein sequences merged from all available sources.
(e) NRL3D: a database of sequences and annotations of proteins of known
structure deposited in the Protein Data Bank.
(f) ALN: a database of protein sequence alignments.
(g) RESID: a database of covalent protein structure modifications.
The PIR database is split into four distinct sections, designated as PIR1, PIR2,
PIR3 and PIR4. They differ in terms of the quality of data and levels of
annotation provided; PIR1 includes fully classified and annotated entries;
PIR2 contains preliminary entries, which have not been fully reviewed and
which may contain redundancy. PIR3 includes unverified entries, which
have not been reviewed; and PIR4 entries fall into one of the following four
categories: (i) conceptual translations of artefactual sequences, (ii) conceptual
translations of sequences that are not transcribed or translated, (iii) Protein
sequences or conceptual translations that are extensively genetically
engineered; and (iv) sequences that are not genetically encoded and not
produced on ribosomes. Programs are provided for data retrieval and
sequence searching via the NBRF-PIR database Web Page.
SWISS-PROT
The Swiss Institute of Bioinformatics (SIB) collaborates with the EMBL Data
Library to provide an annotated database of amino acid sequences called
SWISS-PROT. SWISS-PROT is a curated protein sequence database which
strives to provide high-level annotations, including descriptions of the
function of the protein and of the structure of its domains, its post-
translational modifications, variants and so on with a minimal level of
redundancy and high level of integration with other databases. SWISS-PROT
is interlinked to many other resources. The structure of the database and the
quality of its annotations set SWISS-PROT apart from other protein
sequence resources and has made it the database of choice for most research
purposes.
Entries start with an identification (ID) line and finish with a //
terminator. ID codes in SWISS-PROT are designed to be informative and
people-friendly; they take the form PROTEIN_SOURCE, where the
PROTEIN part of the code is an acronym that denotes the type of protein,
and SOURCE indicates the organism name. Since ID codes can sometimes
change, an additional identifier, an accession number, is also provided, which
will remain static between database releases. The accession number is
provided on the AC line, which is computer readable. If several numbers
appear on the same AC line, the first or primary accession number is the
most current.
The DT lines provide information about the date of entry of the
sequence to the database, and details of when it was last modified. The DE
(description) line informs us of the name by which the protein is known. The
following lines give the gene name (GN), the organism species (OS) and
organism classification (OC) within the biological kingdom. The next section
of the database provides a list of supporting references; these can be from the
literature, unpublished information submitted directly from sequencing
projects, data from structural or mutagenesis studies and so on.
Following the references, the comment (CC) lines are found. These are
divided into themes, which tell us about the function of the protein, its post-
translational modifications, its tissue specificity, cellular location and so on.
The CC lines also point out any known similarity or relationship to particular
protein families. Database cross-reference (DR) lines follow the comment field.
These provide links to other biomolecular databases, including primary
sources, secondary databases, specialist databases, etc.
Immediately after the DR lines a list of relevant keywords (KW) is seen,
and then a number of FT lines can be found. The FT lines highlight regions of
interest in the sequence, including local secondary structure (such as
transmembrane domains), ligand binding sites, post-translational
modifications and so on. Each line includes a key, the location in the
sequence of the feature, and a comment, which might, for example, indicate
the levels of confidence of a particular annotation.
The final section of the database entry contains the sequence itself on
the SQ lines. Only single letter amino acid code is used. The structure of
SWISS-PROT makes computational access to the different information fields
both straightforward and efficient.
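The line-code layout described above lends itself to simple parsing. The sketch below assumes each line starts with a two-letter code followed by its content from column 6, that sequence lines after SQ start with blanks, and that AC entries are separated by semicolons; the entry itself, including its ID and accession numbers, is hypothetical.

```python
from collections import defaultdict

# Hypothetical entry; the identifiers do not refer to real records.
SAMPLE_ENTRY = """\
ID   EXMP_HUMAN     STANDARD;      PRT;    12 AA.
AC   X00001; X00002;
DE   Hypothetical example protein.
OS   Homo sapiens (Human).
KW   Example.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR LG
//
"""

def parse_entry(text):
    """Group lines by their two-letter code and extract a few key fields."""
    lines_by_code = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):              # entry terminator
            break
        code, content = line[:2], line[5:].strip()
        if code == "SQ":
            in_sequence = True
            lines_by_code[code].append(content)
        elif in_sequence and not code.strip():
            # sequence data: single-letter codes in blocks separated by blanks
            sequence_parts.append(content.replace(" ", ""))
        else:
            lines_by_code[code].append(content)

    id_code = lines_by_code["ID"][0].split()[0]
    protein, _, source = id_code.partition("_")  # PROTEIN_SOURCE form
    accessions = [a.strip() for a in " ".join(lines_by_code["AC"]).split(";") if a.strip()]
    return {
        "protein": protein,
        "source": source,
        "primary_accession": accessions[0],      # first number on the AC line
        "description": " ".join(lines_by_code["DE"]),
        "sequence": "".join(sequence_parts),
    }

record = parse_entry(SAMPLE_ENTRY)
print(record["protein"], record["source"], record["primary_accession"])
print(record["sequence"], len(record["sequence"]))
```

Since every field is announced by its own line code, a program can pick out just the fields it needs without interpreting the rest of the entry.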
TrEMBL
TrEMBL (translated EMBL) was designed in 1996 as a computer-annotated
supplement to SWISS-PROT. The database benefits from the SWISS-PROT
format, and contains translations of all coding sequences in EMBL. TrEMBL
has two main sections, designated as SP-TrEMBL and REM-TrEMBL; SP-
TrEMBL (SWISS-PROT TrEMBL) contains entries that will eventually be
incorporated into SWISS-PROT, but that have not yet been manually
annotated; REM-TrEMBL contains sequences that are not destined to be
included in SWISS-PROT; these include immunoglobulins and T-cell
receptors, fragments of fewer than eight amino acids, synthetic sequences,
patented sequences, and codon translations that do not encode real proteins.
TrEMBL was designed to allow very rapid access to sequence data from
the genome projects, without having to compromise on the quality of SWISS-
PROT itself by incorporating sequences with insufficient analysis and
annotation.
PIR is the most comprehensive resource, but the quality of its
annotations is still relatively poor. SWISS-PROT is a highly structured
database that provides excellent annotations, but its sequence coverage is poor
compared to PIR.
NRL-3D
The NRL-3D database is produced by PIR from sequences extracted from the
Protein Data Bank (PDB). The titles and biological sources of the entries
conform to the nomenclature standards used in the PIR. Bibliographic
references and MEDLINE cross references are included, together with
secondary structure, active site, binding site and modified site annotations,
and details of experimental methods, resolution, R-factor, etc. Keywords are
also provided.
NRL-3D is a valuable resource, as it makes the sequence information in
the PDB available both for keyword interrogation and for similarity searches.
The database may be searched using the ATLAS retrieval system, a multi-
database information retrieval program specifically designed to access
macromolecular sequence databases.
5.4 STRUCTURE DATABASES
Structure databases archive, annotate and distribute sets of atomic
coordinates. They store a collection of three-dimensional biological
macromolecular structures of proteins and nucleic acids. The best established
database for protein structures is the Protein Data Bank (PDB). The website is
http://www.rcsb.org/pdb/home/home.do
This is the single world-wide repository of structural data and is
maintained by the Research Collaboratory for Structural Bioinformatics (RCSB)
at Rutgers University, New Jersey, USA. (The associated nucleic acid
databank (NDB) is also maintained here.) An equivalent European database
is the Macromolecular Structure Database (MSD) maintained by the
European Bioinformatics Institute. The website for MSD is
http://www.ebi.ac.uk/Databases/structure.html. RCSB and MSD databases contain
the same data.
The PDB entry normally contains the following information: the name
of the protein, the species it comes from, who solved the structure, references
to publications, describing the structure determination, experimental details
about the structure determination, the amino acid sequence, any additional
molecules and atomic coordinates. MSD includes a search tool called OCA,
which is a browser database for protein structure and function, integrating
information from numerous databanks. Another useful information source
available at the EBI is the database of Probable Quaternary Structures (PQS)
of biologically active forms of proteins.
Structural Classifications
Many proteins share structural similarities, reflecting, in some cases,
common evolutionary origins. The evolutionary process involves
substitutions, insertions and deletions in amino acid sequences. For distantly
related proteins, such changes can be extensive, yielding folds in which the
numbers and orientations of secondary structures vary considerably.
However, where, for example, the functions of proteins are conserved, the
structural environments of critical active site residues are also conserved. With
a view to better understanding sequence-structure relationships, structure
classification schemes have been developed.
Several websites offer hierarchical classifications of the entire PDB
according to the folding patterns of the proteins.
(i) SCOP : Structural Classification of Proteins
(ii) CATH : Class/ Architecture/ Topology/ Homology
(iii) DALI : Based on extraction of similar structure from distance matrices.
(iv) CE : a database of structural alignments.
SCOP Database
The SCOP database describes structural and evolutionary relationships
between proteins of known structure. Since current automatic structure
comparison tools cannot reliably identify all such relationships, SCOP has
been designed using a combination of manual inspection and automated
methods. Proteins are classified in a hierarchical fashion to reflect their
structural and evolutionary relatedness. Within the hierarchy there are many
levels, but principally these describe the family, superfamily and fold.
Proteins are clustered into families with clear evolutionary relationships
if they have sequence identities of more than 30%. Proteins are placed in
superfamilies when, in spite of low sequence identity, their structural and
functional characteristics suggest a common evolutionary origin. Proteins are
suggested to have a common fold if they have the same major secondary
structures in the same arrangement and with the same topology, whether or
not they have a common evolutionary origin. SCOP is accessible for keyword
interrogation via the MRC Laboratory Web Server.
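The 30% identity criterion can be illustrated with a simple calculation on two aligned sequences. The convention used here (identical residues divided by the number of aligned, non-gap positions) is an assumption, since several definitions of percent identity are in use; the sequences are made up and the threshold is taken from the text.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                # gapped columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

FAMILY_THRESHOLD = 30.0             # percent, as quoted for SCOP families

# Two short, made-up aligned fragments.
identity = percent_identity("MKTAYIAK-R", "MKSAYLAKQR")
same_family_by_identity = identity > FAMILY_THRESHOLD
print(round(identity, 1), same_family_by_identity)
```

Sequence identity alone cannot place proteins in a superfamily or fold, which is why manual inspection of structure and function enters at those levels.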
CATH Database
The CATH (class, architecture, topology, homology and sequence) database is
largely derived using automatic methods, but manual inspection is necessary
where automatic methods fail. Different categories within the classification
are identified by means of both unique numbers and descriptive names.
There are five levels (class, architecture, topology, homology and sequence)
within the hierarchy.
Class is derived from gross secondary structure content and packing.
Architecture describes the gross arrangement of secondary structures.
Topology gives a description that encompasses both the overall shape and the
connectivity of secondary structures. Homology groups domains that share
more than 35% sequence identity and are thought to share a common ancestor.
Sequence provides the final level within the hierarchy whereby structures
within homology groups are further clustered on the basis of sequence
identity. CATH is accessible for keyword interrogation via UCL’s Biomolecular
Structure and Modeling Unit Web server.
CATH database is a protein structure database residing at University
College, London. Proteins are classified first into hierarchical levels by class,
similar to the SCOP classification except that α/β and α + β proteins are
considered to be in one class. Instead of a fourth class for α + β proteins, the
fourth class of CATH comprises proteins with few secondary structures.
Following class, proteins are classified by architecture, fold, superfamily and
family.
Composite Databases
A composite database is a database that amalgamates a variety of different
primary sources. Composite databases render sequence searching much
more efficient, because they obviate the need to interrogate multiple
resources. The interrogation process is streamlined still further if the composite
has been designed to be non-redundant, as this means that the same sequence
need not be searched more than once.
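A minimal sketch of how a non-redundant composite might be assembled. The redundancy criterion assumed here is exact sequence identity, with earlier sources taking priority; real composites apply their own, often different, criteria. The source names and entries are hypothetical.

```python
def build_composite(sources):
    """Merge sources in priority order, keeping one copy of each identical sequence.

    `sources` is a list of (source_name, [(entry_id, sequence), ...]) pairs.
    Returns a dict mapping each unique sequence to the (source, entry_id) kept.
    """
    composite = {}
    for source_name, entries in sources:
        for entry_id, sequence in entries:
            key = sequence.upper()
            if key not in composite:       # first (highest priority) copy wins
                composite[key] = (source_name, entry_id)
    return composite

# Hypothetical primary sources with one sequence in common.
sources = [
    ("source_A", [("A1", "MKTAYIAKQR"), ("A2", "GAVLIPFMW")]),
    ("source_B", [("B1", "mktayiakqr"), ("B2", "STCYNQDEKRH")]),
]

composite = build_composite(sources)
total_entries = sum(len(entries) for _, entries in sources)
print(total_entries, len(composite))
```

A search against the composite examines three sequences instead of four, and the shared sequence is attributed to the higher priority source.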
Different strategies can be used to create composite resources. The final
product depends on the chosen data sources and the criteria used to merge
them. The choice of different sources and the application of different
redundancy criteria have led to the emergence of different composites, each
of which has its own particular format. The main composite databases are
NRDB, OWL, MIPSX and SWISS-PROT + TrEMBL.
NRDB (Non-Redundant Database) is comprehensive and contains up-
to-date information. OWL is a non-redundant protein database with a
priority with regard to the level of annotation and sequence validation.
MIPSX database contains information of only unique copies. SWISS-PROT +
TrEMBL provide a resource that is both comprehensive and minimally
redundant.
NDB Database
The Nucleic acid structure Database (NDB) (http://ndbserver.rutgers.edu/)
assembles and distributes structural information about nucleic acids. In
addition to information regarding nucleic acids it maintains a DNA-binding
protein database. Available information includes coordinates and structure
factors, an archive of nucleic acid standards and an atlas of nucleic acid
containing structures that highlight special aspects of each structure in the
NDB. It also maintains information regarding intrinsic correlations between
structural parameters.
CSD Database
The Cambridge Structural Database (CSD) contains comprehensive structural data
for organic and organometallic compounds studied by X-ray and neutron
diffraction. It contains 3D atomic coordinate information as well as associated
bibliographic, chemical and crystallographic data. It is equipped with
graphical, search, retrieval, data manipulation and visualization software.
BMRB Database
BioMagResBank (BMRB) contains data from NMR studies of proteins,
peptides and nucleic acids (www.bmrb.wisc.edu). It is used to deposit the
data that is used to derive the NMR restraints and the coordinates deposited
into the PDB. It contains NMR parameters that are measures of flexibility and
dynamics. It also contains data on measured NMR parameters such as
chemical shifts, coupling constants, dipolar couplings, T1 values, T2 values,
heteronuclear NOE values, S² (order parameters), hydrogen exchange rates
and hydrogen exchange protection factors.
3Dee and FSSP databases

3Dee is a database of protein domain definitions. FSSP (fold classification
based on structure-structure alignment of proteins) database is based on
automatic all-against-all 3D structure comparisons of all the entries of the
PDB.
The FSSP database contains a set of representative folds for all the
structures in the PDB. The representative folds are subjected to a hierarchical
clustering algorithm to construct a fold tree based on structural similarities.
The FSSP database is based on structure alignment of all pair-wise
combinations of the proteins in the Brookhaven structural database by the
structural alignment program DALI.
Other Databases
Molecular Modeling Database (MMDB) is a database containing
experimentally determined structures extracted from PDB. Its organization is
based on the concept of neighbors, i.e. links to sequential and structural
neighbors. MMDB categorizes proteins of known structure in the
Brookhaven PDB into structurally related groups by the VAST (Vector
Alignment Search Tool) structural alignment program. VAST aligns three
dimensional structures based on a search for similar arrangements of
secondary structural elements. MMDB provides a method for rapidly
identifying PDB structures that are statistically out of the ordinary.
Conserved Domain Database (CDD) is a database of conserved domain
alignments with links to three-dimensional structures of domains. Chemico-
physical AMino acidic Parameter databank (CHAMP) is an amino acidic
parameters data bank containing 32 different series of physico-chemical
parameters of amino acids. It is integrated with FAST. The Enzyme-Reaction
Database links a chemical structure to amino acid sequences of enzymes that
recognize the chemical structure as their ligand. The chemical structures and
chemical names are registered in the chemical-structure database on the
MACCS system.
The enzymes are registered in the database with NBRF-PIR entry codes.
The enzymes’ sequences in the database are divided into clusters and a
conserved sequence is extracted from each cluster using multiple sequence
alignment. These conserved sequences are used to construct motifs.
Thermodynamic Database for Proteins and Mutants (ProTherm) is a
collection of numerical data for studying the relationship between structure,
stability and function. It contains thermodynamic parameters such as
unfolding Gibbs free energy change, enthalpy change, heat capacity change,
transition temperature, etc. It also contains information about activity,
secondary structure, surface accessibility, measuring methods and
experimental conditions such as pH, temperature, and buffer ion and protein
concentration. ProTherm is linked with PIR, SWISS-PROT, PDB, PMD and
PubMed.
The SARF (spatial arrangement of backbone fragments) database also
provides a protein database categorized on the basis of structural similarity.
Secondary Databases
Primary database search tools are effective for identifying sequence
similarities, but analysis of output is sometimes difficult and cannot always
answer some of the more sophisticated questions of sequence analysis. Hence
secondary database search tools are used. Depending on the type of analysis
method applied to the secondary databases, relationships may be elucidated in
considerable detail, including superfamily, family, subfamily, and species-
specific sequence levels.
The principle behind the development of secondary databases is that
within multiple alignments, there are many conserved motifs that reflect
shared structural or functional characteristics of the constituent sequences.
The simplest approach to pattern recognition is to characterize a family by
means of a single conserved motif, and to reduce the sequence data within
the motif to a consensus or regular expression pattern. Regular expressions
are the basis of the PROSITE database.
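The reduction of a motif to a regular expression can be sketched in a few lines of Python. The sketch below assumes the commonly documented PROSITE pattern conventions (elements separated by hyphens, x for any residue, square brackets for allowed residues, curly braces for excluded residues, a repeat count in parentheses, and < or > to anchor a pattern to a sequence end). The function names and the test sequence are illustrative only and are not part of PROSITE itself.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")            # patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):            # repeat count such as (3) or (2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":                   # any residue
            core = "."
        elif element.startswith("{"):        # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                                # a single residue or a [..] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (0-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern applied to a made-up sequence.
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_motif("MKNGSAANPTG", "N-{P}-[ST]-{P}."))
```

Because a regular expression either matches or does not, a single substitution inside the motif is enough to miss a true family member, which is the limitation that the weighted methods described later try to overcome.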
Many secondary databases, which contain the fruits of analysis of the
sequences in the primary sources, are also available. Secondary databases
such as PROSITE, Profiles, PRINTS, Pfam, BLOCKS and IDENTIFY use
SWISS-PROT as their primary source. PROSITE stores regular expressions
(patterns); Profiles stores weighted matrices (profiles); PRINTS stores aligned
motifs (fingerprints); Pfam stores hidden Markov models (HMMs); BLOCKS
stores aligned motifs (blocks); and IDENTIFY stores fuzzy regular
expressions (patterns).
The type of information stored in each of the secondary databases is
different. Yet these resources have arisen from a common principle; namely,
that homologous sequences may be gathered together in multiple alignments,
within which are conserved regions that show little or no variation between
the constituent sequences. These conserved regions, or motifs, usually reflect
some vital biological role (i.e. are somehow crucial to the structure or
function of the protein).
One of the aims of sequence analysis is to design computational methods
that help to assign functional and structural information to uncharacterized
sequences; this is achieved by means of primary database searches, the goal of
which is to identify relationships with already known sequences. Within a
database, the challenge is to establish which sequences are related (true
positives) and which are unrelated (true negatives). To improve diagnostic
performance one has to capture most of the true-positive family members and
include few or no false positives.
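Diagnostic performance of this kind can be put into numbers once the true family members are known. The following is a minimal sketch: the accession-like identifiers are invented, and the two measures reported (sensitivity and precision) are standard choices assumed here rather than measures prescribed by any particular database.

```python
def diagnostic_performance(hits, family):
    """Compare the sequences returned by a search with the known family."""
    hits, family = set(hits), set(family)
    true_pos = hits & family          # family members that were found
    false_pos = hits - family         # unrelated sequences that were reported
    false_neg = family - hits         # family members that were missed
    sensitivity = len(true_pos) / len(family) if family else 0.0
    precision = len(true_pos) / len(hits) if hits else 0.0
    return {
        "true_positives": len(true_pos),
        "false_positives": len(false_pos),
        "false_negatives": len(false_neg),
        "sensitivity": sensitivity,
        "precision": precision,
    }


if __name__ == "__main__":
    known_family = ["P1", "P2", "P3", "P4"]    # hypothetical identifiers
    search_hits = ["P1", "P2", "P3", "X9"]     # one member missed, one wrong hit
    print(diagnostic_performance(search_hits, known_family))
```

A good discriminator keeps both values close to 1: high sensitivity means few family members are missed, and high precision means few false positives are included.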
PROSITE Database
PROSITE was the first secondary database to be developed. The rationale
behind its development was that protein families could simply and effectively
be characterized by the single most conserved motif observable in a multiple
alignment of known homologues, such motifs usually encoding key biological
functions (e.g. enzyme active sites, ligand or metal binding sites, etc.).
Searching such a database should, in principle, help to determine to which
family of proteins a new sequence might belong, or which domain or
functional site it might contain.
PRINTS Database
Most protein families are characterized not by one, but by several conserved
motifs. It therefore makes sense to use many, or all, of these to build diagnostic
signatures of family membership. This is the principle behind the development
of the PRINTS fingerprint database. Fingerprints inherently offer improved
diagnostic reliability over single-motif methods by virtue of the mutual context
provided by motif neighbours; in other words, if a query sequence fails to
match all the motifs in a given fingerprint, the pattern of matches formed by the
remaining motifs still allows the user to make a reasonably confident
diagnosis.
BLOCKS Database
A multiple-motif database, called BLOCKS, was created by automatically
detecting the most highly conserved regions of each protein family.
The limitations of regular expression in identifying distant homologues
led to the creation of a compendium of profiles. The variable regions between
conserved motifs also contain valuable sequence information. Here the
complete sequence alignment effectively becomes the discriminator.
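The idea of a weighted matrix derived from an aligned block can be illustrated with a small sketch. It assumes an ungapped block, a uniform background frequency for the 20 amino acids and a simple pseudocount; these are simplifying assumptions made for illustration and do not reproduce the weighting schemes actually used by BLOCKS or the Profiles database. All names and the example block are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block, pseudocount=1.0):
    """Build one column of log-odds scores per position of an ungapped block."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in the block must have the same length")
    background = 1.0 / len(AMINO_ACIDS)           # assumed uniform background
    profile = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(sequence, profile):
    """Slide the profile along the sequence; return (best score, 0-based start)."""
    width = len(profile)
    scored = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids contribute a score of 0.
        score = sum(col.get(res, 0.0) for col, res in zip(profile, window))
        scored.append((score, start))
    return max(scored)


if __name__ == "__main__":
    block = ["GKGS", "GKGA", "GKGT", "AKGS"]      # made-up aligned motif
    score, start = best_window("LNITIKSAGKGAMR", build_profile(block))
    print(start, round(score, 2))
```

Unlike a regular expression, the matrix tolerates an unusual residue at one position as long as the remaining positions score well.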
HMMs
An alternative to the use of profiles is to encode alignments in the form of
Hidden Markov Models (HMMs). These are statistically based mathematical
treatments, consisting of linear chains of match, delete or insert states that
attempt to encode the sequence conservation within aligned families. A
collection of HMMs for a range of protein domains is provided by the Pfam
database.
IDENTIFY, KEGG and MEDLINE Databases
Another automatically derived tertiary resource, built from BLOCKS and
PRINTS, is IDENTIFY. The Kyoto Encyclopedia of Genes and Genomes (KEGG)
is a database of metabolic pathways. It collects individual genomes, gene
products and their functions with biochemical and genetic information.
MEDLINE integrates the medical literature including very many papers
dealing with molecular biology. It is included in PubMed, a bibliographic
database offering abstracts of scientific articles.
Web Addresses:
GenBank : http://www.ncbi.nlm.nih.gov/genbank/
EMBL : http://www.ebi.ac.uk/embl/index.html
DDBJ : http://www.ddbj.nig.ac.jp/
PIR : http://www.pir.georgetown.edu/
MIPS : http://www.mips.biochem.mpg.de/
SWISS-PROT : http://pir.georgetown.edu/pirwww/dlinfo/nr13d.h
OWL : http://www.bioinf.man.ac.uk/dbbrowser/OWL/
PROSITE : http://www.expasy.ch/prosite/
PRINTS : http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
BLOCKS : http://www.blocks.fhcrc.org/
Profiles : http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
Pfam : http://www.sanger.ac.uk/software/Pfam/
IDENTIFY : http://dna.stanford.EDU/identify/
Proweb : http://www.proweb.org/kinetin/ProWeb.html
SCOP : http://scop.mrc-lmb.cam.ac.uk/scop/
CATH : http://www.biochem.ucl.ac.uk/bsm/cath/
5.5 BIBLIOGRAPHIC DATABASES AND VIRTUAL LIBRARY

Publication is at the core of every scientific endeavor. It is the common process
whereby scientific information is reviewed, evaluated, distributed and entered
into the permanent record of scientific progress. Bibliographic databases (also
known as literature databases or knowledge databases) contain published
articles, abstracts and selected free full-text papers, with links to individual
records. Though there are a number of literature databases, PubMed and
AGRICOLA are extensively used by scientists as they provide up-to-date
information with links to other resources.
PubMed
PubMed is maintained by the National Library of Medicine (US) and includes
the bibliographic database MEDLINE as well as links to selected full-text
articles on sites maintained by journal publishers. It offers abstracts of
scientific articles and is integrated with other information retrieval tools of the
National Center for Biotechnology Information. Scientific journals place their
tables of contents and, in some cases, entire issues on web sites. PubMed
records are relational in nature and query results include links to
GenBank, PDB, etc. PubMed can be searched at the following
websites:
http://www.ncbi.nlm.nih.gov/PubMed/
http://www.pubmedcentral.nih.gov
AGRICOLA
AGRICOLA stands for AGRICultural OnLine Access. It is a bibliographic
database of citations to the agricultural literature created by the National
Agricultural Library and its cooperators. It includes publications and
resources from all the disciplines related to agriculture, such as, veterinary
science, plant science, forestry, aquaculture and fisheries, food and human
nutrition, earth and environmental science. The database can be searched at
the following website: http://www.nal.usda.gov/ag98/
Virtual Library
A virtual library on the net provides access to web sites that are a storehouse of
information. It contains a collection of links to various online journals and
bibliographic databases. Virtual libraries can be classified into various groups
with links to online journals, bibliographic databases, institute library
access, forums and associations, tutorial sites, educational sites, grants and
funding resources, government and regulatory bodies, etc. The most famous
virtual library site on the web is: http://www.vlib.org
There are also further collections of virtual libraries on various topics
such as microbiology, biochemistry, etc. Many publishers have their own
online journals available on sites (e.g. Nature: www.nature.com). These sites
provide free access to the table of contents and abstracts.
5.6 SPECIALIZED ANALYSIS PACKAGES

Homology searching is only one aspect of the analysis process. Numerous
other research tools are also available, including hydropathy profiles for the
detection of possible transmembrane domains and/or hydrophobic protein
cores; helical wheels to identify putative amphipathic helices; sequence
alignment and phylogenetic tree tools for charting evolutionary relationships;
secondary structure prediction plots for locating α-helices and β-strands; and
so on.
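As an illustration of one such tool, a hydropathy profile can be computed as a sliding-window average of per-residue hydropathy values. The sketch below assumes the Kyte-Doolittle scale and a default window of 9 residues; the text does not name a particular scale or window, so both are assumptions (longer windows, of roughly 19 residues, are often used when looking for transmembrane segments). The function name and example sequence are illustrative.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Average hydropathy over each window of consecutive residues.

    Returns one value per window position; a sequence shorter than the
    window yields an empty list. Non-standard residue codes raise KeyError.
    """
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Made-up sequence: a hydrophobic stretch followed by a charged stretch.
    for position, value in enumerate(hydropathy_profile("LLLLLLLLLKKKKKKKKK")):
        print(position, round(value, 2))
```

Stretches where the averaged value stays strongly positive are candidates for hydrophobic cores or membrane-spanning regions and would normally be inspected as a plot.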
Because of the need to employ a range of techniques for effective sequence
analysis, software packages have been developed to bring a variety of these
methods together under a single umbrella, obviating the need to use different
tools with different interfaces, with different input requirements and different
output formats.
Major releases of DNA and protein sequence databases occur every three
to four months. In the meantime, newly determined sequences are added to
daily update files. To keep an in-house database up-to-date, synchronized FTP
scripts are used (e.g. run by scheduling software such as cron under UNIX).
With such a system, it is relatively simple to track individual databases, but it
becomes unwieldy when several databases (e.g. GenBank, EMBL, SWISS-
PROT, PIR) have to be monitored and merged with proprietary information.
Further, if new databases evolve, it is considered advantageous also to bring
them in-house; hence existing scripts must be updated to incorporate the new
resources.
There are a number of well-known packages that offer a fairly complete
set of tools for both DNA and protein sequence analysis. These suites have
evolved and grown to be fairly comprehensive over a period of years.
GCG Package
The most widely known commercially available sequence analysis software is
the GCG package (Oxford Molecular Group). It was developed by the Genetics
Computer Group at Wisconsin (575 Science Drive, Madison, Wisconsin, USA
53711), primarily as a set of analysis tools for nucleic acid sequences, but
in time it came to include additional facilities for protein sequence analysis.
Within GCG, many of the frequently used sequence databases can be
accessed (e.g. GenBank, EMBL, PIR and SWISS-PROT) as can a number of
motif and specialist databases (such as PROSITE; TFD, the transcription factor
database; and REBASE, the restriction enzyme database). A particular strength
of the system is that it can also be relatively easily customized to accept
additional, user-specific databases. Within the suite, EMBL and GenBank are
split into different sections, allowing users to minimize search time by
directing queries only to relevant parts of the databases. Thus, for example,
sequences in GenBank and EMBL may be searched either collectively or
separately or by defined taxonomic categories (e.g. viral, bacterial, rodent, etc.).
The sequence databases have their own distinct formats, so these must be
converted to the GCG format for use with its programs. Likewise, all data files
imported to the suite for analysis must adhere to the GCG format. The facilities
include tools for pairwise similarity searching, multiple sequence alignment,
evolutionary analysis, motif and profile searching, RNA secondary structure
prediction, hydropathy and antigenicity plots, translation, sequence assembly,
restriction site mapping and so on.
EGCG Package
EGCG, or Extended GCG, started at EMBL in Heidelberg as a collection of
programs to support EMBL’s research activities. There are more than 70
programs in EGCG, covering themes such as fragment assembly, mapping,
database searching, multiple sequence analysis, pattern recognition, nucleotide
and protein sequence analysis, evolutionary analysis, and so on.
Staden Package
The Staden Package is a set of tools for DNA and protein sequence analysis. It
does not provide databases, but the software works with the EMBL database
and other databases in a similar format. The package has a windowing
interface for UNIX workstations. Amongst its range of options, the suite
provides utilities to define and to search for patterns of motifs in proteins and
nucleic acids (for example, specific individual routines allow searching for
mRNA splice junctions, E. coli promoters, tRNA genes, etc. and users may
define equally complex patterns of their own). A particular strength of the
Staden Package lies in its support for DNA sequence assembly.
It provides methods for all the pre-processing required for data from
fluorescence-based sequencing instruments, including trace viewing (TREV),
quality clipping (PREGAP4) and vector removal (PREGAP4, VECTOR_CLIP);
a range of assembly engines; and powerful contig editing and finishing
algorithms (GAP4). A method for detecting point mutations is also provided
(TRACE_DIFF, GAP4). For analysis of finished DNA sequences, the package
includes NIP4, and for comparing DNA or protein sequences, SIP4; these
routines also provide an interface to the sequence libraries. The new interactive
programs TREV, PREGAP4, GAP4, NIP4 and SIP4 have graphical user interfaces,
but the package also contains a large number of older, but still useful,
programs that are text-based.
Lasergene Package
Lasergene is a PC-based package that provides facilities for coding analysis,
pattern and site matching, and RNA/DNA structure and composition
analysis; restriction site analysis; PCR primer and probe design; sequence
editing; sequence assembly and contig management; multiple and pairwise
sequence alignment (including dotplots); protein secondary structure
prediction and hydropathy analysis; helical wheel and net creation; and
database searching. Lasergene is available for Windows or Macintosh, for
single users or for networked-PC environments.
There are numerous other packages available, which tend to concentrate
on particular areas of sequence analysis of DNA. For example:
Sequencher Package
Sequencher is a sequence assembly package for the Macintosh, used by many
laboratories engaged in large-scale sequencing efforts. The package takes raw
chromatogram data and converts it into contig assemblies; other functions
include restriction site or ORF analysis, heterozygote analysis for mutation
studies, vector and transposon screening, motif analysis, silent mutation tools,
sequence quality estimation, and visual marking of edits to ensure data
integrity.
Vector NTI Package
Vector NTI for Windows 3.1, supported by the American Type Culture
Collection (ATCC) and InforMax, Inc., is a knowledge-based package designed
to expedite cloning applications. It can automatically optimize the design of
new DNA constructs and recommend cloning steps. The user can specify
preferences for processes such as fragment isolation, modification of termini and
ligation. The system incorporates about 3000 rules for genetic engineering.
MacVector Package
MacVector is a molecular biology system that exploits the Macintosh user
interface to create an easy-to-use environment for manipulation and analysis
of DNA and protein sequence data. The package implements the five BLAST
search functions, and includes ClustalW for sequence alignment, and an icon-
managed sequence editor that is integrated with the program’s molecular
biology functions (e.g. translation, restriction analysis, primer and probe
analysis, protein structure prediction, and motif analysis). Facilities are also
provided to compute predicted sequence-based melting curves for DNA and
RNA structures.
Intranet packages: The future for commercial solutions lies in providers
understanding the key issues facing the large industrial user. Most companies
now have intranets and support the use of HTTP and Internet Inter-ORB
Protocol (IIOP). Bioinformatics solutions must fit as seamlessly as
possible into this environment. Most companies need to implement integration
throughout the research operation. Most industrial bioinformatics teams
devote some resources to development and maintenance of internal web
servers that replicate the services available at public bioinformatics sites. Two
companies, NetGenics Inc. and Pangea Systems Inc., provide bioinformatics
systems that offer the prospect of service integration via the intranet.
SYNERGY
SYNERGY, developed by NetGenics, Inc., Cleveland, Ohio, takes an object-oriented
approach, using Java, CORBA and an object-oriented database to implement a
flexible environment for managing bioinformatics projects. SYNERGY
integrates standard tools into its portfolio through the use of CORBA
‘wrappers’, which present a streamlined interface between the tool and the
SYNERGY system. In this way, the developers are able to incorporate a
number of standard programs very rapidly and users of the system are able to
incorporate their own tools by implementing CORBA wrappers in-house.
Pangea Systems
GeneMill, GeneWorld and GeneThesaurus are products of Pangea
Systems Inc., Oakland, California. These are web-based tools that are
back-ended by a relational database. The overall system is aimed at
high-throughput sequencing projects and other large-scale industrial genomics
projects. It comprises GeneMill, a sequencing workflow database
system for managing sequencing projects; GeneWorld, a tool for analysis of
DNA and protein sequences; and GeneThesaurus, a sequence and annotation
data subscription service, allowing access to public data and integration with
proprietary data. The system is modular and allows interfaces to in-house
software to be built easily, using an open programming interface, PULSE
(Pangea’s Unified Life Science Environment).
EMBOSS Package
The European Molecular Biology Open Software Suite (EMBOSS) is an integrated
set of packages and tools for sequence analysis, developed specifically
for the needs of the Sanger Centre and the EMBnet user communities.
Applications of the package include EST clustering, rapid database searching
with sequence patterns, nucleotide sequence pattern analysis, codon usage
analysis, gene identification tools and protein motif identification.
Alfresco Package
Alfresco is a visualization tool that is being developed for comparative genome
analysis, using ACEDB for data storage and retrieval. The program compares
multiple sequences from similar regions in different species, and allows
visualization of results from existing analysis programs, including those for
gene prediction, similarity searching, regulatory sequence prediction, etc.
DALI Program
The DALI (Distance matrix ALIgnment) program is used to identify proteins with
folding patterns similar to that of a query structure. It was written by L. Holm
and C. Sander. It runs fast enough to carry out routine screens of the
entire Protein Data Bank for structures similar to a newly determined structure,
and even to perform a classification of protein domain structures from an all-
against-all comparison.
To meet the need for effective software techniques for data analysis, many
software packages have been developed. These packages are highly specific in
their approach and can be easily loaded as per the requirements of the user
(Table 5.2).
Table 5.2: Some well-known packages with a set of tools for DNA and protein
sequence analysis
Package : Scope
Staden : Analysis of DNA and protein sequences. It has a windowing
    interface for UNIX workstations.
GeneMill, GeneWorld, GeneThesaurus : GeneMill manages sequencing projects.
    GeneWorld analyses DNA and protein sequences. GeneThesaurus allows
    access to public data and integration with proprietary data.
Lasergene : Coding analysis, pattern and site matching, structure and
    composition analysis of RNA/DNA, restriction site analysis, PCR primer
    and probe design, sequence editing, sequence assembly, multiple and
    pairwise sequence alignment, helical wheel and net creation, and
    database searching.
Synergy : An object-oriented package that uses Java, CORBA and an
    object-oriented database to implement a flexible environment for
    managing bioinformatics projects.
CINEMA : Colour Interactive Editor for Multiple Alignments, an internet
    package written in Java. It provides facilities for motif
    identification, database searching (using BLAST), 3D structure
    visualization, generation of dotplots and hydropathy profiles, and
    six-frame translation.
EMBOSS : The European Molecular Biology Open Software Suite, specifically
    developed for easy integration of other public domain packages and for
    applications such as EST clustering, nucleotide sequence pattern
    analysis, codon usage analysis, gene identification, protein motif
    identification and rapid database searching with sequence patterns.
EGCG : An extended version of GCG (the package developed by the Genetics
    Computer Group, Wisconsin). It has more than 70 programs covering
    fragment assembly, mapping, database searching, multiple sequence
    analysis, pattern recognition, nucleotide and protein sequence
    analysis, evolutionary analysis, etc.
ExPASy : The SIB Bioinformatics Resource Portal, which provides access to
    scientific databases and software tools (i.e., resources) in different
    areas of life sciences including proteomics, genomics, phylogeny,
    systems biology, population genetics, transcriptomics, etc.
KEGG : A database resource for understanding high-level functions and
    utilities of the biological system, such as the cell, the organism and
    the ecosystem, from molecular-level information, especially large-scale
    molecular datasets generated by genome sequencing and other
    high-throughput experimental technologies.
5.7 USE OF DATABASES
The available information on the biological function of particular sequences in
model organisms may be exploited to predict the function of similar genes in
other organisms. The sequence of the gene of interest is compared to every
sequence in a sequence database, and the similar ones are identified. If a query
sequence can be readily aligned to a database sequence of known function,
structure or biochemical activity, the query sequence is predicted to have the
same function, structure or biochemical activity. As a rough rule, if more than
one-half of the amino acid sequence of query and database proteins is identical
in the sequence alignments, the prediction is very strong.
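The rough rule above can be checked on any pairwise alignment by computing the percentage of identical positions. The sketch below assumes that the alignment is given as two rows of equal length with "-" marking gaps, and that identity is counted over columns containing no gap; other definitions of the denominator are in use, so this is one reasonable convention rather than the only one. The function names and the 50 per cent default are illustrative.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of identical residues among columns that contain no gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)


def is_strong_prediction(row_a: str, row_b: str, threshold: float = 50.0) -> bool:
    """Apply the rough 'more than one-half identical' rule of thumb."""
    return percent_identity(row_a, row_b) > threshold


if __name__ == "__main__":
    query = "LGPSSKQTGKGSSRIWDN"      # 7 of these 18 columns are identical
    subject = "LNITIKSAGKGAMRLGDA"
    print(round(percent_identity(query, subject), 1))
    print(is_strong_prediction(query, subject))
```

An identity below the threshold does not rule out a relationship; it only means that the transfer of function from the database entry to the query should be treated with more caution.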
A common reason for performing a database search with a query
sequence is to find a related gene in another organism. For a query sequence of
unknown function, a matched gene may provide a clue to the function.
Alternatively, a query sequence of known function may be used to search
through sequences of a particular organism to identify a gene that may have
the same function.
Web addresses:
GCG : http://www.gcg.com/
EGCG : http://www.sanger.ac.uk/software.EGCG/
Staden : http://www.mrc-lmb.cam.ac.uk/pubseq/
NetGenics : http://www.netgenics.com/
Pangea Systems : http://www.pangeasystems.com/
CINEMA : http://www.bioinchem.ucl.ac.uk/bsm/dbbrowser/CENEMA2.1
EMBOSS : http://www.sanger.ac.uk/Software/EMBOSS/
Alfresco : http://www.sanger.ac.uk/Users/nic/alfresco.html
STUDY QUESTIONS
1. What are databases?
2. What are the types of databases?
3. What are the functions of databases?
4. What are the nucleic acid sequence databases? Give some examples.
5. What are protein sequence databases? Give some examples.
6. What are the protein sequence databases maintained by PIR?
7. What are structure databases? Give some examples.
8. What is a bibliographic database? Give some examples.
9. What is a virtual library?
10. Name some specialized analysis packages and describe their uses.
11. What is a database management system?
12. What are the types of database management systems?
13. What is data mining?
14. What are the goals of Ensembl?
CHAPTER 6
Sequence Alignment
The method used to analyze the similarities and differences at the level of
individual bases or amino acids with the aim of inferring structural,
functional and evolutionary relationships among the sequences is called
sequence alignment.
In simple words it is the identification of residue-residue
correspondence; any assignment of correspondence that preserves the order of
the residues within the sequences is an alignment.
The sequences of biological macromolecules are the products of
molecular evolution. When the sequences share a common ancestral
sequence, they tend to exhibit similarity in their sequences, structures and
biological functions. When a new sequence of unknown function is found
and similar sequences with known functional or structural information can
be found in the databases, that information can be used as the
basis for predicting the function or structure of the new sequence.
Sequence alignment is the procedure of comparing two (pairwise
alignment) or more (multiple sequence alignment) sequences by searching
for a series of individual characters or character patterns that are in the same
order in the sequences. Two sequences are aligned by placing them in two
rows. Identical or similar characters are placed in the same column.
Nonidentical or dissimilar characters are either placed in the same column as
a mismatch, or may be placed opposite a gap in the other sequence.
The advent of high-throughput automated fluorescent DNA sequencing
technology has led to the rapid accumulation of sequence information and
provides the basis for abundant computationally derived protein sequence
data. Analysis of DNA sequences can throw light on phylogenetic
relationships, restriction sites, intron/exon prediction and gene structure and
protein coding sequence through open reading frame analysis.
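Open reading frame analysis, mentioned above, can be sketched as a scan of each reading frame for a start codon followed in frame by a stop codon. The sketch assumes ATG as the only start codon, the three standard stop codons, and the forward strand only (a fuller analysis would also scan the reverse complement). The function name, the minimum length and the example sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Find ORFs in the three forward reading frames.

    Returns (frame, start, end) tuples with 0-based start and exclusive end;
    the stop codon is included in the reported interval. min_codons counts
    the codons from the start codon up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                     # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None                    # look for the next start codon
    return orfs


if __name__ == "__main__":
    sequence = "CCATGGCTAAATGA"                 # made-up sequence
    for frame, start, end in find_orfs(sequence):
        print(frame, start, end, sequence[start:end])
```

Real gene-finding programs add further evidence, such as codon usage and splice signals, because many short open reading frames occur by chance.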
6.1 ALGORITHM
An algorithm is a logical sequence of steps by which a task can be performed.
It is a set of rules for calculating or solving a problem, normally
carried out by a computer program. A program is the implementation of an
algorithm. Thus an algorithm is a complete and precise specification of a
method for solving a problem.
Five important features of an algorithm are:
(i) An algorithm must stop after a finite number of steps.
(ii) All steps of an algorithm must be precisely defined.
(iii) Input to the algorithm must be specified.
(iv) Output of the algorithm must be specified.
(v) It must be effective (each operation of the algorithm must be basic).
Genetic Algorithm
The genetic algorithm is a general type of machine-learning algorithm
developed by computer scientists which has no direct relationship to biology.
It produces alignments by attempted simulation of the evolutionary changes
in sequences.
6.2 GOALS AND TYPES OF ALIGNMENT

One goal of sequence alignment is to enable us to determine whether two
sequences display sufficient similarity such that an inference of homology is
justified. As genetic information is passed on from one generation to the next,
the information gets altered slightly during the process of copying.
The changes that occur during divergence from the common ancestor
can be categorized as substitutions, insertions and deletions. These changes
can accumulate as the generations pass by. After several thousand
generations, a considerable amount of divergence may have set in.
Comparison of two supposedly homologous sequences will show how much
evolutionary change has taken place between them.
Global vs. Local alignment

There are two types of alignment: global alignment and local alignment (Fig. 6.1). In
global alignment an attempt is made to align the entire sequence, using as
many characters as possible, up to both ends of each sequence. In local
alignment, stretches of sequence with the highest density of matches are
aligned, thus generating one or more islands of matches or subalignments in
the aligned sequences.
Global alignment
L G P S S K Q T G K G S S R I W D N
|         |     | | |     |     |
L N I T I K S A G K G A M R L G D A
Local alignment
T G K G
  | | |
A G K G
Fig. 6.1 Distinction between global and local alignments of two sequences
Sequences that are quite similar and approximately the same length are
suitable candidates for global alignments. Local alignments are more suitable
for aligning sequences that are similar along some of their lengths but
dissimilar in others, sequences that differ in length or sequences that share a
conserved region or domain.
In Fig. 6.1 the global alignment is stretched over the entire sequence
length to include as many matching amino acids as possible up to and
including the ends of sequences. Vertical bars between the sequences indicate
the presence of identical matches. In the local alignment, alignment stops at
the ends of regions of identity or strong similarity. Priority is given to finding
these local regions.
In summary, global alignment considers similarity across the full extent of
the sequences, whereas local alignment focuses on regions of similarity in
parts of the sequences only.
A search for local similarity may produce more biologically meaningful
and sensitive results than a search attempting to optimize alignment over
the entire sequence length because usually the functional sites are localized
to relatively short regions, which are conserved irrespective of deletions or
mutations in intervening parts of the sequence.
Optimal Alignment
An optimal alignment is one that maximizes the score, that is, one that
exhibits the most correspondences and the fewest differences. A suboptimal
alignment is one whose score is below the optimum
level. In an optimal alignment, non-identical characters and gaps
are placed to bring as many identical or similar characters as possible into
vertical register.
Optimal alignments provide useful information to biologists concerning
sequence relationships by giving the best possible information as to which
characters in a sequence should be in the same column in an alignment and
which are insertions in one of the sequences (or deletions in the other). This
information is important for making functional, structural and evolutionary
predictions on the basis of sequence alignment.
Parametric Sequence Comparison and Bayesian Statistical Method
Parametric sequence comparison refers to computer methods that are used to
find a range of possible alignments in response to varying the scoring system
used for matches, mismatches, and gaps. There is also an effort to use scores
such that the results of global and local types of sequence alignments provide
consistent results. Some of the programs are Xparal and Bayes block aligner.
Bayesian statistical methods are also used to produce alignments between
pairs of sequences and to calculate distances between sequences.
6.3 STUDY OF SIMILARITIES

Sequence similarity searches of a database enable us to extract sequences that
are similar to a query sequence. The extracted sequence, for which functional
and structural information is available will help us to predict the structure
and function of the query sequence. Generally a database scanning is done to
find homologs. The speed and sensitivity of a search depend not only on the
program used, but also on the computer hardware, the database being
scanned and the length of the target sequence. In a typical database scan, the
sequence under investigation is aligned against each database entry.
The aligning of two sequences is termed pairwise alignment.
Sequence similarity searches employ a query sequence (called the probe) and
the subject sequence. The relationship between the two can be quantified and
their similarity assessed. To identify an evolutionary relationship between a
newly determined sequence and a known gene family, the extent of shared
similarity is assessed. If the degree of similarity is low, the relationship is
putative.
Computer assisted dynamic programming algorithms are similarity
searching methods which involve matching of the query sequence to the
sequence deposited in the database. A similarity score is calculated by
measuring the closeness between the residues (closeness is the number of
nucleotide bases or amino acid residues that are similar between the
compared sequences).
The Needleman-Wunsch algorithm is used in global alignment to find
similarity between sequences across their entire length. This is a matrix-based
approach. The Smith-Waterman algorithm is used in local alignment to find
similarity between sequences across only a small part of each sequence. This
is also a matrix-based approach, and it is often quoted as a benchmark when
comparing different alignment techniques.
Pairwise comparison is a fundamental process in sequence analysis. A
sequence consists of letters selected from an alphabet. The complexity of the
alphabet is 4 for DNA and 20 for proteins. Sometimes additional characters
are used in an alphabet to indicate a degree of ambiguity in the identity of a
particular residue or base. A simple approach to determine similarity
between two sequences is to line up the sequences against each other and
insert additional characters to bring the two strings into vertical alignment
(Fig. 6.2).
Sequence a   OPQRSTUVW
             |||| ||||
Sequence b   OPQR-TUVW
Fig. 6.2 Alignment of two sequences with vertical bars and gaps. Vertical bar (|) denotes
identical matches and horizontal bar (-) denotes gap.
Gaps and Mismatches
We could score the alignment by counting the positions at which the two
sequences match identically. The quality of an alignment can be measured in
terms of the number of gaps introduced and the number of mismatches
remaining in the alignment. A comprehensive alignment must account fully
for the positions of all residues in both sequences. This means that many
residues may have to be placed at positions that are not strictly identical, or
that gaps have to be inserted. If gaps are inserted freely, their positioning
becomes more complex, and the algorithms produce alignments containing
very large proportions of matching letters and large numbers of gaps.
Although this process achieves an optimum score and is mathematically
meaningful, the result would be biologically meaningless
because insertion and deletion of monomers is a relatively slow evolutionary
process. Dynamic programming algorithms use gap penalties to maximize
the biological meaning. A simple score contains a positive additive
contribution of 1 for every matching pair of letters in the alignment, and a gap
penalty is subtracted for each gap that has been introduced (there are different
kinds of gap penalty, such as the constant penalty, the proportional penalty and
the affine gap penalty, which includes gap opening and gap extension penalties).
The total alignment score is then a function of the identity between aligned
residues and the gap penalties incurred.
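The simple scoring scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name score_alignment and the default penalty values are hypothetical and are not taken from any particular program. Setting gap_extend equal to gap_open gives a proportional penalty, setting gap_extend to 0 gives a constant penalty, and unequal values give an affine penalty.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap_open=2, gap_extend=1):
    """Score a gapped pairwise alignment given as two equal-length rows.

    '-' marks a gap. The penalty values are illustrative assumptions.
    Simplification: a gap run in one row followed directly by a gap run
    in the other row is treated as a single gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    in_gap = False  # True while we are inside a run of gap columns
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            # The first column of a gap pays the opening penalty,
            # every further column pays the extension penalty.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


# The alignment of Fig. 6.2: eight identical pairs and one gap.
fig_score = score_alignment("OPQRSTUVW", "OPQR-TUVW")
```

With the default values assumed here, the alignment of Fig. 6.2 earns 8 for its matches and loses 2 for opening one gap.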
Levenshtein Distance (Edit Distance) and Hamming Distance

Distance treats sequences as points in a metric space. A distance measure is a
function that also associates a numeric value with a pair of sequences, but
with the idea that the larger the distance, the smaller the similarity, and vice
versa. Distance measures usually satisfy the mathematical axioms of a metric.
In most cases, distance and similarity measures are interchangeable in the
sense that a small distance means high similarity, and vice versa.
The process of alignment can be measured in terms of the number of gaps
introduced and the number of mismatches remaining. These measures are related
to the Levenshtein distance and the Hamming distance, respectively. The Levenshtein
distance (edit distance) is the minimal number of edit operations required to
change one string (sequence) into the other, where an edit operation is a deletion,
insertion or alteration of a single character in either sequence.
The Hamming distance between two sequences of equal length is the
number of positions with mismatched characters. It is desirable to assign
variable weights to different edit operations since certain changes are more
likely to occur naturally than others.
A string is used to represent text (a sequence) in the Perl programming language.
Strings are usually surrounded by single or double quotation marks (e.g. ‘I
am a string’). Given two character strings, the distance between them can be
measured by the Hamming distance or the Levenshtein (edit) distance. A given
sequence of edit operations induces a unique alignment, but not vice versa.
Example
agtc
cgta      Hamming distance = 2
ag-tcc
cgctca    Levenshtein distance = 3
Hamming and Levenshtein distances measure the dissimilarity of two
sequences: similar sequences give small distances and dissimilar sequences
give large distances.
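Both distances can be computed with short functions. The following is a minimal sketch in Python; the function names are chosen here for illustration. The Levenshtein function uses the usual dynamic programming recurrence with unit cost for every insertion, deletion and alteration, so it does not implement the variable weights mentioned above.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimal number of single-character edits that turn a into b."""
    # prev[j] holds the distance between the prefix of a processed so far
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # keep or alter
        prev = curr
    return prev[-1]


d_hamming = hamming("agtc", "cgta")
d_edit = levenshtein("agtcc", "cgctca")
```

The two calls at the end reproduce the distances of the example: the first pair differs at two positions, and the second pair needs three edit operations.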
High Scoring and Low Scoring Matches

Amino acid substitutions tend to be conservative: the replacement of one
amino acid by another with similar size or physicochemical properties is more
likely to occur than its replacement by another amino acid with very different
properties. Therefore algorithms use different measures to compute
and score alignments. Similar sequences give high scores and dissimilar
sequences give low scores. An algorithm for optimal alignment can seek either
to minimize a dissimilarity measure (such as the Levenshtein distance or the
Hamming distance) or to maximize a scoring function.
Sequence comparison generally involves full length sequences and a
comprehensive alignment requires that many residues have to be placed at
positions that are not strictly identical. For a biologically meaningful
comparison, the positioning of gaps and maximizing the number of identical
matches have to be balanced. To achieve the optimum score, penalties are
introduced to minimize the number of gaps and extension penalties are
added when the gap is extended.
One of the important tasks of sequence analysis is to distinguish
between high-scoring matches that have only mathematical significance and
lower-scoring matches that are biologically meaningful.
Uses
Sequence alignment is useful to discover functional, structural and
evolutionary information in biological sequences. It is important to obtain the
best possible or optimal alignment to discover this information. Sequences
that are very similar probably have the same function, be it a
regulatory role in the case of similar DNA molecules, or a similar
biochemical function and three-dimensional structure in the case of proteins.
Additionally, if two sequences from different organisms are similar,
there may have been a common ancestor sequence, and the sequences are
said to be homologous. The alignment indicates the changes that could have
occurred between two homologous sequences and a common ancestor
sequence during evolution.
Database similarity searching allows us to determine which of the
hundreds of thousands of sequences present in the database are potentially
related to a particular sequence of interest. The first discovery of this kind
was made in 1983, when Doolittle and Waterfield found that the viral
oncogene v-sis is a modified form of the normal cellular gene
that encodes platelet-derived growth factor. Dynamic programming algorithms
find the best alignment of two sequences for given substitution matrices and
gap penalties. This process is often very slow.
6.4 SCORING MUTATIONS, DELETIONS AND SUBSTITUTIONS

Due to random mutations, nucleotides may be replaced, deleted or inserted.
Most surviving mutations result in the exchange of one amino acid for another
amino acid with very similar physicochemical properties, so that the protein is
not affected functionally. Loss of the function of a protein is usually a
disadvantage to the organism.
Hence any change will survive only if it does not have a deleterious effect
on the structure and function of the protein. If the change is very deleterious to
the organism, such mutations will stop spreading in the population since the
organism cannot survive. Therefore, most of the substitution mutations are
well tolerated in the protein. The substitution that does not affect the protein’s
property or function is called conservative substitution.
Usually protein coding genes evolve much more slowly than most other
parts of any genome, because of the need to maintain protein structure and
function. When evolutionary changes do occur in protein sequence, they tend
to involve substitutions between amino acids with similar properties,
because such changes are less likely to affect the structure and function of the
protein.
Protein sequences from within the same evolutionary family usually
show substitutions between amino acids with similar physicochemical
properties. A substitution score matrix is used to show scores for amino acid
substitutions. When comparing proteins, we can increase sensitivity to weak
alignments through the use of a substitution matrix.
Amino Acid Substitution Matrix

Scientists discovered that certain amino acid substitutions commonly occur
in related proteins from different species. Because the protein functions with
these substitutions, the substituted amino acids are compatible with protein
structure and function. Often, these substitutions are to a chemically similar
amino acid, but other changes also occur. Yet other substitutions are
relatively rare. Knowing the types of changes that are most and least common
in a large number of proteins can assist in predicting alignments for any set of
protein sequences.
If related protein sequences are quite similar, they are easy to align, and
one can readily determine the single-step amino acid changes. If ancestor
relationships among a group of proteins are assessed the most likely amino
acid changes that occurred during evolution can be predicted. This type of
analysis was pioneered by Margaret Dayhoff.
Amino acid substitution matrices or symbol comparison tables are used
for such purposes. In these matrices amino acids are listed both across the top
of a matrix and down the side, and each matrix position is filled with a score
that reflects how often one amino acid would have been paired with the
other in an alignment of related protein sequences.
The probability of changing amino acid A into B is always assumed to be
identical to the reverse probability of changing B into A. This assumption is
made because, for any two sequences, the ancestor amino acid in the
phylogenetic tree is usually not known. Additionally, the likelihood of
replacement should depend on the product of the frequency of occurrence of
the two amino acids and on their chemical and physical similarities. A
prediction of this model is that amino acid frequencies will not change over
evolutionary time.
When calculating alignment scores, identical amino acids should be
given greater value than substitutions and among substitutions conservative
substitutions should be given greater value than nonconservative
substitutions. Two popular families of matrices, the Dayhoff mutation data (MD)
matrices and BLOSUM, have been devised to weight matches between non-identical
residues according to observed substitution rates across large evolutionary
distances. The MD score is based on the concept of the Point Accepted Mutation
(PAM).
Percent Accepted Mutation (PAM) Matrix

This matrix lists the likelihood of change from one amino acid to another
in homologous protein sequences during evolution. Each matrix gives the
changes expected for a given period of evolutionary time, evidenced by
decreased sequence similarity as genes encoding the same protein diverge
with increased evolutionary time. Thus, one matrix gives the changes
expected in homologous proteins that have diverged only a small amount
from each other in a relatively short period of time, so that they are still 50% or
more similar. Another matrix gives the changes expected in proteins that
have diverged over a much longer period, leaving only 20% similarity.
The predicted changes are used to produce optimum alignments
between two protein sequences and to score the alignment. The assumption
in this evolutionary model is that the amino acid substitutions observed over
short periods of evolutionary history can be extrapolated to longer distances.
In deriving the PAM matrices, each change in the current amino acid at a
particular site is assumed to be independent of previous mutational events at
that site. Thus, the probability of change of one amino acid to another amino acid
is the same, regardless of the previous changes at that site and also regardless
of the position of the first amino acid in a protein sequence.
Amino acid substitutions in a protein sequence are viewed as a Markov
model, characterized by a series of changes of state in a system such that a
change from one state to another does not depend on the previous history of
the state. Use of this model makes it possible to extrapolate amino acid
substitutions observed over a relatively short period of evolutionary time to
longer periods of evolutionary time.
To prepare the Dayhoff PAM matrices, amino acid substitutions that
occur in a group of evolving proteins were estimated using 1572 changes in 71
groups of protein sequences that were at least 85% similar. Because these
changes are observed in closely related proteins, they represent amino acid
substitutions that do not significantly change the function of the protein.
Hence they are called ‘accepted mutations’, defined as amino acid changes
‘accepted’ by natural selection.
Similar sequences were first organized into a phylogenetic tree. The
number of changes of each amino acid into every other amino acid was then
counted. To make these numbers useful for sequence analysis, information
on the relative amount of change for each amino acid was needed.
Relative mutabilities were evaluated by counting, in each group of
related sequences, the number of changes of each amino acid and by dividing
this number by a factor called the exposure to mutation of the amino acid.
This factor is the product of the frequency of occurrence of the amino acid in
that group of sequences and the total number of all amino acid
changes that occurred in that group per 100 sites. This factor normalizes the
data for variation in amino acid composition, mutation rate, and sequence
length. The normalized frequencies were then summed for all sequence
groups. By these scores, Asn, Ser, Asp, and Glu were the most mutable amino
acids, and Cys and Trp were the least mutable.
The above amino acid exchange counts and mutability values were then
used to generate 20 × 20 mutation probability matrix representing all possible
amino acid changes. Because amino acid change was modeled by a Markov
model, the mutation at each site being independent of the previous
mutations, the changes predicted for more distantly related proteins that
have undergone many (N) mutations can be calculated. By this model, the
PAM1 matrix could be multiplied by itself N times, to give transition
matrices for comparing sequences with lower and lower levels of similarity
due to separation of longer periods of evolutionary history.
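The extrapolation step can be illustrated with a toy example. The sketch below assumes a hypothetical two-letter alphabet and an invented 2 × 2 matrix in which each letter has a 1% chance of changing in one PAM unit; the real PAM1 matrix is 20 × 20 and its values come from the Dayhoff data, which are not reproduced here. Only the repeated multiplication is the point of the example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply matrix m by itself n times (n = 0 gives the identity)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Invented "PAM1-like" matrix for a two-letter alphabet (an assumption
# made for illustration): each row gives the probabilities that a letter
# stays the same or changes during one PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(toy_pam1, 250)
```

In this toy case the probability that a letter is unchanged falls from 0.99 after one step to about 0.5 after 250 steps, which shows why matrices for larger evolutionary distances are suited to sequences with lower levels of similarity.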
One PAM is a unit of evolutionary divergence in which 1% of the amino
acids has been changed (i.e. one point mutation per 100 residues). This does
not mean that after 100 PAMs every amino acid will be different; some
positions may change several times, and some may even revert back to the
original amino acid and some may not change at all. If there was no selection
for fitness, the frequencies of each possible substitution would be primarily
influenced by overall frequencies of the different amino acids (background
frequencies). However, in related proteins, the observed substitution
frequencies (target frequencies) are biased toward those that do not seriously
disrupt the protein’s function.
PAM matrices are usually converted into another form, called log-odds
matrices. The odds score represents the ratio of the chance of an amino acid
substitution under two different hypotheses: one that the change actually
represents an authentic evolutionary variation at that site (the numerator), and
the other that the change occurred because of random sequence variation of no
biological significance (the denominator). Odds ratios are converted to logarithms
to give log odds scores, so that the odds scores of the amino
acid pairs in an alignment can be multiplied conveniently by adding their logarithms.
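The relation between odds ratios and log odds scores can be checked numerically. The sketch below is an illustration only: the frequencies are invented, the function names are hypothetical, and the logarithm base is left as a parameter because different matrices use different bases and scaling factors.

```python
import math


def odds_ratio(target_freq, freq_a, freq_b):
    """Observed pair frequency divided by the frequency expected by chance."""
    return target_freq / (freq_a * freq_b)


def log_odds(target_freq, freq_a, freq_b, base=2.0):
    """Logarithm of the odds ratio; the base is an assumed parameter."""
    return math.log(odds_ratio(target_freq, freq_a, freq_b), base)


# Two aligned pairs with invented frequencies: the first pair is seen
# twice as often as expected by chance, the second half as often.
ratios = [odds_ratio(0.02, 0.1, 0.1), odds_ratio(0.005, 0.1, 0.1)]
product_of_odds = ratios[0] * ratios[1]
sum_of_logs = log_odds(0.02, 0.1, 0.1) + log_odds(0.005, 0.1, 0.1)
```

A pair that occurs more often than expected by chance receives a positive score and a pair that occurs less often receives a negative score. The sum of the two log odds scores equals the logarithm of the product of the two odds ratios.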
Each PAM matrix is designed to score alignments between sequences
that have diverged by a particular degree of evolutionary distance.
Dayhoff and coworkers were the first to use a log-odds approach in
which the substitution scores in the matrix are proportional to the natural log
of the ratio of target frequencies to background frequencies. To estimate the
target frequencies, pairs of very closely related sequences are used to collect
mutation frequencies corresponding to 1 PAM, and these data are used to
extrapolate to a distance of 250 PAMs. (Note that PAM matrices are derived
by counting observed evolutionary changes in closely related protein
sequences, and then extrapolating the observed transition probabilities to
longer evolutionary distances). It is possible to derive PAM matrices for any
evolutionary distance but in practice, the most commonly used matrices are
PAM120 and PAM250; of these two, PAM250 matrix produces reasonable
alignments.
Block Amino Acid Substitution Matrices (BLOSUM)

The BLOSUM62 substitution matrix is widely used for scoring protein
sequence alignments. The matrix values are based on the observed amino
acid substitutions in a large set of more than 2000 conserved amino acid
patterns, called blocks. These blocks have been found in a database of protein
sequences representing more than 500 families of related proteins and act as
signatures of these protein families. The BLOSUM matrices are based on an
entirely different type of sequence analysis and a much larger data set than
the Dayhoff PAM Matrices.
The PROSITE catalog provides lists of proteins that are in the same family
because they have a similar biochemical function. For each family, a pattern
of amino acids that is characteristic of that function is provided. Henikoff
and Henikoff examined each PROSITE family for the presence of ungapped
amino acid patterns (blocks) that could be used to identify members of that
family.
To locate these patterns the sequences of each protein family were
searched for similar amino acid patterns by the MOTIF program. These initial
patterns were organized into larger ungapped patterns (blocks) between 3
and 60 amino acids long by the Henikoffs’ PROTOMAT program
(www.blocks.fhcrc.org). Because these blocks were present in all of the
sequences in each family, they could be used to identify other members of
that family.
The blocks that characterized each family provided a type of multiple
sequence alignment for that family. The amino acid changes that were
observed in each column of the alignment could then be counted. The types of
substitutions were then scored for all aligned patterns in the database and
used to prepare a scoring matrix, the BLOSUM matrix, indicating the
frequency of each type of substitutions. BLOSUM matrix values were given
as logarithms of odds scores of the ratio of the observed frequency of amino
acid substitutions divided by the frequency expected by chance.
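The counting of substitutions in the columns of a block, and the conversion of the counts to log odds scores, can be sketched as follows. This is a simplified illustration, not the Henikoffs' program: the block is an invented three-sequence example, sequences are not clustered by percentage identity, and the scaling of the score as twice the base-2 logarithm is an assumption made here.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count every pair of residues found in each column of a block."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


def log_odds_scores(block):
    """Log odds score for every residue pair observed in the block."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: c / total for pair, c in counts.items()}
    # Probability of meeting each residue in a pair (background frequency).
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        # An unordered mixed pair can arise in two ways, hence the factor 2.
        expected = (background[a] * background[b] if a == b
                    else 2 * background[a] * background[b])
        scores[(a, b)] = 2 * math.log2(freq / expected)
    return scores


toy_block = ["AAC", "AAC", "ASC"]   # invented ungapped block
toy_counts = pair_counts(toy_block)
toy_scores = log_odds_scores(toy_block)
```

In the invented block the three columns contribute four A-A pairs, two A-S pairs and three C-C pairs, and the scores compare these observed frequencies with the frequencies expected by chance.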
The procedure of counting all of the amino acid changes in the blocks,
however, can lead to an overrepresentation of amino acid substitutions that
occur in the most closely related members of each family. To reduce this
dominant contribution from the most alike sequences, these sequences were
grouped together into one sequence before scoring the amino acid
substitutions in the aligned blocks. The amino acid changes within these
clustered sequences were then averaged. Patterns that were 60% identical
were grouped together to make one substitution matrix called BLOSUM60,
and those 80% alike to make another matrix called BLOSUM80, and so on.
The BLOSUM matrices are based on scoring substitutions found over a range
of evolutionary periods.
Like PAM, BLOSUM is based on target frequencies
of substitutions. BLOSUM makes use of the BLOCKS database for deriving the
substitution frequencies, and the numbers attached to BLOSUM matrices do not
have the same interpretation as those for PAM matrices. When deriving
BLOSUM matrices, any bias potentially introduced by counting multiple
contributions from identical residue pairs is removed by clustering sequence
segments on the basis of minimum percentage identity. Here, effectively, the
clusters are treated as single sequences. Blocks contain local multiple
alignments of distantly related sequences (as against the closely related
sequences used for PAM). Unlike PAM, BLOSUM has no explicit evolutionary
model in its matrix formation, since it is derived from direct data rather than
from extrapolated values.
6.5 SEQUENCE ALIGNMENT METHODS

Similarities between sequences can be studied using different methods such
as dotplot method and dynamic programming algorithms such as
Needleman-Wunsch algorithm and the Smith-Waterman algorithm and
word or k-tuple methods such as used by FASTA and BLAST programs.
Alignment of two sequences (pairwise alignment) is performed using the
following methods:
(i) Dot matrix analysis
(ii) The dynamic programming (DP) algorithm
(iii) Word or k-tuple methods such as used by FASTA and BLAST
programs.
Alignment of three or more sequences is done using multiple
sequence alignment methods. Some of the methods are: (i) Profiles, (ii) Blocks,
(iii) Fingerprints, (iv) PSI-BLAST and (v) Hidden Markov Models (HMMs).
6.6 PAIRWISE ALIGNMENT

When the sequence alignment aligns two sequences one below the other and
scores the similarities, it is referred to as pairwise alignment. The challenge in
pairwise sequence alignment is to find the optimum alignment of two sequences
with some degree of similarity. Various computer programs assist in this.
Dot Matrix
A dot matrix analysis is primarily a method for comparing two sequences to
look for possible alignment of characters between the sequences. The method
is also used for finding direct or inverted repeats in protein and DNA
sequences, and for predicting regions in RNA that are self-complementary
and that have potential of forming secondary structure.
The major advantage of the dot matrix method for finding sequence
alignments is that all possible matches of residues between two sequences are
found, leaving the investigator the choice of identifying the most significant
ones. Then sequences of the actual region that align can be detected by using
other methods of sequence alignment, e.g. dynamic programming.
Alignments generated by these programs can be compared to the dot matrix
alignment to find out whether the longest regions are being matched and
whether insertions and deletions are located in the most reasonable places.
Detection of matching regions may be improved by filtering out
random matches in a dot matrix. Filtering is achieved by using a sliding
window to compare the two sequences at the same time. Identification of
sequence alignments by the dot matrix method can be aided by performing a
count of dots in all possible diagonal lines through the matrix to determine
statistically which diagonals have the most matches, and comparing these
match scores with the results of random sequence comparison.
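The sliding window filter and the count of dots along diagonals can be sketched as below. The names window and stringency are labels chosen here for the window length and for the minimum number of matches within a window; they are assumptions for this illustration and are not taken from a particular program.

```python
from collections import Counter


def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) positions at which a dot is placed.

    A dot is placed when at least `stringency` of the `window` residue
    pairs starting at positions i and j are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def diagonal_counts(dots):
    """Count the dots on each diagonal, identified by the offset j - i."""
    return Counter(j - i for i, j in dots)


unfiltered = dot_matrix("ACGT", "ACGT")
filtered = dot_matrix("ACGT", "ACGT", window=2, stringency=2)
per_diagonal = diagonal_counts(unfiltered)
```

With a window of 1 every identical pair gives a dot. A longer window with a high stringency removes isolated random matches, and the diagonal with the largest count points to the most likely alignment.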
Dot matrix analysis can also be used to find direct and inverted repeats
within sequences. Repeated regions in whole chromosomes may be detected.
Direct repeats may also be found by performing sequence alignments with
dynamic programming methods. A dot matrix analysis can also reveal the
presence of repeats of the same sequence character.
Dot matrix method displays any possible sequence alignments as
diagonals on the matrix. Dot matrix analysis can readily reveal the presence
of insertions/ deletions and direct and inverted repeats that are more
difficult to find by the other, more automated methods.
Dotplot is a simple visual approach to compare two sequences. It is a
table or matrix. It gives quick pictorial statement of the relationship between
two sequences. The two sequences to be compared are plotted on the X and Y
axis of a graph. Wherever a base or residue of one axis coincides with a base
or residue on the other axis, it is marked with a dot. The plot is characterized
by some apparently random dots and a central diagonal line where a high
density of adjacent dots indicates the regions of greatest similarity between the
two sequences (Fig. 6.3).
[Dotplot of the sequence MTFRDLLSVSFEGPRPDSSAGGSSAGG compared with itself; the matrix is not reproduced here.]
Fig. 6.3 Illustration of the manner of construction of the dotplot matrix, using a simple
residue identity matrix to score an ‘X’ where a pair of identical residues is observed.
(Source: Attwood, T.K. and Parry-Smith, D.J., Introduction to Bioinformatics, Pearson
Education Ltd., 2001)
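A dotplot of the kind shown in Fig. 6.3 can be produced as plain text with a few lines of Python. The sketch below compares the sequence of the figure with itself and uses simple residue identity; the function name and the layout of the output are choices made here for illustration.

```python
def render_dotplot(seq_x, seq_y, mark="X"):
    """Return a text dotplot with seq_x across the top and seq_y down the side."""
    lines = ["  " + seq_x]
    for residue_y in seq_y:
        row = "".join(mark if residue_x == residue_y else " "
                      for residue_x in seq_x)
        lines.append(residue_y + " " + row)
    return "\n".join(lines)


seq = "MTFRDLLSVSFEGPRPDSSAGGSSAGG"
plot = render_dotplot(seq, seq)
print(plot)
```

Because the sequence is compared with itself, the main diagonal is completely filled, and the repeated residues near the end of the sequence appear as additional short diagonals away from the main one.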
Dynamic Programming
Dynamic programming is a computational method that is used to align two
protein or nucleic acid sequences. The method is very important for sequence
analysis because it provides the very best alignment or optimal alignment
between sequences.
The method compares every pair of characters in the two sequences and
generates an alignment. This alignment will include matched and
mismatched characters and gaps in the two sequences that are positioned so
that the number of matches between identical or related characters is the
maximum possible. The dynamic programming algorithm provides a reliable
computational method for aligning DNA and protein sequences. Both global
and local types of alignments may be made by simple changes in the basic
dynamic programming algorithm.
A global alignment program is based on the Needleman-Wunsch
algorithm and a local alignment program is based on the Smith-Waterman
algorithm. Another feature of the dynamic programming algorithm is that the
alignments obtained depend on the choice of a scoring system for comparing
character pairs and penalty scores for gaps. For protein sequences, the simple
system of comparison is one based on identity. A match in an alignment is
only scored if the two aligned amino acids are identical.
The dynamic programming method, first used for global alignment of
sequences by Needleman and Wunsch and for local alignment by Smith and
Waterman, provides one or more alignments of the sequences. An alignment
is generated by starting at the ends of the two sequences and attempting to
match all possible pairs of characters between the sequences and by
following a scoring scheme for matches, mismatches and gaps.
This procedure generates a matrix of numbers that represents all
possible alignments between the sequences. The highest set of sequential
scores in the matrix defines an optimal alignment. The dynamic
programming method is guaranteed in a mathematical sense to provide the
optimal alignment for a given set of user-defined variables, including choice
of scoring matrix and gap penalties.
In the global alignment of sequences using the Needleman-Wunsch algorithm,
the optimal score at each matrix position
is calculated by adding the current match score to previously scored positions
and subtracting gap penalties. Each matrix position may have a positive or
negative score, or 0.
The Needleman-Wunsch algorithm will maximize the number of matches
between the sequences along the entire length of the sequences. Gaps may also
be present at the end of sequences, in case there is extra sequence left over after
the alignment. These end gaps are often but not always, given a gap penalty.
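A minimal sketch of global alignment by dynamic programming is given below. It follows the three usual steps of creating the matrix, filling it with scores and tracing back. The scoring values (match +1, mismatch -1, and a proportional gap penalty of 1 that is also charged for end gaps) are assumptions chosen for illustration, and only one of the possibly several optimal alignments is returned.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Initialization: leading gaps are penalized like any other gap.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Matrix fill: best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom right corner to the top left corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


global_result = needleman_wunsch("OPQRSTUVW", "OPQRTUVW")
```

For the two sequences of Fig. 6.2 the sketch returns the alignment shown in that figure, with eight matches and one gap.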
A local sequence alignment, which gives the highest-scoring local match
between two sequences using the Smith-Waterman algorithm, often gives
more meaningful matches than a global alignment when only parts of the
sequences are related. Patterns that are conserved in the sequences are
highlighted. A local alignment tends to be shorter and may not include many gaps.
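Local alignment differs from global alignment in two small ways: no matrix position is allowed to fall below 0, and the trace back starts at the highest score in the matrix and stops when a 0 is reached. The sketch below shows this under assumed scoring values (match +2, mismatch -1, gap -2); it returns only one highest-scoring local alignment.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A score is never allowed to become negative.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest score until a 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


local_result = smith_waterman("TTTACGTAAA", "GGGACGTCCC")
```

In this invented example the two sequences share only the stretch ACGT, and the local alignment reports that stretch alone instead of forcing the unrelated ends into the alignment.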
Using a distance scoring scheme, dynamic programming method could
be used to provide an alignment that highlights the evolutionary changes.
This method scores alignments based on differences between sequences and
sequence characters, i.e., how many changes are required to change one
sequence into another. The greater the distance between sequences, the
greater the evolutionary time that has elapsed since the sequences diverged
from a common ancestor.
The first step in a global alignment dynamic program is to create a matrix
with M+1 columns and N+1 rows, where M and N correspond to the sizes of
the sequences to be aligned. The next step is to score (matrix fill) and the final
step is to align (trace back).
Procedure
Go to the EBI alignment site (www.ebi.ac.uk/align). Once the home page appears,
select the method, local or global. Paste the sequences of interest in the text box.
Then press the RUN button.
Word or k-Tuple
The word or k-tuple methods are used by the FASTA and BLAST algorithms. They
align two sequences very quickly, first by searching for identical short
stretches of sequences called words or k-tuples and then by joining these
words into an alignment by the dynamic programming method. These
methods are fast enough to be suitable for searching an entire database for
the sequence that aligns best with an input test sequence. The FASTA and
BLAST methods are heuristic, i.e., they are empirical methods of computer
programming in which rules of thumb are used to find solutions and
feedback is used to improve performance.
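The search for identical words can be sketched with a lookup table. In the illustration below every word of length k in the query is stored with its position, the subject sequence is scanned once, and the hits are grouped by diagonal. The function names are hypothetical, overlapping words are used for simplicity, and the later steps of joining and scoring the words are left out.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Map every word of length k to the positions at which it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def shared_words(query, subject, k):
    """List (query_pos, subject_pos) for every identical word of length k."""
    table = word_positions(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Return (diagonal offset, number of word hits) for the richest diagonal."""
    counts = Counter(j - i for i, j in hits)
    return counts.most_common(1)[0] if counts else None


word_hits = shared_words("ACGTAC", "TTACGT", 3)
top_diagonal = best_diagonal(word_hits)
```

Because the table is built once and each subject word is looked up directly, the cost grows with the lengths of the sequences and the number of hits, not with the product of the lengths, which is what makes word methods fast enough for database searches.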
In database searching, the basic operation is to align the query sequence
to each of the subject sequences in the database. A method that can do this
in a faster manner is therefore preferable to the dynamic programming
algorithms.
FASTA
FASTA is a DNA and protein sequence alignment software package. It was
first described by David J. Lipman and William R. Pearson in 1985 as FASTP
dealing with only protein sequences. In 1988 the ability to search DNA
sequences was added.
Procedure:
Open the internet browser and type the URL address:
http://fasta.adbj.nig.ac.jp/top.e.html. The results can be received at any e-mail
address.
FASTA compares a nucleotide sequence with a nucleotide sequence
database or an amino acid sequence with an amino acid sequence database. It
compares a nucleotide sequence with an amino acid sequence database by
translating the sequence in all six possible reading
frames. It compares an amino acid sequence with a nucleotide sequence database
by translating the database sequences in all six possible
reading frames.
In a further mode, it compares an amino acid sequence with a nucleotide
sequence database by translating the database sequences in all six possible
reading frames while also taking frame-shift mutations into account.
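Translation in all six reading frames can be sketched as follows. The example assumes the standard genetic code, written in the compact form in which the 64 codons are ordered with the bases T, C, A, G; codons containing other characters are translated as X. Frame-shift handling is not included, and the function names are chosen here for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate both strands in each of the three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

The result is a dictionary with the keys +1, +2, +3, -1, -2 and -3, one for each reading frame, so that a protein query can be compared with every possible translation of a nucleotide sequence.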
We must specify the database in which homologous sequences are
searched. We must specify the division in which homologous sequences are
searched. We must specify how many homologous sequences are reported in
the list of homology scores. Default value is 100. We must specify how many
alignments with homologous sequences are reported. Default value is 100. We
must specify the degree of sensitivity (ktup) of the search. Usually the ktup
value is recommended to be set at 3-6 for nucleotide sequences and 1-2 for
amino acid sequences. The lower the ktup value, the more sensitive the search. The
ktup value determines how many consecutive identities are required for a
match to be declared.
The FASTA program achieves a high level of sensitivity for similarity
searching at high speed. FASTA uses optimized local alignment and
a substitution matrix for its sensitivity. First, FASTA prepares a list of words
from the pair of sequences to be matched. A word is simply
3-6 nucleotides or 1 or 2 amino acids. It uses non-overlapping words. It
matches the words and counts the matches.
Similar to dot matrix plotting and scoring, it creates the word diagonal
and finds a high scoring match. The output is labeled as init1. Only if the
score is sizable does it proceed to the second level. In the second level, for
every best hit of words, it looks for neighboring approximate hits and, if the
score value is good, it collects the short segments of init1 and prepares a
larger dot matrix diagonal, which it scores after including gap size and gap
penalty.
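The word-matching stage can be sketched in a few lines. This is a simplified illustration and not the FASTA implementation: it indexes a word at every position of the query (an assumption made for simplicity), counts exact word matches only, and reports the diagonal (target position minus query position) that collects the most matches. All function names are hypothetical.

```python
from collections import defaultdict


def ktup_positions(seq, ktup):
    """Map every word of length ktup to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i)
    return table


def diagonal_hits(query, target, ktup=2):
    """Count exact word matches on each diagonal (target pos - query pos)."""
    table = ktup_positions(query, ktup)
    hits = defaultdict(int)
    for j in range(len(target) - ktup + 1):
        for i in table.get(target[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query, target, ktup=2):
    """Return (diagonal, count) for the diagonal with most word matches."""
    hits = diagonal_hits(query, target, ktup)
    return max(hits.items(), key=lambda kv: kv[1]) if hits else None
```

A diagonal that collects many word matches corresponds to a run of similarity in a dot matrix plot; the later stages rescore such regions with a substitution matrix and join them with gaps.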
The best score from this second level scoring is called initn. The initn
scores are saved for each comparison of a query sequence with a database
sequence. After all the database sequences are tested, the sequences that
produce best initn scores are used to produce local alignment using Smith-
Waterman algorithm, to give the opt score.
FASTA format contains a one-line header followed by lines of sequence
data. Each sequence in a FASTA formatted file is preceded by a line starting
with a ‘>’ symbol. The first word on this line is the name of the sequence,
and the rest of the line is a description of the sequence. The remaining lines
contain the sequence itself. Blank lines in a FASTA file are ignored and so are
spaces or other gap symbols in a sequence. FASTA files containing multiple
sequences are just the same, with one sequence listed after the other. This
format is accepted by many multiple sequence alignment programs.
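A minimal reader for this format might look as follows. It is a sketch under stated assumptions: the whole file is passed in as a string, whitespace inside sequence lines is dropped, and other gap symbols are left untouched. The function name parse_fasta is illustrative.

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name = description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # remove spaces inside the sequence line
            chunks.append("".join(line.split()))
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records
```

For a file holding several sequences, the function simply returns one tuple per ‘>’ header, in file order.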
BLAST
BLAST (Basic Local Alignment Search Tool) program was developed by
Altschul et al. in 1990. It has become very popular because of its efficiency
and firm statistical foundation. BLAST works under the assumption that
high-scoring alignments are likely to contain short stretches of identical or
near identical letters. These short stretches are called words.
The first step in BLAST is to look for words of a certain fixed word length
W that score higher than a certain threshold score (T). The value of W is
normally 3 for protein sequences or 11 for nucleic acid sequences. BLAST takes
a word from the query sequence and extends the match in both directions
along the target sequence, totalling the scores for matches, mismatches, gap
introduction and gap extension. The extension continues until a cut-off value
S is reached. BLAST extends individual word matches until the total score of
the alignment falls from its maximum value by a certain amount, producing
high scoring segment pairs.
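The seed-and-extend idea can be sketched as follows. This is a deliberately simplified illustration, not the BLAST implementation: seeds are exact words of length w (no threshold T or neighbourhood words), the extension is ungapped, and the scores (match, mismatch) and the drop-off amount (xdrop) are assumed values chosen for the example.

```python
def extend(query, target, qi, ti, step, match, mismatch, xdrop):
    """Extend in one direction; return (best score gain, length kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        score += match if query[qi] == target[ti] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= xdrop:
            break  # score has fallen too far below its maximum
        qi += step
        ti += step
    return best, best_len


def find_hsps(query, target, w=3, match=1, mismatch=-1, xdrop=3):
    """Return sorted (score, query_start, target_start, length) tuples."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hsps = set()
    for j in range(len(target) - w + 1):
        for i in index.get(target[j:j + w], ()):
            right, rlen = extend(query, target, i + w, j + w, 1,
                                 match, mismatch, xdrop)
            left, llen = extend(query, target, i - 1, j - 1, -1,
                                match, mismatch, xdrop)
            score = w * match + left + right
            hsps.add((score, i - llen, j - llen, llen + w + rlen))
    return sorted(hsps, reverse=True)
```

Several seeds on the same diagonal extend to the same segment pair, which is why the results are collected in a set before sorting.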
BLAST is a heuristic search algorithm employed by different BLAST
programs such as BLASTP, BLASTN, BLASTX, TBLASTN, TBLASTX and PSI-BLAST.
BLASTP compares an amino acid query sequence against a protein sequence
database. BLASTN compares a nucleotide query sequence against a nucleotide
sequence database. BLASTX compares six-frame conceptual translation
products of nucleotide query sequence (both strands) against a protein
sequence database. TBLASTN compares a protein query sequence against a
nucleotide sequence database dynamically translated in all six reading frames
(both strands). TBLASTX compares the six-frame translations of a nucleotide
query sequence against the six-frame translations of a nucleotide sequence
database. PSI-BLAST compares amino acid query sequence against a protein
sequence database.
The FASTA and BLAST programs are essentially local similarity search
methods that concentrate on finding short identical matches, which contribute
to a total match.
6.7 MULTIPLE SEQUENCE ALIGNMENT

A multiple sequence alignment is an alignment that contains more than two
sequences. Analysis of groups of sequences that form gene families requires
the ability to make connections between more than two members of the
group, in order to reveal subtle conserved family characteristics.
The goal of multiple sequence alignment is to generate a concise,
information-rich summary of sequence data in order to make decisions on
the relatedness of sequences to a gene family. Multiple alignment is more
informative about evolutionary conservation. To be informative a multiple
alignment should contain a distribution of closely and distantly related
sequences.
In multiple sequence alignment, sequences are aligned optimally by
bringing together the greatest number of similar characters into register in
the same column of the alignment. Multiple sequence alignment of a set of
sequences can provide information about the most similar regions in the set.
In proteins such regions may represent conserved functional or structural
domains.
If the structure of one or more members of the alignment is known, it may
be possible to predict which amino acids occupy the same spatial relationship
in other proteins in the alignment and which genes occupy sites in nucleic
acids. Multiple sequence alignment is also used for the prediction of specific
probes for other members of the same group or family of similar sequences in
the same or other organisms.
There are many methods to carry out multiple sequence alignment, such
as Profiles, Blocks, Fingerprints, etc. Profiles, for example, use a weight
matrix approach to summarize the whole alignment. Blocks seeks out
conserved, un-gapped blocks of residues within alignments, which are then
converted to position-specific scoring matrices.
Fingerprints manually extracts highly specific, relatively short un-gapped
motifs from alignments and uses them to generate un-weighted scoring
matrices. All these methods use techniques such as aligning all pairs of
sequences, aligning sequences in arbitrary order or aligning sequences
following the branching order of a phylogenetic tree.
The power of multiple sequence analysis lies in the ability to draw
together related sequences from various species and express the degree of
similarity in a relatively concise format. There are many multiple alignment
databases which are accessible via the web.
The key steps in building a multiple alignment are:
(i) Find the sequences to align by database searching or by other means.
(ii) Locate the region of each sequence to be included in the alignment.
(iii) Assess the similarities within the set of sequences by comparing them
pairwise with randomizations.
(iv) Run the multiple alignment program.
(v) Manually inspect the alignment for problems.
(vi) Remove sequences that appear to disrupt the alignment seriously and
then realign the remaining subset.
(vii) After identifying key residues in the set of sequences that are
straightforward to align, attempt to add the remaining sequences to
the alignment so as to preserve the key features of the family.
Methods
Many methods are available for applying multiple sequence alignments of
known proteins to identify related sequences in database searches. Some
important methods are: profiles, Blocks, Fingerprints, PSI-BLAST and
Hidden Markov Models (HMMs).
Profiles
Proteins of similar function usually share identical motifs. Therefore, motif
prediction is more useful than trying to find similarity in the entire sequence
of the protein. Proteins of similar or comparable function are usually
descendants of a common ancestral protein. Often they share some amount of
similarity in the sequence, particularly in the motifs. A sequence alignment
usually reveals such families of proteins. Multiple alignments of this kind are
often called profiles.
A profile expresses the patterns inherent in a multiple sequence
alignment of a set of homologous sequences. They have several applications:
• They permit greater accuracy in alignments of distantly-related
sequences.
• Sets of residues that are highly conserved are likely to be part of the
active site, and give clues to function.
• The conservation patterns facilitate identification of other homologous
sequences.
• Patterns from the sequences are useful in classifying subfamilies
within a set of homology.
• Sets of residues that show little conservation, and are subject to
insertion and deletion, are likely to lie in surface loops; such regions
are likely to elicit antibodies that will cross-react well with the
native structure.
• Most structure-prediction methods are more reliable if based on a
multiple sequence alignment than on a single sequence. Homology
modeling, for example, depends crucially on correct sequence
alignment.
To use profile patterns to identify homologs, the basic idea is to match the
query sequences from the database against the sequences in the alignment
table, giving higher weight to positions that are conserved than to those that
are variable.
In the profiles database, there is a distilling of the sequence information
available within complete alignments into scoring tables or profiles. Profiles
define which residues are allowed at given positions, which positions are
highly conserved and which degenerate; and which positions or regions can
tolerate insertions.
Once multiple sequence alignment is performed, a portion of the
alignment which is highly conserved is then identified and a type of scoring
matrix called a profile is produced. A profile includes scores for amino acid
substitutions and gaps (matches, mismatches, insertions, deletions) in each
column of the conserved region so that an alignment of the region to a new
sequence can be determined.
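As a rough illustration of how a conserved region is turned into a scoring matrix, the sketch below builds a position-specific log-odds table from an un-gapped set of aligned segments and slides it along a new sequence. It assumes a uniform background frequency for the 20 amino acids and a simple pseudocount, and it omits the gap scores that a full profile carries; all names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column."""
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score_window(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, segment))


def best_match(profile, sequence):
    """Return (score, start) of the best-scoring window, or None."""
    width = len(profile)
    windows = range(len(sequence) - width + 1)
    return max(((score_window(profile, sequence[i:i + width]), i)
                for i in windows), default=None)
```

Conserved columns receive large positive scores for the conserved residue and negative scores for the others, so positions that are conserved weigh more heavily than variable ones.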
BLOCKS
The blocks concept is derived from the motif, the conserved stretch of amino
acids that confers specific function or structure to the protein. If motifs of a protein
family are aligned without introducing gaps in the sequences, we get blocks.
In the BLOCKS database, conserved motifs, or blocks, are located by
searching for spaced residue triplets and a block score is calculated using the
BLOSUM62 substitution matrix. The validity of blocks found by this method is
confirmed by the application of a second motif-finding algorithm, which
searches for the highest-scoring set of blocks that occur in the correct order
without overlapping. Blocks within a family are converted to position-
specific matrices which are used to make independent database searches.
Like the profiles, blocks represent conserved regions in the multiple
sequence alignment. Blocks differ from profiles in lacking insert and delete
positions in the sequences. Every column includes only matches and
mismatches (Substituted position without gaps).
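The un-gapped property of blocks can be shown with a short sketch that scans a gapped multiple alignment and keeps runs of columns containing no gap character in any sequence. This only illustrates the absence of insert and delete positions; it is not the motif-finding procedure used to build the BLOCKS database, and the function name and the min_width parameter are assumptions.

```python
def ungapped_blocks(alignment, min_width=3, gap="-"):
    """Return (start_column, [segments]) for each gap-free run of columns."""
    blocks = []
    start = None
    n_cols = len(alignment[0])
    # Iterate one column past the end so a trailing run is closed as well.
    for col in range(n_cols + 1):
        gap_free = col < n_cols and all(seq[col] != gap for seq in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, [seq[start:col] for seq in alignment]))
            start = None
    return blocks
```

Every column of a returned block contains only matches and mismatches, which is the property that distinguishes blocks from profiles.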
Fingerprints
Within a sequence alignment, it is usual to find not one, but several motifs
that characterize the aligned family. Diagnostically, it makes sense to use
many or all of the conserved regions to create a signature or fingerprint, so
that in a database search, there is a higher chance of identifying a distant
relative, whether or not all parts of the signature are matched. Protein
fingerprints are groups of motifs that represent the most conserved regions of
multiple sequence alignments.
PSI-BLAST
PSI-BLAST (Position-Specific Iterated BLAST) incorporates elements of both
pairwise and multiple sequence alignment methods. Following an initial
database search, PSI-BLAST allows automatic creation of position-specific
profiles from groups of results that match the query above a defined
threshold. Running the program several times can further refine the profile
and increase search sensitivity.
HMMs
A Hidden Markov Model (HMM) is a statistical model that considers all
possible combinations of matches, mismatches, and gaps to generate an
alignment of a set of sequences. A localized region of similarity, including
insertions and deletions, may also be modeled by an HMM.
HMMs are probabilistic models consisting of a number of
interconnecting states: they are essentially linear chains of match, delete or
insert states, which can be used to encode sequence conservation within
alignments. HMMs are the basis of the Pfam database.
An HMM is a computational structure for describing the subtle patterns
that define families of homologous sequences. HMMs are powerful tools for
detecting distant relatives, and for prediction of protein folding patterns.
HMMs include the possibility of introducing gaps into the generated
sequence, with position-dependent gap penalties and they carry out the
alignment and the assignment of probabilities together.
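How an HMM assigns the most probable chain of hidden states to an observed sequence can be sketched with the Viterbi algorithm. The example below is a generic log-space implementation for a small HMM with hypothetical states and probabilities; it is not a profile HMM with match, insert and delete states, and it assumes that every probability it uses is non-zero.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log-probability)."""
    # V[t][s]: best log-probability of any path ending in state s at step t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
          for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, best = max(
                ((p, V[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = best + math.log(emit_p[s][obs])
            pointers[s] = prev
        V.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1], V[-1][last]
```

With two states H and L that prefer the symbols 'h' and 'p' respectively, and a high probability of staying in the same state, the observation string "hhhppp" is decoded as three H states followed by three L states.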
Automatic Alignment
Central to sequence analysis is the multiple alignment. Consequently a vital
tool for the sequence analyst is an alignment editor. Several automatic
alignment programs are available now, either in a stand-alone form (such as
ClustalW) or as components of larger packages (such as Pileup in GCG). But
automatically calculated alignments almost invariably require some degree of
manual editing, whether to remove spurious gaps, to rescue residue windows,
or to correct misalignments. This often presents problems, as there is currently
no standard format for alignments.
Consequently, swapping between alignment programs is almost
impossible without the use of ad hoc scripts to convert between disparate
input and output formats. The advent of the object-oriented network
programming language, Java, addresses some of these problems. Java capable
browsers may run applets on a variety of platforms - applets are small
applications downloaded from a server via HTML pages; the software is loaded
on-the-fly from the server and cached for that session by the browser.
CLUSTAL
CLUSTAL performs a global multiple sequence alignment using the following
steps:
(i) Perform pairwise alignments of all of the sequences
(ii) Use the alignment scores to produce a phylogenetic tree
(iii) Align the sequences sequentially, guided by the phylogenetic
relationships indicated by the tree.
CLUSTAL approach exploits the fact that similar sequences are likely to
be evolutionarily related. It aligns sequences in pairs, following the
branching order of a family tree. Similar sequences are aligned first and more
distantly related sequences are added later. Once pairwise alignment scores
for each sequence relative to all others have been calculated, they are used to
cluster the sequences into groups which are then aligned against each other
to generate the final multiple alignment.
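The first two steps, pairwise comparison and construction of a guide tree, can be sketched as below. The sketch makes strong simplifying assumptions: the sequences are of equal length, so the fraction of differing positions stands in for a pairwise alignment score, and the tree is built by simple average-linkage clustering. It is not the CLUSTAL code, and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ (equal-length sequences assumed)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_distance(members_a, members_b, dist):
    """Mean pairwise distance between two clusters of sequence names."""
    pairs = [(a, b) for a in members_a for b in members_b]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)


def guide_tree(seqs):
    """seqs: dict name -> sequence. Returns (nested tuple tree, join order)."""
    dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    clusters = [((name,), name) for name in seqs]  # (members, subtree)
    joins = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(pair[0][0],
                                                     pair[1][0], dist))
        clusters.remove(a)
        clusters.remove(b)
        merged = (a[0] + b[0], (a[1], b[1]))
        joins.append(merged[1])
        clusters.append(merged)
    return clusters[0][1], joins
```

The join order returned by guide_tree is the order in which a progressive aligner would align sequences and groups: the most similar pair first, the most distant groups last.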
CLUSTAL has been revised many times. CLUSTAL W uses the
positioning of gaps in closely related sequences to guide the insertion of gaps
into those that are more distant. Similarly, information compiled during the
alignment process about the variability of the most similar sequences is used
to help vary the gap penalties on a residue and position specific basis.
CINEMA
CINEMA is a Colour Interactive Editor for Multiple Alignments, written in
Java: the program allows creation of sequence alignments by hand,
generation of alignments automatically (e.g. using ClustalW), and
visualization and manipulation of sequence alignments currently resident at
different sites on the Internet. In addition to its special advantage of allowing
interactive alignment over the web, CINEMA provides links to the primary
data sources, thereby giving ready access to up-to-date data, and a gateway
to related information on the Internet.
CINEMA is more than just a tool for colour-aided alignment preparation.
The program also offers facilities for motif modification; database searching
(using BLAST); 3D-structure visualization (where co-ordinates are available),
allowing inspection of conserved features of alignments in a 3D context;
generation of dotplots and hydropathy profiles; six-frame translation; and so
on. The program is embedded in a comprehensive help-file (written in HTML)
and is accessible both as a stand-alone tool from the DbBrowser Bioinformatics
Web Server, and as an integral part of the PRINTS protein fingerprint
database.
READSEQ
READSEQ is a very useful sequence format conversion tool. D.G. Gilbert from
the Biology Department of Indiana University, USA programmed this in 1990
to read the formatted sequence files and convert the sequence information in
the files into another file that has a different format. It automatically detects
many sequence formats (FASTA/Pearson, Intelligenetics/Stanford, GenBank,
NBRF, EMBL, GCG, DNA Strider, Fitch, PHYLIP V3.3, V3.4, PIR or CODATA,
MSF, ASN.1 and PAUP NEXUS) and inter-converts them.
Procedure
Open the internet browser and type the URL address:
http://www.bimas.cit.nih.gov/molbio.readsec/. Pull down the drop-down menu
and select the desired format. Paste the sequence in the text box. Press the
SUBMIT or RUN button.
6.8 ALGORITHMS FOR IDENTIFYING DOMAINS WITHIN A PROTEIN STRUCTURE

Zehfus (1994) proposed a method for identification of discontinuous domains
based on their ‘compactness’. The PUU (Parser for protein Unfolding Units)
algorithm attempts to maximize the interactions within each unit (domain)
while minimizing interactions between units. If a molecular dynamics
simulation is carried out on a molecule, the residues that have the most
correlated motion are likely to be part of a domain.
Therefore, a harmonic model is used to approximate inter domain
dynamics. Differences in fluctuation times can be used for domain
decomposition. However, a chain can cross over several times between units.
To solve this problem the residues are grouped by solving an eigenvalue
problem for the contact matrix – this reduces the problem to a one-
dimensional search for all reasonable trial bisections. Physical criteria are
used to identify units that could exist by themselves.
The DOMAK algorithm calculates a ‘split value’ from the number of
each type of contact when the protein is divided arbitrarily into two parts.
This split value is large when the two parts of the structure are distinct. The
Detective procedure for domain identification is based on the assumption that
each domain should contain an identifiable hydrophobic core. However, it is
possible that hydrophobic cores from different domains continue through the
interface region. In this algorithm core residues are defined as those residues
that occur in a regular secondary structure and have buried side chains that
form predominantly nonpolar contacts with one another.
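The idea of a split value can be illustrated with a small sketch. The text does not give the formula, so the sketch assumes the form (int_A / ext_AB) x (int_B / ext_AB), where int_A and int_B count contacts inside each part and ext_AB counts contacts between the parts. The contact definition (C-alpha atoms within an assumed cutoff of 8 angstroms) and the restriction to a single cut point along the chain are further simplifying assumptions, and all names are hypothetical.

```python
import math


def contact_pairs(coords, cutoff=8.0, min_separation=2):
    """Residue pairs (i < j) whose coordinates lie within the cutoff."""
    n = len(coords)
    return [(i, j) for i in range(n) for j in range(i + min_separation, n)
            if math.dist(coords[i], coords[j]) <= cutoff]


def split_value(contacts, cut):
    """Assumed split value for parts [0, cut) and [cut, n)."""
    int_a = sum(1 for i, j in contacts if j < cut)    # both residues in A
    int_b = sum(1 for i, j in contacts if i >= cut)   # both residues in B
    ext_ab = len(contacts) - int_a - int_b            # contacts across the cut
    if ext_ab == 0:
        return math.inf
    return (int_a / ext_ab) * (int_b / ext_ab)


def best_cut(contacts, n_residues, min_size=3):
    """Cut position with the largest split value."""
    cuts = range(min_size, n_residues - min_size + 1)
    return max(cuts, key=lambda c: split_value(contacts, c))
```

When the two parts make many internal contacts and few contacts with each other, the value is large, which matches the behaviour described for distinct parts of a structure.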
An algorithm based on dividing the chain to minimize the density of
inter-domain contacts has also been proposed. Another algorithm based on
cluster analysis of secondary structure has also been suggested for
identification of domains in protein structures. A consensus method that was
based on the assignments from the three independent algorithms for domain
recognition (Detective, PUU and DOMAK) was found to give better accuracy
than any of the individual algorithms that were tested.
Strudl (STRUctural domain limits) uses a Kernighan-Lin graph heuristic
to partition the protein into residue sets that display minimal interactions
between the sets. The graph specifies the connectivity between the nodes and
is represented by a matrix. Starting from a reasonable partition the algorithm
minimizes a cost function that is based on interactions between the nodes.
This is carried out by swapping pairs of nodes until an optimal partitioning
is obtained. The interactions are deduced from the weighted Voronoi
diagram.
6.9 ALGORITHMS FOR STRUCTURAL COMPARISON

A wide variety of methods that make use of graph theory, distance matrices,
dynamic programming, Monte Carlo, molecular dynamics, maximum
likelihood criteria, etc. have been proposed for structural comparison.
Double dynamic programming method for structure alignment requires
two matrices. In protein structure comparison by alignment of distance
matrices (DALI) the three-dimensional coordinates of each protein are used
to calculate residue-residue distance matrices. The distance matrices are
decomposed into elementary contact patterns, e.g. hexapeptide-hexapeptide
similarities. Similar contact patterns are paired and combined into larger
consistent sets of pairs. The alignments are evaluated by defining a similarity
score. Unmatched residues do not contribute to the overall score. The
primary advantage of this method is that it does not depend on the
topological connectivity of the aligned segments. In addition, this algorithm
tolerates sequence gaps of any length and chain reversals. It is fully
automated and all structural classes can be treated with the same set of
parameters.
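The starting point of the distance matrix approach can be sketched as follows: a residue-residue distance matrix is computed from C-alpha coordinates, and a 6 x 6 sub-matrix is cut out as an elementary contact pattern. The comparison function shown here (mean absolute difference between two patterns) is only a stand-in chosen for illustration; it is not the similarity score used by DALI, and all names are hypothetical.

```python
import math


def distance_matrix(ca_coords):
    """Pairwise distances between C-alpha coordinates (x, y, z)."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def contact_pattern(matrix, i, j, size=6):
    """Sub-matrix relating the fragment starting at i to the one at j."""
    return [row[j:j + size] for row in matrix[i:i + size]]


def pattern_difference(p, q):
    """Mean absolute difference between two equally sized patterns."""
    cells = [abs(x - y) for rp, rq in zip(p, q) for x, y in zip(rp, rq)]
    return sum(cells) / len(cells)
```

Because distances are unchanged by rotation and translation, two copies of the same structure in different orientations give identical matrices, so no superposition is needed before the patterns are compared.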
The Combinatorial Extension of the optimal path (CE) algorithm is based on
the concept of an aligned fragment pair (AFP). An aligned fragment pair
consists of two structurally similar fragments, one from each structure. The
similarity is defined based on local geometry and not on global features such
as orientation of secondary structures or topology. If a combination of AFPs
represents a continuous alignment path, an attempt is made to extend it
further; otherwise it is discarded. By considering different combinations of
AFPs in this manner, a single optimal alignment is created.
Vector alignment search tool (VAST) is used for pairwise structural
alignment. A unit of tertiary structural similarity is defined as pairs of
secondary structural elements (SSE) that have similar type, relative orientation
and connectivity. In comparing two domains, the sum of the superposition
scores across these units is calculated.
6.10 CARRYING OUT A SEQUENCE SEARCH

One of the important aims of bioinformatics is the prediction of protein
function, and ultimately of structure, from the linear amino acid sequence.
Given a newly determined sequence, one wants to know: what is this protein?
To what family does it belong? What is its function? And how can one explain
its function in structural terms? By searching secondary databases, which
store abstractions of functional and structural sites characteristic of particular
proteins, one can recognize patterns that allow one to infer relationships with
previously characterized families. Similarly, by searching fold libraries, which
contain templates of known structures, it is possible to recognize a previously
characterized fold.
Given the size of existing sequence databases, it is likely that searches
with new sequences will uncover homologues; and, with the expansion of
sequence pattern and structure template databases, the chances of assigning
functions and inferring possible fold families are also improving. However,
these advances in sequence and fold pattern recognition methods have not
yet been matched by similar advances in prediction techniques. So if one
cannot predict function or structure directly from sequence, but can identify
homologues and recognize sequence and fold patterns that have already
been seen, given the bewildering array of databases to search, how does one
use this information to build a sensible search method for novel sequences?
Essentially, one has to check identical matches and then move on to
search for closely similar sequences in the primary databases. The strategy
then involves searching for previously characterized sequence – and, where
possible, fold patterns in a variety of pattern databases. The final step is the
integration of results from all these searches to build a consistent family/
functional/structural diagnosis. An interactive www tutorial, known as
BioActivity, can be found at:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/prefacefrm.html
The first and fastest test to identify an unknown protein sequence
fragment is to perform an identity search, preferably of a composite sequence
database. OWL is a composite resource that can be queried directly by means
of its query language. Identity searches, which are suitable for peptides up to
30 residues in length, are possible via web interface; this provides an easy-to-
use form that conveniently shields the user from the syntax of the query
language. An identity search will reveal in a matter of seconds whether an
exact match to the unknown peptide already exists in the database. The
following website is useful:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/nucleicfrm.html
If an identity search fails to find a match, the next step is to look for
similar sequences, again preferably in a composite database. For best results
it is recommended to perform similarity searches on peptides that are longer
than 30 residues (the shorter the peptide, the greater the likelihood of finding
chance matches that have no biological relevance). In most applications as
much sequence information as possible should be used in a BLAST search
(although this can lead to complications in interpreting output from searches
with multi-domain or modular proteins).
There are several important features to note in the BLAST output. First,
one is looking for matches that have high scores with correspondingly low
probability values. A very low probability indicates that a match is unlikely to
have arisen by chance. As the probability values approach unity, they are
considered more and more likely to be random matches. The second feature
of interest is whether the results show a cluster of high scores (with low
probabilities) at the top of the list, indicating a likely relationship between the
query and the family of sequences in the cluster.
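Current BLAST output usually reports an expect value (E) for each hit rather than a probability. The two are related by P = 1 - exp(-E), so for small values they are almost equal. The sketch below converts between them and keeps only the hits below a chosen cutoff; the hit tuples (name, score, E) and the cutoff of 0.001 are hypothetical values used for illustration.

```python
import math


def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    return -math.expm1(-e_value)


def likely_homologues(hits, max_e=1e-3):
    """Keep (name, score, e_value) hits with E <= max_e, best score first."""
    kept = [h for h in hits if h[2] <= max_e]
    return sorted(kept, key=lambda h: h[1], reverse=True)
```

As E grows, the probability approaches unity, which corresponds to the statement above that such matches are increasingly likely to be random.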
Heuristic search tools like BLAST do not always give clear-cut answers.
Frequently the program will not be able to assign significant scores to any of
its retrieved matches, even if a biologically relevant sequence appears in the
hit-list. Such search tools do not have the sensitivity always to fish out the
right answer from the vast amount of sequences in the primary database;
rather, they cast a coarse net, and it is then up to the user to pick out the best.
Under these circumstances, where no individual high-scoring sequence
or cluster of sequences, is found, the third feature to consider is whether
there are any observable trends in the type of sequences matched, i.e. do the
annotations suggest that several of these are from a similar family? If there
are possible clues in the annotations, the next step is to try to confirm these
possibilities both by reciprocal BLAST searches (do retrieved matches
identify the sequence in a similarity search?), and by comparing results from
searches of the secondary databases.
The first secondary database to consider is PROSITE. Within the tutorial, this
is accessible for searching via the ‘Protein sequence analysis-Secondary
database searches’ page:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
The database code is simply supplied to the relevant part of the form
and the option to exclude patterns with a high probability of occurrence (i.e.
rules) is switched on.
The next step is to search the ISREC profile library. In addition to the
profiles that have already been incorporated into the main body of PROSITE,
the web server offers a range of pre-release profiles that have not yet been
sufficiently documented for release through PROSITE. Searching the complete
collection of profiles is achieved, once again, by simply supplying the database
code to the web form, remembering to change the format button from the
default (plain text) to accept a SWISS-PROT ID:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
Another important resource to search is the Pfam collection of Hidden
Markov Models. Searching is achieved via a web interface that requires the
query sequence to be supplied to a text box: http://www.bioinf.man.ac.uk/
The sequence must be in FASTA format, which means that the query
must be preceded by the > symbol and a suitable sequence name.
Another key secondary resource is PRINTS, which provides a bridge
between single-motif search methods, such as the one used to compile
PROSITE, and domain-alignment/profile methods, such as those embodied in
the profile library and Pfam. PRINTS is accessible for searching via the
‘Protein sequence analysis – protein fingerprinting’ page:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein2frm.html
The output is divided into distinct sections; first, the program offers an
intelligent ‘guess’ based on the occurrence of the highest-scoring complete or
partial fingerprint match or matches; it then provides an expanded calculation
that shows the top 10 best-scoring matches-clearly; these include the
intelligent results from the previous analysis, but the additional matches are
provided to highlight why the best guess was chosen, and to allow a different
choice, if the guess is considered either to be wrong or to have missed
something; the remaining sections of output provide more of the new data,
again allowing the users to search for anything that might have been missed.
A particularly valuable aspect of this software is the facility to visualize
individual fingerprint matches by clicking on the graphic box.
The next secondary resources to be searched are the BLOCKS databases,
derived from PROSITE and PRINTS. If results matched in PROSITE and/or
PRINTS are true-positive, then we would expect these to be confirmed by the
BLOCKS search results. The BLOCKS databases are searched by supplying the
query sequence to the input box of the relevant web form:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
One must remember in each case to switch to the required database.
The accession codes in the Block column indicate the number of motifs;
matches to these motifs are ranked according to score. The ‘rank’ of the best-
scoring block, the so-called anchor block is reported. Where additional blocks
support the anchor block by matching with high scores in the correct order, a
probability value is calculated, reflecting the likelihood of these matches
appearing together in that order. Often results are littered with matches to
high-scoring individual blocks. These matches are usually the result of
chance, and p-values are not calculated. The information content of particular
blocks can be visualized by examination of the sequence logo.
A sequence logo is a graphical display of a multiple alignment consisting
of colour-coded stacks of letters representing amino acids at successive
positions. The height of a given letter increases with increasing frequency of
the amino acid, and the height of the whole stack increases with increasing
conservation of the
aligned position; hence, letters in stacks with single residues (i.e. representing
conserved positions) are taller than those in stacks with multiple residues (i.e.
where there is more variation).
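The letter heights in a logo can be computed from a single alignment column as sketched below. The sketch assumes the usual information-content definition (log2 of the alphabet size minus the Shannon entropy of the column, in bits), applies no small-sample correction and no sequence weighting, and uses a hypothetical function name.

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=20):
    """Return {residue: height in bits} for one alignment column."""
    counts = Counter(column)
    n = len(column)
    freqs = {aa: c / n for aa, c in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy  # total stack height
    return {aa: f * information for aa, f in freqs.items()}
```

A fully conserved column yields a single letter of about 4.32 bits, whereas a column split evenly between two residues yields two letters of about 1.66 bits each, so the stack as a whole is shorter.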
Within stacks, the most frequently occurring residues are not only taller,
but also occupy higher positions in the stack, so that the most prominent
residue at the top is the one predicted to be the most likely to occur at that
position. To address the problem of sequence redundancy within blocks, which
strongly biases residue frequencies, sequence weights are calculated using a
position-specific scoring matrix (PSSM). This reduces the tendency for
over-represented sequences to dominate stacks, and increases the representation of
rare amino acids relative to common ones.
The final resource is IDENTIFY, which is searched by supplying the
query sequence to the relevant web form: http://www.bioinf.man.ac.uk/
We can find out more about the structure, either by following the links
embedded in the PROSITE and PRINTS entries or by supplying a relevant
PDB code in the query forms of the structure classification resources (such as
SCOP and CATH). SCOP is accessible for searching via the ‘protein structure
analysis–structure classification resources’ page:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/structurefrm.html
The CATH resource is queried by supplying the desired PDB html code
to the relevant form on the same web page. Clicking on the hyperlinked PDB
code in the CATH summary takes one to the PDBsum resource, a web based
collection of information for all PDB structures. The picture of the overall
fold and secondary structure of the molecule is available here. Using this
pictorial information, one can begin to rationalize the results of the secondary
database searches in terms of structural and functional features of the 3D
molecule, essentially by superposing the motifs matched in PROSITE,
PRINTS and BLOCKS on to the sequence.
STUDY QUESTIONS
1. What is sequence alignment?
2. What are the goals of sequence alignment?
3. What are the types of sequence alignments?
4. How is dotplot analysis performed?
5. How is pairwise comparison done?
6. How are mutations, deletions and substitutions scored?
7. Which programs are used for pairwise database searching?
8. What is multiple sequence alignment?
9. Enumerate the key steps in building multiple alignment
10. Which are the programs used in multiple alignment?
11. How can one carry out a sequence search?
12. What is a string?
13. What is Hamming distance?
14. What is Levenshtein (edit) distance?
CHAPTER 7
Predictive Methods using DNA and Protein Sequences
Since sequencing whole genomes has been achieved with greater ease today,
deriving biological meaning from the long sequences of nucleotides that are
obtained through sequencing becomes a crucial biological research problem.
Annotation is a word that is commonly used today to mean ‘deriving useful
biological information’ from raw elements in genomic DNA (structural
annotation) and then assigning functions to these sequences (functional
annotation).
With the advent of whole-genome sequencing projects, there is
considerable use for computer programs that scan genomic DNA sequences to
find genes, particularly those that encode proteins. Once a new genome
sequence has been obtained, the most likely protein-encoding regions are
identified and the predicted proteins are then subjected to a database similarity
search.
Prediction is an important component of bioinformatics. Assignment of
structures to gene products is a first step in understanding how organisms
implement their genomic information. Prediction helps to understand the
structures of the molecules encoded in a genome, their individual activities
and interactions and the organization of these activities and interactions in
space and time during the lifetime of the organism.
7.1 GENE PREDICTION STRATEGIES

Because the proteins present in a cell largely determine cell shape, role and
physiological properties, one of the first orders of business in genome analysis
is to determine the polypeptides encoded by an organism’s genome. To
determine the list of polypeptides, the structure of each mRNA encoded by the
genome must be deduced.
Bioinformatics uses several independent sets of information to predict
the most likely sequence for mRNA and polypeptide coding regions. The sets
of information are: cDNA sequences, docking site sequences marking the start
and end points for transcription, pre-mRNA splicing and translation,
sequences of related polypeptides, and species-specific usage preferences for
some codons over others encoding the same amino acid.
Figure 7.1 depicts how different sources of information are combined to
create the best possible mRNA predictions. Predictions of mRNA and
polypeptide structure from genomic DNA sequence depend on an integration
of information from cDNA sequence, docking site predictions, polypeptide
similarities and codon bias.
Categories
Gene finding strategies can be grouped into three categories, namely, content-
based, site-based and comparative. Content-based methods rely on the overall,
bulk properties of a sequence in making a determination. Characteristics
considered here include how often particular codons are used, the periodicity
of repeats, and the compositional complexity of the sequence. Because different
organisms use synonymous codons with different frequency, such clues can
provide insight into determining regions that are more likely to be exons.
[Figure 7.1: schematic of a predicted gene (promoter site, 5′ UTR, open reading frame (ORF) with exons and introns, splice sites, translation termination site, polyadenylation site, 3′ UTR) annotated with the supporting evidence: predictions from protein (BLAST similarity), codon bias, sequence motifs, predictions from mRNA and its properties (cDNA, EST), and predictions from docking site analysis programs.]
Fig. 7.1 The different forms of gene product evidence – cDNAs, ESTs, BLAST similarity hits,
codon bias, and motif hits – are integrated to make gene predictions. Where multiple
classes of evidence are found to be associated with a particular genomic DNA sequence,
there is a greater confidence in the likelihood that a gene prediction is accurate. (Source:
A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
Site-based methods focus on the presence or absence of a
specific sequence, pattern, or consensus. These methods are used to detect
features such as donor and acceptor splice sites, binding sites for transcription
factors, poly A tracts and start and stop codons. Comparative methods make
determinations based on sequence homology. Here, translated sequences are
subjected to database searches against protein sequences to determine whether
a previously characterized coding region corresponds to the region in the query
sequence.
The simplest method of finding DNA sequences that encode proteins is
to search for open reading frames. An ORF is a length of DNA sequence that
contains a contiguous set of codons, each of which specifies an amino acid.
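The ORF search described above can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the text: an ORF is taken to begin at an ATG codon and to end at the first in-frame stop codon (TAA, TAG or TGA), only the forward strand is scanned, and the function name and the `min_codons` parameter are hypothetical.

```python
# Toy ORF scanner (assumptions: ATG start, standard stop codons, forward strand only).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the 0-based index of the A of ATG; end is the index just past
    the stop codon. min_codons counts the coding codons (start codon
    included, stop codon excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A fuller scan would also examine the three reading frames of the reverse complement, and real gene finders combine such ORF scans with the content-based and site-based signals described earlier.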
Prediction of RNA Secondary Structure

Sequence variations in RNA molecules maintain base-pairing patterns that give
rise to double-stranded regions (secondary structure) in the molecule. Thus,
alignments of two sequences that specify the same RNA molecule will show
covariation at interacting base-pair positions. In addition to these covariable
positions, sequences of RNA-specifying genes may also have rows of similar
sequence characters that reflect the common ancestry of the genes.
Computational methods are available for predicting the most likely
regions of base-pairing in an RNA molecule. Methods for predicting the
structure of RNA molecules include: (i) an analysis of all possible
combinations of potential double-stranded regions by energy minimization
methods and (ii) identification of base covariation that maintains secondary
and tertiary structure of an RNA molecule during evolution. (Covariation
analysis led to the prediction of three domains of life – the Bacteria, the
Eukarya and the Archaea – by C. Woese.)
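The idea behind covariation analysis can be sketched in a few lines. The example below is only an illustration under stated assumptions: the alignment is gap-free, G-U wobble pairs are counted as valid pairs alongside the Watson-Crick pairs, and the function name is hypothetical. It reports pairs of alignment columns whose bases change between sequences while always remaining able to pair.

```python
from itertools import combinations

# Assumption: Watson-Crick pairs plus G-U wobble pairs count as "pairing".
PAIRING = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def covarying_pairs(alignment):
    """Return column pairs (i, j) that vary across sequences yet always pair."""
    n_cols = len(alignment[0])
    hits = []
    for i, j in combinations(range(n_cols), 2):
        observed = {(seq[i], seq[j]) for seq in alignment}
        # More than one distinct pair means the columns vary; the subset
        # test means every observed combination can still base-pair.
        if len(observed) > 1 and observed <= PAIRING:
            hits.append((i, j))
    return hits

toy_alignment = ["GAAAC", "CAAAG", "AAAAU"]
print(covarying_pairs(toy_alignment))
```

In this toy alignment the first and last columns change together (G-C, C-G, A-U), which is the signature of a conserved base pair.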
7.1.1 Gene Prediction Programs

There are many commonly used methods, which are freely available in the
public domain. GRAIL1 (Gene Recognition and Analysis Internet Link) makes
use of a neural network method to recognize coding potential in fixed-length
(100 base) windows considering the sequence itself, without looking for
additional features such as splice junctions or start or stop codons.
GRAIL2 uses variable-length windows. GRAIL-EXP uses additional
information in making the prediction, including a database search of known
complete and partial gene messages. FGENEH is a method that predicts
internal exons by looking for structural features such as donor and acceptor
splice sites, putative coding regions and intronic regions both 5’ and 3’ to the
putative exon, using linear discriminant analysis.
FGENES, an extension of FGENEH, is used in cases when multiple genes
are expected in a given stretch of DNA. MZEF uses quadratic discriminant
analysis to predict internal coding exons. GENSCAN predicts complete gene
structures. It can identify introns, exons, promoter sites, and polyA signals. It
relies on a probabilistic model. The GenomeScan program, an extension of
GENSCAN, assigns a higher score to putative exons that are supported by
similarity to known proteins. The PROCRUSTES program takes genomic DNA
sequences and forces them to fit into a pattern as defined by a related target protein.
GeneID finds exons based on measures of coding potential. GeneParser
uses a neural network approach to determine whether each subinterval
contains a first exon, internal exon, final exon or intron. HMMgene predicts
whole genes in any given DNA sequence. Different methods produce different
types of results. No one program provides the foolproof key to computational
gene identification.
Web Addresses
FGENEH : http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID : http://www1.imim.es/geneid.html
GeneParser : http://beagle.colorado.edu/~eesnyder/GeneParser.html
GENSCAN : http://genes.mit.edu/GENSCAN.html
GRAIL : http://compbio.ornl.gov/tools/index.shtml
GRAIL-Exp : http://compbio.ornl.gov/grailexp/
HMMgene : http://www.cbs.dtu.dk/services/HMMgene/
MZEF : http://www.cshl.org/genefinder
PROCRUSTES : http://www-hto.usc.edu/software/procrustes
7.2 PROTEIN PREDICTION STRATEGIES

One of the major goals of bioinformatics is to understand the relationship
between amino acid sequence and three dimensional structures of proteins. If
this relationship is known, then the structure of a protein could reliably be
predicted from the amino acid sequence. Prediction of these structures from
sequence is possible using presently available methods and information.
The alphabet of 20 amino acids found in proteins allows for much greater
diversity of structure and function than the four-letter alphabet of nucleic
acids, primarily because the differences in the
chemical makeup of these residues are more pronounced. Each residue can
influence the overall physical properties of the protein because these amino
acids are either basic or acidic, hydrophobic or hydrophilic and have straight
chains, branched chains or are aromatic. Thus, each residue has a certain
propensity to form structures of different types in the context of a protein domain
(sequence-specific conformation).
The first step in predicting the three-dimensional shape of a protein is
determining what regions of the backbone are likely to form helices, strands,
and beta turns, the U-turn like structures formed when a beta strand reverses
direction in an antiparallel beta sheet.
Prediction Methods
Modeling the structure of biological macromolecules allows us to gain a great
deal of insight into the molecule’s functional features. Modeling unknown
protein structures based on their homologs is known as homology-based
structural modeling. In this type of modeling, the experimentally determined
structures are generally referred to as the ‘templates’ and the sequence
homolog (a novel one) that lacks structural coordinates is called the ‘target’
sequence.
The homology-based protein modeling approach entails four sequential
steps. The first step involves the identification of known structures that are
related in sequence to the target sequence using BLAST. In the second step, the
potential templates are aligned with the target sequence to identify the closest
related template. In the third step, a model of the target sequence is calculated
from the most suitable template in step two. The fourth step involves the
evaluation of the modeled target sequence using different criteria.
The knowledge of evolutionarily conserved structural features of similar
proteins from other species enables us to gain insight into the structure of the
target sequence.
The observation that each protein folds spontaneously into a unique three-
dimensional native conformation implies that nature has an algorithm for
predicting protein structure from amino acid sequence. Some attempts to
understand this algorithm are based solely on general physical principles;
others are based on observations of known amino acid sequences and protein
structures. A proof of our understanding would be the ability to reproduce the
algorithm in a computer program that could predict protein structure from
amino acid sequence.
Most attempts to predict protein structure from basic physical principles
alone try to reproduce the inter-atomic interactions in proteins, to define a
compatible energy associated with any conformation. Computationally, the
problem of protein structure prediction then becomes a task of finding the
global minimum of this conformational energy function. So far this approach
has not succeeded, partly because of the inadequacy of the energy function
and partly because the minimization algorithms tend to get trapped in local
minima.
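The local-minimum problem can be seen even in one dimension. The sketch below uses a made-up double-well function in place of a real conformational energy function (the function, step size and iteration count are all illustrative assumptions) and applies plain gradient descent from two starting points.

```python
def energy(x):
    # Made-up double-well function; NOT a real protein energy function.
    return x**4 - 3 * x**2 + x

def gradient(x):
    # Analytical derivative of the toy energy above.
    return 4 * x**3 - 6 * x + 1

def minimize(x, step=0.01, iterations=5000):
    """Plain gradient descent: always moves downhill from the start point."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x

for start in (-2.0, 2.0):
    x_min = minimize(start)
    print(f"start {start:+.1f} -> x = {x_min:.3f}, E = {energy(x_min):.3f}")
```

The two runs settle in different wells, and only the run started at -2.0 reaches the lower one. A downhill-only search has no way to know that a deeper minimum exists elsewhere, which is the difficulty described above, multiplied over the many degrees of freedom of a protein.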
The alternative to a priori methods is the approach based on assembling
clues to the structure of a target sequence by finding similarities to known
structures. These are empirical or knowledge-based methods.
The Ramachandran Plot

The Ramachandran plot (also known as a Ramachandran diagram or a phi (Φ), psi
(Ψ) plot) was originally developed in 1963 by G.N. Ramachandran, C.
Ramakrishnan and V. Sasisekharan. It is a way to visualize the backbone
dihedral (torsional) angles phi against psi of amino acid residues in a protein
structure. It shows the possible conformations of the phi and psi angles for a
polypeptide. The omega (ω) angle at the peptide bond is normally 180° since the
partial double-bond character keeps the peptide planar. The backbone
conformation of an entire protein can therefore be specified in terms of the phi and psi
angles of each amino acid.
The torsional angles of each residue in a peptide define the geometry of
its attachment to its two adjacent residues by positioning its planar peptide
bond relative to the two adjacent planar peptide bonds. The torsional
angles thereby determine the conformation of the residues and the peptide.
In sequence order, phi is the (C(i-1), N(i), Cα(i), C(i)) torsion angle and psi
is the (N(i), Cα(i), C(i), N(i+1)) torsion angle. Ramachandran plotted phi values
on the X-axis and psi values on the Y-axis. Plotting the torsional angles in this
way graphically shows which combinations of angles are possible.
The Ramachandran plot is used to judge the quality of a model by finding residues
that are in unlikely or high-energy conformations (Figure 7.2).
[Figure 7.2: Ramachandran plot with ψ on the vertical axis and φ on the horizontal axis, each running from –180 to 180; the αL region and glycine residues (G) are marked.]
Fig. 7.2 Sasisekharan – Ramakrishnan – Ramachandran plot of acylphosphatase (PDB
code 2ACY). Note the clustering of residues in the α and β regions, and that most of the
exceptions occur in glycine residues (labelled G).
A Ramachandran plot can be used in two somewhat different ways. One
is to show in theory which values or conformations of the phi and psi angles
are possible for an amino acid residue in a protein. A second is to show the
empirical distribution of data points observed in a single structure, as used for
structure validation, or in a database of many structures.
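Each point on a Ramachandran plot comes from two torsion-angle calculations on backbone atom coordinates. The sketch below shows one common way to compute a torsion angle from four points. The helper names are hypothetical, and it assumes the standard atom order: phi from C(i-1), N(i), Cα(i), C(i) and psi from N(i), Cα(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees (-180..180) about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def phi(c_prev, n, ca, c):
    return dihedral(c_prev, n, ca, c)

def psi(n, ca, c, n_next):
    return dihedral(n, ca, c, n_next)

# Idealized geometry: a cis arrangement gives 0 degrees, a trans one 180.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))
```

Applying `phi` and `psi` to every residue of a structure and plotting the resulting pairs gives the empirical form of the plot described above.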
Bridging the Sequence Structure Gap

An understanding of structure leads to an understanding of function and
mechanism of action. There is a big gap between the number of known sequences and the number of known
structures. This gap is called the sequence-structure gap, and it is the main motivation for
the prediction of protein structure. Structure prediction means making a
prediction of the relative position of every protein atom in three-dimensional
space using only information from the protein sequence.
Structure prediction is done using categories like comparative modeling,
fold recognition, secondary structure prediction, ab initio prediction and
knowledge-based prediction. Knowledge-based methods attempt to predict
protein structure using information taken from the database of known
structures.
If a sequence of unknown structure (target sequence) can be aligned with
one or more sequences of known structure to show at least 25% identity in an
alignment of 80 or more residues, then the known structure (template
structure) can be used to predict the structure adopted by the target sequence,
using multiple alignment tools. This is comparative modeling (homology
modeling). It produces a full atom model of tertiary structure.
When suitably related template structures do not exist for a particular
target sequence, secondary structure prediction is an alternative. It provides a
prediction of the secondary structure state of each residue: helix,
strand (extended) or coil. The predictions are sometimes known as three-state
predictions.
Fold-recognition (threading) methods detect distant relationships and
separate them from chance sequence similarities not associated with a shared
fold. They operate by searching through a library of known protein structures
and finding the one most compatible with query sequence whose structure is
to be predicted. Once the alignment between the sequence and the distantly
related known structures has been obtained, a full three-dimensional structure
of the protein to be predicted can be obtained.
Ab initio methods attempt to predict protein structures from first
principles using theories from the physical sciences like statistical thermo-
dynamics and quantum mechanics. Of all these methods, comparative
modeling is the most accurate and comprehensive structure prediction
method.
7.2.1 Secondary Structure Prediction

Accurate prediction as to where α-helices, β-strands and other secondary
structures will form along the amino acid chain of proteins is one of the
greatest challenges in sequence analysis. Methods of structure prediction from
amino acid sequence begin with an analysis of a database of known
structures. These databases are examined for possible relationships between
sequence and structure.
The ability to predict secondary structure also depends on identifying
types of secondary structural elements in known structures and determining
the location and extent of these elements. The main types of secondary
structures that are examined for sequence variation are α-helices, β-strands
and coils.
The basic assumption in all secondary structure prediction is that there
should be a correlation between amino acid sequence and secondary structure.
The usual assumption is that a given short stretch of sequence is more likely to
form one kind of secondary structure than another.
Two early methods based on secondary structure propensity were those
of Chou and Fasman and GOR (Garnier-Osgathorpe-Robson). These were
based on the local amino acid composition of single sequences. Later the use of
evolutionary information from multiple alignments improved the accuracy of
secondary structure prediction methods significantly since during evolution,
structure is much more strongly conserved than sequence.
Widely used methods of protein secondary structure prediction are: (i) the
Chou-Fasman and GOR methods, (ii) neural network models and (iii)
nearest-neighbor methods.
Chou-Fasman Method
Chou-Fasman method is based on the assumption that each amino acid
individually influences secondary structure within a window of sequence. It is
based on analyzing the frequency of each of the 20 amino acids in α-helices,
β-strands and turns. To predict a secondary structure, the following set of rules
is used.
The sequence is first scanned to find a short sequence of amino acids that
has a high probability for starting a nucleation event that could form one type
of structure. For α-helices, a prediction is made when four of six amino acids
have a high probability of > 1.03 of being in an α-helix. For β-strands, a
prediction is made when three of five amino acids in a sequence have a probability of > 1.00
of being in a β-strand. These nucleated regions are extended along the
sequence in each direction until the prediction values for four amino acids
drop below 1. If both α-helix and β-strand regions are predicted for the same segment, the higher
probability prediction is used.
Turns are predicted a little differently. Turns are modeled as a
tetrapeptide, and two probabilities are calculated. First, the average of the
probabilities for each of the four amino acids being in a turn is calculated as
for the α-helix and β-strand prediction. Second the probabilities of amino acid
combinations being present at each position in the turn of tetrapeptide are
determined.
These probabilities for the four amino acids in the candidate sequence
are multiplied to calculate the probability that the particular tetrapeptide is a
turn. A turn is predicted when the first probability value is greater than the
probabilities for an α-helix and β-strand in the region and when the second
probability value is greater than 7.5 × 10⁻⁵.
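The helix nucleation rule (four of six residues above 1.03) can be sketched as a sliding-window scan. This is a simplified illustration, not the full Chou-Fasman procedure: it covers nucleation only, without extension or turn prediction, and it borrows the helical propensities listed in Table 7.1 of this chapter, which may differ from the original Chou-Fasman parameters. The function name and defaults are hypothetical.

```python
# Helical propensities copied from Table 7.1 (one-letter codes); an assumption
# of this sketch is that they can stand in for the Chou-Fasman parameters.
HELIX_PROPENSITY = {
    "E": 1.59, "A": 1.41, "L": 1.34, "M": 1.30, "Q": 1.27,
    "K": 1.23, "R": 1.21, "H": 1.05, "V": 0.90, "I": 1.09,
    "Y": 0.74, "C": 0.66, "W": 1.02, "F": 1.16, "T": 0.76,
    "G": 0.43, "N": 0.76, "P": 0.34, "S": 0.57, "D": 0.99,
}

def helix_nuclei(seq, window=6, needed=4, cutoff=1.03):
    """Return start indices of windows holding enough helix-favoring residues."""
    nuclei = []
    for i in range(len(seq) - window + 1):
        formers = sum(HELIX_PROPENSITY[aa] > cutoff for aa in seq[i:i + window])
        if formers >= needed:
            nuclei.append(i)
    return nuclei

print(helix_nuclei("PGEALKPG"))
```

Each reported index marks a six-residue window that would seed a helix; the extension step described above would then grow the region in both directions.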
GOR Method
GOR method is based on the assumption that amino acids flanking the central
amino acid residue influence the secondary structure that the central residue is
likely to adopt. It uses the principles of information theory to derive
predictions. Known secondary structures are scanned for the occurrence of
amino acids in each type of structure. The frequency of each type of amino acid
at the next 8 amino-terminal and carboxy-terminal positions is also
determined, making the total number of positions examined equal to 17,
including the central one.
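The 17-position window can be made concrete with a short sketch. It only extracts the windows, since the information values that GOR assigns to each position are not given here. The padding character used at the chain ends and the function name are assumptions.

```python
def gor_windows(seq, flank=8, pad="-"):
    """Return one window of 2*flank+1 residues centred on each position.

    Positions beyond either end of the chain are filled with the pad
    character (an assumption of this sketch).
    """
    padded = pad * flank + seq + pad * flank
    return [padded[i:i + 2 * flank + 1] for i in range(len(seq))]

windows = gor_windows("ACDEFGHIKLMNPQRSTVWY")
print(len(windows[0]), windows[0], windows[10])
```

In a full implementation, the residue found at each of the 17 offsets would contribute a precomputed information value to the score of each secondary structure state for the central residue.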
Neural Network Prediction
In the neural network approach, computer programs are trained to be able to
recognize amino acid patterns that are located in known secondary structures
and to distinguish these patterns from other patterns not located in these
structures. In theory, these neural network models can extract more
information from sequences than the simpler statistical methods. PHD and NNPREDICT are two neural network
programs. Neural network models are meant to simulate the operation of the
brain.
Nearest-neighbor Prediction
Like neural networks, nearest-neighbor methods are also a type of
machine learning method. They predict the secondary structural conformation
of an amino acid in the query sequence by identifying sequences of known
structures that are similar to the query sequence. A large list of short sequence
fragments is made by sliding a window of varied length along a set of
approximately 100-400 training sequences of known structure but minimal
sequence similarity to each other, and the secondary
structure of the central amino acid in each window is recorded. A window of
the same size is selected from the query sequence and compared to each of the
above sequence fragments, and the 50 best matching fragments are identified.
The frequencies of the known secondary structure of the middle amino acid in
each of these matching fragments are then used to predict the secondary
structure of the middle amino acid in the query window.
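The voting step can be sketched as follows. The example makes several simplifying assumptions: fragments are compared by counting identical positions (real programs use substitution matrices or more elaborate scores), the training fragments are invented toy data, and the function name is hypothetical.

```python
from collections import Counter

def predict_center(query_window, fragments, k=50):
    """Predict the state of the middle residue of query_window.

    fragments is a list of (window_sequence, state_of_central_residue)
    pairs taken from training proteins of known structure.
    """
    def identity(a, b):
        # Assumed similarity measure: number of identical positions.
        return sum(x == y for x, y in zip(a, b))

    best = sorted(fragments,
                  key=lambda frag: identity(query_window, frag[0]),
                  reverse=True)[:k]
    votes = Counter(state for _, state in best)
    return votes.most_common(1)[0][0]

# Toy training set: H = helix, E = strand, C = coil.
toy_fragments = [("AEELL", "H"), ("AEELK", "H"), ("VTVTV", "E"), ("GPGSG", "C")]
print(predict_center("AEELL", toy_fragments, k=2))
```

With a realistic training set the default of 50 neighbors would be used, and the vote frequencies themselves can serve as a rough confidence estimate.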
7.2.2 Propensity for Secondary Structure Formation

A number of attempts have been made to predict the secondary structure
by using the amino acid sequence alone. Solution studies of model
polypeptides have indicated that amino acids show large variations in their
propensity to adopt regular conformations. The earliest attempts at secondary
structure prediction were based on parameterization of physical models. These
physico-chemical studies on model polypeptides indicated that the propensity
of an amino acid to extend a helix could be different from its propensity to
nucleate a helix.
Chou and Fasman suggested an approach that was based on a statistical
model. In this approach, the frequency of occurrence of a particular amino acid
in a particular conformation is compared with the average frequency of
occurrence of all amino acids in that conformation. The resulting ratio is the
propensity of the amino acid to occur in that conformation. These values were
used to classify amino acids into different classes and to formulate rules for
secondary structure prediction.
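The propensity ratio can be computed directly from a set of residues with known secondary structure states. The sketch below is a minimal illustration using an invented eight-residue example; the function name and the single-string input format are assumptions.

```python
def propensity(seq, states, residue, state):
    """Propensity of `residue` for `state`.

    Numerator: fraction of occurrences of this residue found in the state.
    Denominator: fraction of all residues found in the state.
    """
    n_residue = seq.count(residue)
    n_in_state = sum(1 for aa, s in zip(seq, states)
                     if aa == residue and s == state)
    frac_residue = n_in_state / n_residue
    frac_all = states.count(state) / len(states)
    return frac_residue / frac_all

# Invented toy data: H = helix, C = coil.
toy_seq = "AEAGAGEG"
toy_states = "HHHCHCHC"
print(propensity(toy_seq, toy_states, "A", "H"))
print(propensity(toy_seq, toy_states, "G", "H"))
```

A value above 1 means the residue is found in that conformation more often than the average residue; a value below 1 means it is found there less often.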
Both Chou and Fasman and GOR methods make use of the idea of
secondary structure propensity. The amino acids seem to have preferences for
certain secondary structure states, which are shown in Table 7.1. For instance,
glutamic acid has a strong preference for the helical secondary structure,
valine has a strong preference for strands, and glycine has lower than average
propensity for both types of regular secondary structure, reflecting a tendency
to be found in loops.
Table 7.1: Helical and strand propensities of the amino acids. A value of 1.0
indicates that the preference of that amino acid for the particular secondary
structure is equal to that of the average amino acid; values greater than one indicate
a higher propensity than the average; values less than one indicate a lower
propensity than the average (The values are calculated by dividing the frequency
with which the particular residue is observed in the relevant secondary structure by
the frequency for all residues in that secondary structure).
Amino acid Helical (α) propensity Strand (β) propensity

GLU 1.59 0.52
ALA 1.41 0.72
LEU 1.34 1.22
MET 1.30 1.14
GLN 1.27 0.98
LYS 1.23 0.69
ARG 1.21 0.84
HIS 1.05 0.80
VAL 0.90 1.87
ILE 1.09 1.67
TYR 0.74 1.45
CYS 0.66 1.40
TRP 1.02 1.35
PHE 1.16 1.33
THR 0.76 1.17
GLY 0.43 0.58
ASN 0.76 0.48
PRO 0.34 0.31
SER 0.57 0.96
ASP 0.99 0.39
The accuracy of these early methods based on the local amino acid
composition of single sequences was fairly low, with often less than 60% of
residues being predicted in the correct secondary structure state.
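The accuracy figure quoted here is a per-residue percentage, often called the three-state accuracy or Q3 (a name not used in this text). A minimal sketch, assuming predicted and observed states are given as equal-length strings over H, E and C:

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

print(q3("HHHHCCEEEC", "HHHCCCEEEE"))
```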
7.2.3 Intrinsic tendency of amino acids to form β-turns

Crystal structure data were analyzed to calculate the frequency of occurrence
of pairs of amino acids in β-turns. The observed frequencies were, Pro-Asn
(63%), Pro-Phe (50%), Pro-Gly (38%), Pro-Ser (31%) and Pro-Val (8%). However,
a statistical analysis using a different criterion for assigning β-turns found a
substantial difference in the order of preference. The order of preference was
found to be: Pro-Gly> Pro-Asn> Pro-Ser> Pro-Val> Pro-Phe, in the set of
protein structures in the database.
The propensity to form β-turns was evaluated by measuring the standard
Gibbs free energy of peptide cyclization in the model tetrapeptides cys-Pro-X-
Pro. The observed order of preference was found to be Pro-Asn> Pro-Gly> Pro-
Ser> Pro-Phe> Pro-Val. Measurements of the temperature dependence of the
(NMR) chemical shifts in the model peptides Tyr-Pro-X-Asp-Val provide an
indication of the β-turn populations. These NMR data indicated that the
β-turn populations were in the order Pro-Gly> Pro-Asn> Pro-Phe> Pro-Ser>
Pro-Val.
A combined analysis of the thermodynamic, solution NMR and crystal
structure (statistical) data indicates that the order of preference is Pro-Gly, Pro-
Asn> Pro-Ser> Pro-Val. Although the relative position of Pro-Phe appears to be
highly variable in this series, for other peptides there is a reasonable
correlation between the statistical preferences calculated from the database of
protein structures and the preferences based on thermodynamic and NMR
measurements on model compounds.
7.2.4 Rotamer Libraries

Rotamers are low energy conformations of side chains. Pioneering work on
side chain conformational preferences has indicated that a few side chain
conformers are much more likely than others. This result stimulated a number
of studies to characterize the probability that the side chain of a given amino
acid will occur in a particular conformation, and its dependence on the
main chain conformation.
Using the vastly increased size of the structure database, a number of such rotamer
libraries have been developed. Rotamer libraries can be used in molecular
modeling to add the most likely side-chain conformation to the backbone.
7.2.5 Three-Dimensional Structure Prediction

Protein structural comparisons have shown that newly found protein
structures often have a similar structural fold or architecture to an already-
known structure. Structural comparisons have also revealed that many
different amino acid sequences in proteins can adopt the same structural fold.
Examination of sequences in structures has also revealed that the same short
amino acid patterns may be found in different structural contexts.
Structural alignment studies have revealed that there are more than 500
common structural folds found in the domains of the more than 12500 three-
dimensional structures that are in the Brookhaven Protein Data Bank. These
studies have also revealed that many different sequences will adopt the same
fold. Thus, there are many combinations of amino acids that can fit together
into the same three-dimensional conformation, filling the available space and
making suitable contacts with neighboring amino acids to adopt a common
three-dimensional structure.
There is also a reasonable probability that a new sequence will possess
an already identified fold. The object of fold recognition is to discover which
fold is best matched. Hidden Markov models (discrete state-space models) and
threading are used to predict three-dimensional structures.
If two proteins share significant sequence similarity, they should also
have similar three-dimensional structures. The similarity may be present
throughout the sequence lengths or in one or more localized regions having
relatively short patterns that may or may not be interrupted with gaps. When a
global sequence alignment is performed, if more than 45% of the amino acid
positions are identical, the amino acids should be quite superimposable in the
three-dimensional structure of the proteins.
Thus, if the structure of one of the aligned proteins is known, the
structure of the second protein and the position of the identical amino acids in
this structure may be reliably predicted. If less than 45% but more than 25% of
the amino acids are identical, the structures are likely to be similar, but with
more variation at the lower identity levels at the corresponding three
dimensional positions.
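The identity thresholds above can be applied to an existing alignment with a short sketch. It assumes the two sequences are already globally aligned, with '-' marking gaps, and that identity is computed over columns where neither sequence has a gap (other denominators are also in use). The function names and category labels are hypothetical.

```python
def percent_identity(a, b):
    """Percent identical residues over aligned, gap-free columns."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return 100.0 * sum(x == y for x, y in columns) / len(columns)

def structural_expectation(pid):
    """Map percent identity onto the ranges quoted in the text."""
    if pid > 45:
        return "structures expected to be superimposable"
    if pid > 25:
        return "structures likely similar, with more variation"
    return "no reliable inference from identity alone"

pid = percent_identity("AC-EF", "ACDQF")
print(pid, structural_expectation(pid))
```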
7.2.6 Comparative Modeling

Comparative modeling, commonly referred to as homology modeling, is useful
when a 3D structure of a sequence that shares substantial similarity to the
protein sequence of interest is available. The two sequences are aligned to
identify segments that share sequence similarity. If more than one structure is
available, multiple sequence alignment is used.
It is noted that the reliability of structure prediction from the comparative
modeling approach increases substantially if more than one structure of a
protein with substantial sequence similarity is available. The quality of the
alignment substantially affects the accuracy of the subsequent structure
prediction.
After alignment has been used to identify corresponding residues, the
structure of the desired protein is predicted by making use of the structure of
the homolog. Several algorithms are available for this step. They can be
broadly classified as: (i) rigid body assembly, (ii) segment matching and (iii)
satisfaction of spatial restraints.
In the rigid body assembly approach, the structure is assembled from
rigid bodies that represent the core, loop regions, side chains, etc. These rigid
bodies are identified from related structures and added onto a framework that
is obtained by averaging the positions of the template atoms in the conserved
regions of the fold. In the segment matching procedure coordinates are
calculated from the approximate positions of conserved atoms of the templates.
For this, use is made of a database of short segments of protein structure.
This may be supplemented by energy or geometry rules. In the satisfaction of
spatial restraints approach, the alignment of the sequence of interest with one
or more structural templates is used to derive a set of distance constraints.
Subsequently, distance geometry, restrained energy minimization or
restrained molecular dynamics can be used to obtain the structure.
Steps
Steps in comparative modeling are:
1. Align the amino acid sequences of the target and the protein or proteins
of known structure.
2. Determine main chain segments to represent the regions containing
insertions or deletions. Stitching these regions into the main chain of
the known protein creates a model for the complete main chain of the
target protein.
3. Replace the side chains of residues that have been mutated. For
residues that have not mutated, retain the side chain conformation.
4. Examine the model (both by eye and by programs) to detect any serious
collisions between atoms. Remove these collisions.
5. Refine the model by limited energy minimization.
7.2.7 Threading
Threading is a method for fold recognition. Given a library of known
structures and the sequence of a query protein of unknown structure, the question is whether the
query protein shares a folding pattern with any of them. Threading is a technique to match a
sequence with a protein shape. Threading is based on the observation that
even proteins that have very low sequence identity often have similar
structures.
Threading may be used in the absence of any substantial sequence
identity to proteins of known structure, whereas, comparative modeling
requires protein structures that have substantial sequence similarity to the
protein sequence of interest. The sequence of interest is matched against a
database of known folds and the protein is assumed to have the same fold as
the best match.
Theoretical considerations indicate that the total number of possible
folds for proteins is limited. Hence it is possible to predict the structure of a
representative protein for each possible fold. The basic idea of threading is to
build many rough models of the query protein, based on each of the known
structures and using different possible alignments of the sequences of the
known and unknown proteins.
Threading approaches may be based on sequence information, structural
information or both. The two essential components of threading are: (a) finding
an optimal alignment (with gaps) of a sequence onto a structure and (b)
scoring different alignments and deciding on the best shape. Scoring may be
carried out by (i) mapping the structural information to create a profile for each
structural site, or (ii) using a potential based on pairwise interactions.
In general, the models based on pairwise interactions have greater
discriminatory ability. However, it is more difficult and more computationally
expensive to find an optimal alignment using a pairwise interaction potential.
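The profile-based scoring idea can be sketched as follows. This is a deliberately simplified illustration, not a real threading program: each template is reduced to a string of structural environments ('B' for buried, 'E' for exposed), the scoring rule (+1 or -1) is an assumed toy rule, and the alignment is ungapped, whereas real threading allows gaps. The fold names and the query sequence are hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed toy classification of residues


def site_score(residue, environment):
    """+1 when the residue type suits the site, otherwise -1."""
    hydrophobic = residue in HYDROPHOBIC
    if environment == "B":                # buried site prefers hydrophobic
        return 1 if hydrophobic else -1
    return -1 if hydrophobic else 1       # exposed site prefers polar


def thread(sequence, environments):
    """Best ungapped placement of a sequence on one template.

    Returns (best_score, offset), or None if the template is too short.
    """
    if len(environments) < len(sequence):
        return None
    best = None
    for offset in range(len(environments) - len(sequence) + 1):
        score = sum(site_score(r, environments[offset + i])
                    for i, r in enumerate(sequence))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def recognize_fold(sequence, library):
    """Score the sequence against every template and return the best match."""
    results = {}
    for name, environments in library.items():
        hit = thread(sequence, environments)
        if hit is not None:
            results[name] = hit
    if not results:
        raise ValueError("no template is long enough for the query")
    best_name = max(results, key=lambda n: results[n][0])
    return best_name, results


library = {"fold_A": "BBEEBBEE", "fold_B": "EEEEBBBB"}  # hypothetical folds
query = "KDSELVIL"                                      # hypothetical query
print(recognize_fold(query, library))
```

A pairwise interaction potential would replace `site_score` with a term that depends on which residues end up close to each other, which is why finding the optimal alignment becomes more expensive.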
7.2.8 Energy-based Prediction of Protein Structure
The essence of energy based approaches is to compute the conformation-dependent
potential energy for different conformations; the conformation with
the lowest energy is assumed to be the structure of the molecule under
investigation. The form of the potential energy function is based upon the
known physics of interacting bodies. The potential energy function contains
terms corresponding to well understood interactions such as coulombic
interaction between charged bodies, terms for interaction between polarizable
atoms etc.
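As a minimal sketch of such a function, the code below sums a coulombic term and a Lennard-Jones 12-6 term over all atom pairs and picks the conformation of lowest energy. The Lennard-Jones form, the parameter values (epsilon, sigma) and the two toy conformations are assumptions chosen for illustration; real force fields use atom-type-specific parameters and many more terms.

```python
import math

COULOMB_K = 332.06  # kcal*Angstrom/(mol*e^2), converts q1*q2/r to kcal/mol


def coulomb(q1, q2, r, dielectric=1.0):
    """Coulombic interaction energy between two point charges."""
    return COULOMB_K * q1 * q2 / (dielectric * r)


def lennard_jones(r, epsilon, sigma):
    """Lennard-Jones 12-6 energy (assumed van der Waals form)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def potential_energy(atoms, epsilon=0.1, sigma=3.4):
    """Sum pairwise terms; atoms is a list of (x, y, z, charge) tuples."""
    total = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            r = math.dist(atoms[i][:3], atoms[j][:3])
            total += coulomb(atoms[i][3], atoms[j][3], r)
            total += lennard_jones(r, epsilon, sigma)
    return total


# Two hypothetical conformations of the same pair of oppositely charged atoms.
conformations = {
    "close": [(0.0, 0.0, 0.0, +1.0), (3.0, 0.0, 0.0, -1.0)],
    "far":   [(0.0, 0.0, 0.0, +1.0), (6.0, 0.0, 0.0, -1.0)],
}
energies = {name: potential_energy(a) for name, a in conformations.items()}
lowest = min(energies, key=energies.get)
print(energies, lowest)
```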
In the case of force fields that allow variable geometry, terms are included
for deviations from an assumed ‘ideal geometry’. The ideal geometries for
different residues are defined, based on examination of high-resolution
structures of model compounds. The parameters for the potential energy
function may be obtained from ab initio quantum mechanical calculations or
from thermodynamic, spectroscopic or crystallographic data or a combination
of these three.
Ab initio based attempts to locate the global energy minimum have been less
successful than the knowledge based approaches. The reasons for this have
been: (i) the inaccuracy of existing energy functions and (ii) the computational
difficulty in searching for the global minimum.
The development of energy based methods (in particular force fields based
fully on the physics of interacting bodies and capable of recognizing the native
structure as the lowest-energy one) would be a major step forward towards
understanding the role of particular interactions in the formation of protein
structure and the mechanisms of protein folding.
For practical reasons, a global minimum search of real-size proteins is
unfeasible at the all-atom level; therefore united-residue models of polypeptide
chains have received greater attention. After the global minimum is found at
the united-residue level, it can be converted to an all-atom representation,
followed by a limited exploration of the conformational space in the
neighborhood of the converted structure. The whole approach is referred to as
the hierarchical approach to protein folding.
7.2.9 Protein Function Prediction

Comparison of protein structures may reveal relationships with distant
homologs of known function, and this homology might be used to predict
function. If the homologs have a high degree of sequence similarity then
methods based on sequence comparison might be adequate for identifying the
relationships. However, when the sequence similarity is low, then a
comparison of protein structure might reveal relationships that were not
evident by using methods that are based only upon analysis of the sequences
of the proteins.
As proteins evolve they may (i) retain function and specificity, (ii) retain
function but alter specificity, (iii) change to a related function or a similar
function in a different metabolic context, and (iv) change to a completely
unrelated function. Proteins of similar structure and even of similar sequence
can be recruited for very different functions. Very widely diverged proteins
may retain similar functions. Moreover just as many different sequences are
compatible with the same structure, unrelated proteins with different folds can
carry out the same functions.
In a series of homologous enzymes the identification of a set of highly
conserved residues that are spatially close but are not required for structural
stabilization might indicate that they are the active site residues. The nature of
the active site residues might provide clues about the function and the
mechanism of action of the enzyme.
Domains
Certain proteins contain specific modules that mediate protein-protein
interactions. The identification of such domains in a particular protein can
provide clues about its interacting partners. For example, the presence of an
SH2 domain or a PTB domain in a protein indicates that it will bind to another
protein containing a phosphotyrosine residue.
The presence of the monomeric PDZ domain indicates that the protein might
interact either with another protein that contains a PDZ/LIM domain or with
the C-terminal region of membrane proteins. The presence of a Pleckstrin
homology domain in a protein indicates that it is likely to be involved in signal
transduction and that it might bind to the acid rich regions of proteins involved
in signal transduction or to phosphoinositides.
In X-ray diffraction studies of crystals, the technique of molecular
replacement is used to obtain an initial set of phases. If a protein that shares
substantial sequence similarity with the protein of interest is available in the
database, then its structure may be used for building a model of the protein of
interest using comparative modeling.
The coordinates of the atoms in this structure can be used for calculating
the structure factors. The phase of the resulting structure factors and the
measured values of the magnitudes of the structure factors are then used for
calculation of a new electron density model. The resulting model can then be
subjected to Fourier or least-squares refinement.
7.3 PROTEIN PREDICTION PROGRAMS

A number of computational tools have been developed for making predictions
regarding the identification of unknown proteins based on chemical and
physical properties of each of the 20 amino acids. Many of these tools are
available through the ExPASy server at the Swiss Institute of Bioinformatics.
AACompIdent uses the amino acid composition of an unknown protein to
identify known proteins of the same composition. AACompSim, a variant of
AACompIdent, uses the sequence of a SWISS-PROT protein.
PROPSEARCH uses amino acid composition of a protein to detect weak
relationships between proteins to discern members of the same protein family.
MOWSE (Molecular Weight Search) algorithm uses information obtained
through mass spectrometric techniques. There are a few other tools, which
help to analyse physical properties based on sequence. Compute pI/MW and
ProtParam calculate the isoelectric point and molecular weight of an input
sequence.
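The two simplest of these quantities, amino acid composition and molecular weight, can be sketched directly. The code below is an illustration only and is not the algorithm of any ExPASy tool: it uses average residue masses (in daltons) and adds one water molecule for the free termini. The isoelectric point is left out because it depends on a chosen set of pKa values.

```python
from collections import Counter

# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528


def molecular_weight(sequence):
    """Average molecular weight of an unmodified peptide chain."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def composition(sequence):
    """Percentage of each amino acid present in the sequence."""
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}


peptide = "MKTAYIAK"  # hypothetical peptide
print(round(molecular_weight(peptide), 2), composition(peptide))
```

A composition-based search in the spirit of AACompIdent would compare such a percentage table against the tables of known proteins and rank them by similarity.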
PeptideMass determines the cleavage products of a protein after
exposure to a given protease or chemical reagent. TGREASE calculates the
hydrophobicity of a protein along its length. SAPS (Statistical Analysis of
Protein Sequences) algorithm provides extensive statistical information for any
given query sequence.
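A hydrophobicity profile along the length of a protein can be sketched as a sliding-window average. The sketch assumes the Kyte-Doolittle hydropathy scale and a window of 7 residues; both are illustrative choices, and the example sequence is hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Mean hydropathy of every window of the given length."""
    if window > len(sequence):
        return []
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in sequence[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


sequence = "MKKLLIVLAVIGGSDEKR"  # hypothetical sequence
for start, value in enumerate(hydropathy_profile(sequence), start=1):
    print(start, round(value, 2))
```

Stretches of consistently high values suggest buried or membrane-spanning segments; strongly negative stretches suggest surface-exposed regions.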
There are a few other tools used to analyse motifs and patterns. BLAST
searches are performed to identify sequences in the public databases that are
similar to a query sequence of interest. PSI-BLAST is used to identify new,
distantly related members of a protein family. ProfileScan uses a method
called pfscan to find similarities between a protein or nucleic acid query
sequence and a profile library.
The BLOCKS database utilizes the concept of blocks to identify a family of
proteins, rather than relying on the individual sequences themselves. CDD
(Conserved Domain Database) is used to identify conserved domains within a
protein sequence.
There are a few tools which are used to analyse secondary structure and
folding classes. The nnpredict algorithm uses a two-layer, feed-forward neural
network to assign the predicted type for each residue based on FASTA format.
PredictProtein uses SWISS-PROT, MaxHom and PHDsec algorithms to predict
secondary structure.
The PREDATOR algorithm uses database-derived statistics on residue-
type occurrences in different classes of local hydrogen-bonded structures. The
PSIPRED uses two feed-forward neural networks to perform the analysis on
the profile obtained from PSI-BLAST.
SOPMA (Self-Optimized Prediction Method) builds sub-databases of
protein sequences with known secondary structure prediction based on
sequence similarity. The information from the sub-databases is then used to
generate a prediction on the query sequence. SOPMA is a combination of five
other methods (Garnier-Gibrat-Robson (GOR) method, Levin homolog method,
double-prediction method, PHD method, and CNRS method).
Jpred integrates six different structure prediction methods and returns a
consensus prediction based on simple majority rule. The Jpred server runs
PHD, DSC, NNSSP, PREDATOR, ZPRED and MULPRED.
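The majority rule itself is easy to sketch. The code below takes per-residue predictions from several methods (H = helix, E = strand, C = coil) and returns the state with the most votes at each position. How ties are resolved is an assumption of this sketch (they are marked '?'); the actual Jpred server may handle them differently. The three prediction strings are hypothetical.

```python
from collections import Counter


def consensus(predictions):
    """Per-residue majority vote over equal-length prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        top_state, top_votes = ranked[0]
        # Assumption: a tie for first place is reported as unresolved.
        if len(ranked) > 1 and ranked[1][1] == top_votes:
            result.append("?")
        else:
            result.append(top_state)
    return "".join(result)


predictions = ["HHHEEC", "HHCEEC", "HCCEHC"]  # hypothetical method outputs
print(consensus(predictions))
```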
There are some algorithms which are useful to identify specialized
structures or features. COILS algorithm runs a query sequence against a
database of proteins known to have a coiled-coil structure. TMpred and
PHDtopology are used to predict transmembrane regions. SignalP is used to
detect signal peptides and their cleavage sites. SEG is used to detect
nonglobular regions. DALI, SWISS-MODEL and TOPITS are used for tertiary
structure prediction.
ROSETTA is a program that predicts protein structure from amino acid
sequence by assimilating information from known structures. It predicts a
protein structure by first generating structures of fragments using known
structures, and then combining them. LINUS (Local Independently Nucleated
Units of Structures) is a program for prediction of protein structure from amino
acid sequence. It is a completely a priori procedure, making no explicit
reference to any known structure or sequence structure relationships.
7.4 MOLECULAR VISUALIZATION

Molecular visualization helps scientists to bioengineer protein molecules.
There are a number of software packages, both free and commercial, which help in
visualizing biomolecules. The most commonly used free programs are:
RasMol, Chime, MolMol, Protein Explorer and Kinemage.
The name RasMol is derived from Raster (the array of pixels on a computer screen)
and Molecules. It is a molecular graphics program intended to visualize
proteins, nucleic acids and small molecules for which a 3D structure is
available. In order to display a molecule, RasMol requires an atomic
coordinate file that specifies the position of every atom in the molecule through
its 3D Cartesian coordinates. RasMol accepts this coordinate file in a variety of
formats including PDB format. The visualization provides the user a choice of
color schemes and molecular representations. RasMol can be run outside a web
browser. The home page is: www.umass.edu/microbio/rasmol
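To show what such a coordinate file contains, the following sketch reads the atom name, residue, chain and Cartesian coordinates from ATOM records in PDB format, which stores each field in fixed columns. The helper `format_atom` is hypothetical and exists only to build a correctly aligned example record (columns 1 to 54); it is a simplification of the full record layout.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a minimal fixed-column ATOM record (columns 1-54 only)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}"
            f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(pdb_text):
    """Extract atoms from ATOM/HETATM records using the fixed columns."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),       # columns 13-16
                "res_name": line[17:20].strip(),   # columns 18-20
                "chain": line[21],                 # column 22
                "res_seq": int(line[22:26]),       # columns 23-26
                "xyz": (float(line[30:38]),        # columns 31-38
                        float(line[38:46]),        # columns 39-46
                        float(line[46:54])),       # columns 47-54
            })
    return atoms


# Hypothetical two-atom example.
text = "\n".join([
    format_atom(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
])
for atom in parse_atoms(text):
    print(atom)
```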
RasMol and RasTop

RasMol is the program for molecular visualization. RasTop is the graphical
interface to RasMol. Roger Sayle from the Biocomputing Research unit at the
University of Edinburgh, UK and Biomolecular Structure Department, Glaxo
Research and Development, Greenford, UK, developed RasMol initially.
RasTop helps in viewing and manipulating macromolecules and
micromolecules on screen. It is user friendly. Each command in the menu
generates its own script which is transferred to RasMol.
RasTop is helpful in addition or subtraction of atoms, groups, or chains
in selection on screen with a lasso, in going back to the previous selection,
copying and pasting selections, in setting operations such as inverse,
extraction, summation, subtraction, exclusion, and in saving work session
under a script format called RSM script.
RasTop permits opening of several molecules at the same time in the
same window and several windows at the same time.
Procedure
A. When we want to measure the bond length the following steps can be used:
1. Open RasMol and load a file of pdb atom coordinates (downloaded
from the PDB databank).
2. Use various menu options to get a feel of the molecule.
3. Open RasTop, the molecular visualization tool.
4. From the file menu, open a PDB atom coordinate file.
5. Rotate the molecule.
6. Use the options in the menu and command line.
7. Set the display style to ball and stick.
8. Zoom the molecule to visualize the bonds better (shift + mouse down).
9. Go to command line window
10. Type set picking distance and press Enter key.
11. Go to the display window and select the two atoms participating in the
bond formation by clicking on them successively.
12. The bond length will appear in the command line window.
13. Note down the results.
If we want to show a bond and measure the bond length between
two atoms, we can also use the following after going to the command
line window.
14. Type set picking monitor in the command line window.
15. Click on the atom again (only once). A bond line will appear.
16. Note down the results from the command line window.
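The value reported for a bond length is simply the Euclidean distance between the two picked atoms. A minimal sketch, with hypothetical coordinates in angstroms:

```python
import math


def bond_length(atom_a, atom_b):
    """Distance between two atoms given as (x, y, z) coordinates."""
    return math.dist(atom_a, atom_b)


n_atom = (27.340, 24.430, 2.614)    # hypothetical coordinates
ca_atom = (26.266, 25.413, 2.842)   # hypothetical coordinates
print(round(bond_length(n_atom, ca_atom), 3))
```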
B. When we want to measure the bond angle the following steps can be used:
1. Set the display style to ‘ball and stick’
2. Zoom the molecule (shift + mouse down)
3. Go to the command line window
4. Type set picking angle and press Enter key
5. Go to display window and select the three atoms forming the bond
angle by clicking on them successively.
6. The bond angle will appear in the command line window.
7. Note down the results
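The bond angle for three picked atoms is the angle at the middle atom between the two bond vectors. A minimal sketch with hypothetical coordinates:

```python
import math


def bond_angle(atom_a, atom_b, atom_c):
    """Angle a-b-c in degrees, with atom_b at the vertex."""
    v1 = [a - b for a, b in zip(atom_a, atom_b)]
    v2 = [c - b for c, b in zip(atom_c, atom_b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))


print(bond_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```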
C. When we want to measure the torsion angle the following steps can be used:
1. Set the display style to ‘ball and stick’
2. Zoom the molecule (shift + mouse down)
3. Go to the command line window
4. Type set picking torsion and press Enter key
5. Go to display window and select the four atoms forming phi and psi
angle by clicking on them successively.
[The click sequence for phi is: carbonyl C of residue (i-1), N of residue
i, CA of residue i, and carbonyl C of residue i. The click sequence for
psi is: N of residue i, CA of residue i, carbonyl C of residue i, and
N of residue (i+1)].
6. After successive clickings the torsion angle will appear in the command
line window.
7. Note down the results
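The torsion angle defined by four atoms can be computed from their coordinates as sketched below. The sketch uses the common atan2 formulation with three bond vectors; the sign convention (0 degrees for cis, 180 degrees for trans) is stated as an assumption of this example. For phi the four points are C(i-1), N(i), CA(i), C(i); for psi they are N(i), CA(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p0, p1, p2, p3):
    """Dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: the fourth atom is rotated by 90 degrees.
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```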
RasTop is available on Windows, Linux and Mac platforms. To
install, extract the RasTop folder from the RasTop zip file and
place it in any directory. To start RasTop, double click on the RasTop
icon. It will display a single main window with one empty graphic
window, the color window and the command line window.
To view the molecule we have to load the correct file after choosing
the correct path. Then we can click molecule to select information
about the molecule. In the command line use the option Show to get
information about world, atom selection, group selection, chain
selection, coordinates, phi, psi, Ramprint, sequence, symmetry, etc.
The main menu window has an Atoms button. Click it, select Spacefill
and display; then click Atoms again, select Labels and display. The RasMol
‘spacefill’ command is used to represent all of the currently selected atoms as
solid spheres. This command is used to produce both union-of-spheres
and ball-and-stick models of a molecule. [The following
command line forms are used in RasMol and RasTop: spacefill <boolean>,
spacefill temperature, spacefill user, spacefill [-] <value>]
To see the bonds, click Bonds, select Hbonds and display. In the
3D structure dotted lines represent Hbonds; after viewing, remove the
bonds by clicking the Remove button. To display the loaded
protein in the ribbon form (a smooth solid ribbon surface passing
along the backbone of the protein), click Ribbon and select Ribbons;
this can be combined with other representations such as Strands, Cartoons,
Trace and Backbone. After display click the Remove button. We can
learn more about RasTop by exploring ‘Help RasTop’.
Chime and Protein Explorer are derivatives of RasMol that allow visualization
inside web browsers. Hence, they can be used only online. Chime can be reached
at www.Umass.edu/microbio/Chime
MolMol stands for Molecule analysis and Molecule display. MolMol is a
molecular graphics program for display, analysis and manipulation of three-
dimensional structures of biological macromolecules with special emphasis on
NMR solution structures of proteins and nucleic acids. MolMol can be reached
at www.mol.biol.ethz.ch/ wuthrich/software/molmol
Kinemage (kinetic images) allows the user to move two molecules or
parts of a molecule complex, relative to each other. Molscript is a tool for
making cartoons of secondary structural elements. Grasp is used for
visualization of the surface. Swiss-PdbViewer produces high quality images
using ray tracing methods. Insight II is a commercial software package that also
supports hardware for interactive 3D viewing.
Some websites
Compute pI/MW : http://www.expasy.ch/tools/pi_tool.html
MOWSE : http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
PeptideMass : http://www.expasy.ch/tools/peptide-mass.html
TGREASE : http://ftp.virginia.edu/pub/fasta/
SAPS : http://www.isrec.isb-sib.ch/software/SAPS_form.html
AACompIdent : http://www.expasy.ch/tools.aacomp/
AACompSim : http://www.expasy.ch/tools/aacsim/
PROPSEARCH : http://www.embl-heidelberg.de/prs.html
BLOCKS : http://blocks.fhcrc.org
Pfam : http://www.sanger.ac.uk/software/Pfam/
PRINTS : http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html
ProfileScan : http://www.isrec.isb-sib-ch/software/PFSCAN-form.html
nnpredict : http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
PredictProtein : http://www.embl-heidelberg.de/predictprotein/
SOPMA : http://pbil.ibcp.fr/
Jpred : http://jura.ebi.ac.uk:8888/
PSIPRED : http://insulin.brunel.ac.uk/psipred
PREDATOR : http://www.embl-heidelberg.de/predator/predator_ifno.html
COILS : http://www.ch.embnet.org/software/COILS_form.html
PHDtopology : http://www.embl-heidelber.de/predictprotein
SignalP : http://www.cbs.dtu.dk/services/signalP/
TMpred : http://www.isrec.isb-sib.ch/ftp-server/tmpred/www/TMPRED_form.html
DALI : http://wwwz.ebi.ac.uk/dali/
SWISS-MODEL : http://www.expasy.ch/swissmod/SWISS-MODEL.html
TOPITS : http://www.embl-heidelberg.de/predictprotein/
STUDY QUESTIONS
1. What are the uses of prediction?
2. What are the strategies used in gene prediction?
3. How do we predict mRNA structure?
4. Give examples of some of the commonly used methods for gene
prediction.
5. What is the necessity to predict protein structures?
6. What is a Ramachandran plot? What are its uses?
7. How do we predict secondary structure?
8. Describe the intrinsic tendency of amino acids to form β-turns.
9. What is a Rotamer Library?
10. Distinguish between ab initio and knowledge-based methods of
prediction.
11. How is comparative modeling done?
12. What are the steps involved in comparative modeling?
13. What is threading?
14. What is energy-based prediction?
15. How is protein function prediction done?
16. Give examples of some protein prediction programs.
17. What is molecular visualization? Give some examples of programs for
molecular visualization.
CHAPTER 8
Homology, Phylogeny and Evolutionary Trees
Homology specifically means descent from a common ancestor. Usually
descendants of a common ancestor show similarities in several characters.
Such characters are called homologous characters. Charles Darwin studied the
Galapagos finches in 1835, noting the differences in the shapes of their beaks
and the correlation of beak shape with diet. Finches that eat fruits have beaks
like those of parrots, and finches that eat insects have narrow, prying beaks.
These observations were seminal to the development of Darwin’s ideas on the
evolutionary basis for the origin of species.
8.1 HOMOLOGY AND SIMILARITY

Many a time the words homology and similarity are used interchangeably even
though they are technically different. Similarity is the measurement of
resemblance or difference and it is independent of the source of the
resemblance. Similarity can be observed in the data that are collectable at
present and it involves no historical hypothesis. In contrast, assertions of
homology require inferences about historical events, which are almost always
unobservable. Similarity is quantifiable but homology is more qualitative.
Sequences are said to be homologous if they are related by divergence
from a common ancestor. When protein folds are similar but the sequences are
different, such folds are usually considered to be analogous. The essence of
sequence analysis is the detection of homologous sequences by means of
routine database searches, usually with unknown or uncharacterized query
sequences. Homology is not a measure of similarity, but an absolute statement
that sequences have a divergent rather than a convergent relationship.
In practice, sequences that share an arbitrary threshold level of similarity
determined by alignment of matching bases are often termed homologous. They are
inherited from a common ancestor that possessed similar structure, although
the structure of the ancestor may be difficult to determine because it has been
modified through descent.
Orthologs, Paralogs and Xenologs
Homologs are either orthologs, paralogs or xenologs. Homologous genes that
share a common ancestry and function in the absence of any evidence of gene
duplication are called orthologs. (When there is evidence for gene duplication,
the genes in an evolutionary lineage derived from one of the copies and with
the same function are also referred to as orthologs).
Orthologs are produced by speciation. They represent genes derived from
a common ancestor that diverged due to divergence of the organisms they are
associated with. They tend to have similar function.
Paralogs are produced by gene duplication. They represent genes derived
from a common ancestral gene that duplicated within an organism and then
subsequently diverged. The two copies of duplicated genes and their progeny
in the evolutionary lineage are referred to as paralogs. They tend to have
different functions.
Xenologs are produced by horizontal gene transfer between two
organisms. In other cases, similar regions in sequences may not have a
common ancestor but may have arisen independently by two evolutionary
pathways converging on the same function, called convergent evolution.
Study of Orthologous and Paralogous Proteins

Among homologous sequences, it is useful to distinguish between proteins
that perform the same function in different species (orthologs) and those that
perform different but related functions within one organism (paralogs).
Sequence comparison of orthologous proteins opens the way to the study of
molecular paleontology. In particular cases, construction of phylogenetic trees
has revealed relationships, for example, between proteins in bacteria, fungi
and mammals, and between animals, insects and plants. Such kinds of
inferences are unearthed only by investigation at the molecular level.
The study of paralogous proteins, on the other hand, has provided
deeper insights into the underlying mechanisms of evolution. Paralogous
proteins arose from single genes via successive duplication events. The
duplicated genes have followed separate evolutionary pathways, and new
specificities have evolved through variation and adaptation. The emergence of
different specificities and functions following gene duplication events may be
detected by protein sequence comparison.
For example, different visual receptors (opsins), which diverged from
each other early in vertebrate evolution, are stimulated by different
wavelengths of light. Human long-wavelength opsins (i.e. those sensitive to red
and green light) are more closely related to each other (with around 95%
sequence identity) than either sequence is to the short-wavelength blue-opsins,
or to the rhodopsins (the achromatic receptors), with which they share an
average 43% identity. The complexity that arises from the richness of such
paralogous, and of orthologous, relationships presents a significant challenge
for protein family classification.
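Percent identity figures such as those quoted for the opsins are computed from an alignment. A minimal sketch follows; it assumes the two sequences are already aligned and counts identity only over columns where neither sequence has a gap, which is one of several conventions in use. The example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


print(round(percent_identity("ACDEFG", "ACDQFG"), 1))
```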
Modular Proteins
Much of the challenge of sequence analysis involves the marriage of biological
information with sequence data. This process is made more difficult by the
problem of orthology versus paralogy. The analytical process is further
complicated by the fact that, sometimes, sequence similarity is confined only to
some part of an alignment. This scenario is encountered, in particular, when
we study modular proteins.
Modules may be thought of as a subset of protein domains; they are
autonomous folding units that are contiguous in sequence, and are frequently
used as protein building blocks. As building components they may be used to
confer a variety of different functions on the parent protein, either through
multiple combinations of the same module, or via combinations of different
modules to form mosaics.
In genetic terms, the spread of modules cannot be explained simply by
gene duplication and fusion events, but is thought to be the result of genetic
shuffling mechanisms. Whatever the actual process, it appears that Nature
behaves rather like a tinker, using a patchwork of existing components to
produce a new, workable whole. Evolution, it seems, does not produce
novelties from scratch, but works with old material, either transmogrifying
a system to give it new functions, or combining several systems to produce a
more elaborate one.
8.2 PHYLOGENY AND RELATIONSHIPS

Normally living organisms are classified into groups based on observed
similarities and differences. If two organisms are very closely related to each
other, it is assumed in principle that they share a recent common ancestor.
Phylogeny is the description of biological relationships, usually expressed as a
tree. Similarities and differences between organisms are used to infer
phylogeny. The study of the evolutionary relationships among
organisms is called phylogenetics.
Phylogenetic analysis refers to the act of inferring or estimating these
relationships. The evolutionary history inferred from
phylogenetic analysis is usually depicted as branching, treelike diagrams that
represent an estimated pedigree of the inherited relationships among
molecules, organisms or both.
A statement of phylogeny among various organisms assumes homology
and depends on classification. Phylogeny states a topology (patterns of
ancestry) of the relationships based on classification according to similarity of
one or more sets of characters, or on a model of evolutionary processes. In
many cases, phylogenetic relationships based on different characters are
consistent, and support one another.
Evolutionary Tree
The relationships among species, populations, individuals or genes are taken
in the literal sense of kinship or genealogy, that is, assignment of a scheme of
descendants of a common ancestor. The results are usually presented in the
form of an evolutionary tree. Such a tree, showing all descendants of a single
original ancestral species, is said to be rooted. Evolutionary trees determined
from genetic data are often based on inferences from the patterns of similarity
(Fig. 8.1).
[Figure 8.1: tree diagram. Leaf labels: Screwworm fly, Silkworm moth, Baker’s yeast, Rattlesnake, Bread mold, Kangaroo, Candida, Humans, Donkey, Monkey, Chicken, Penguin, Pigeon, Horse, Turtle, Dog, Tuna, Pig, Rabbit. Group labels: Mammals, Vertebrates, Insects, Animals, Fungi.]
Fig. 8.1 Evolutionary tree of fungi and animals
Phylogenetic analysis of a family of related nucleic acid or protein
sequences is a determination of how the family might have been derived
during evolution. The evolutionary relationships among the sequences are
depicted by placing the sequences as outer branches on a tree. The branching
relationships on the inner part of the tree then reflect the degree to which
different sequences are related. The objective of phylogenetic analysis is to
discover all of the branching relationships in the tree and the branch lengths.
On the basis of the analysis of nucleic acid or protein sequences, the most
closely related sequences can be identified by their position as the neighboring
branches on a tree. When a gene family is found in an organism or group of
organisms, phylogenetic relationships among the genes can help to predict
which ones might have an equivalent function.
When the sequences of two nucleic acid or protein molecules found in
two different organisms are similar, they are likely to have been derived from a
common ancestor sequence. A sequence alignment reveals which positions in
the sequences were conserved and which diverged from a common ancestor
sequence. When one is quite certain that the two sequences share an
evolutionary relationship, the sequences are referred to as being homologous.
An evolutionary tree is a two-dimensional graph showing evolutionary
relationships among organisms or evolutionary relationships in genes from
separate organisms. The separate sequences are referred to as taxa, defined as
phylogenetically distinct units on the tree. It is important to recognize that
each node in the tree represents a splitting of the evolutionary path of the gene
into two different species that are isolated reproductively.
8.2.1 Approaches used in Phylogenetic Analyses

Phenetic (or clustering), cladistic and evolutionary systematic approaches are
used in the study of phylogenetics.
Phenetic and Cladistic Approaches

In the phenetic approach, species are grouped together based on phenotypic
resemblance (similarity) and all characters are taken into account. The
phylogenetic relationship achieved through the phenetic approach is usually
non-historical. In the cladistic approach, species are grouped together only with those
that share derived characters, that is, characters that were not present in their
distant ancestors. The cladistic approach is based on genealogy. This approach is
considered to be the best method for phylogenetic analysis because it accepts
and employs current evolutionary theory, that is, that speciation occurs by
bifurcation (cladogenesis).
The cladistic approach considers possible pathways of evolution, infers
the features of the ancestor at each node, and chooses an optimal tree according to
some model of evolutionary change. The basic point behind cladistics is that
members of a group or clade share a common evolutionary history and are
more related to each other than to members of another group.
A given group is recognized by sharing some unique features that were
not present in distant ancestors. These shared and derived characteristics can
be anything that can be observed and described. Usually cladistic analysis is
performed using either multiple phenotypic characters or multiple base pairs or
amino acids in a sequence. Phenetics is based on similarity; cladistics is based
on genealogy.
There are three basic assumptions in cladistics:
(i) Any group of organisms is related by descent from a common ancestor.
(ii) There is a bifurcating pattern.
(iii) Change in characteristics occurs in lineages over time.
Clade, Taxon and Node

A clade is a monophyletic taxon. Clades are groups of organisms or genes that
include the most recent common ancestor of all of their members and all of the
descendants of that most recent common ancestor (Clade is derived from the
Greek word ‘klados’ which means branch or twig). A taxon is any named
group of organisms, but not necessarily a clade. A node is a bifurcating branch
point. Branch lengths correspond to divergence in some cases (Fig. 8.2).
[Figure 8.2: tree diagram. Labels: Man and Chimpanzee (a clade); Rhesus monkey (another clade).]
Fig. 8.2 The relationship between 3 animals shown as a branch of a tree
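The definition of a clade can be made concrete with a small sketch. A rooted tree is written here as nested tuples (an assumed representation chosen for brevity), and a group of leaves is a clade if it is exactly the set of all descendants of some node. The function names are hypothetical.

```python
def leaves(tree):
    """Set of leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaves(child)
    return result


def clades(tree):
    """Leaf sets of every node in the tree, single leaves included."""
    found = [leaves(tree)]
    if not isinstance(tree, str):
        for child in tree:
            found.extend(clades(child))
    return found


def is_clade(tree, group):
    """True if the group is the complete set of descendants of some node."""
    return frozenset(group) in clades(tree)


tree = (("Man", "Chimpanzee"), "Rhesus monkey")
print(is_clade(tree, {"Man", "Chimpanzee"}))            # a clade
print(is_clade(tree, {"Chimpanzee", "Rhesus monkey"}))  # not a clade
```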
Methods
Three methods – maximum parsimony, distance and maximum likelihood –
are generally used to find the evolutionary tree or trees that best account for the
observed variation in a group of sequences.
Maximum Parsimony Method

Maximum parsimony method (minimum evolution method) predicts the
evolutionary tree that minimizes the number of steps required to generate the
observed variation in the sequences. A multiple sequence alignment is
required to predict which sequence positions are likely to correspond. These
positions will appear in vertical columns in the multiple sequence alignments.
For each aligned position, phylogenetic trees that require the smallest
number of evolutionary changes to produce the observed sequence changes are
identified. This analysis is continued for every position in the sequence
alignment. Finally, those trees that produce the smallest number of changes
overall for all sequence positions are identified. Maximum parsimony method
is used to construct trees on the basis of the minimum number of mutations
required to convert one sequence into another. The main programs for
maximum parsimony analysis in the PHYLIP package are DNAPARS,
DNAPENNY, DNACOMP, DNAMOVE and PROTPARS.
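The column-by-column counting described above can be sketched in a few lines of Python. This is only an illustration, assuming Fitch's counting rule for a fixed rooted binary tree; it is not the code used by DNAPARS or any other PHYLIP program, and the four sequences and the two candidate trees are hypothetical.

```python
# Minimal sketch: count the changes a given tree needs, column by column.
# Trees are nested 2-tuples of leaf names; sequences are hypothetical.

def fitch_score(tree, states):
    """Return (possible ancestral states, number of changes) for one column."""
    if isinstance(tree, str):                  # a leaf: its observed state
        return {states[tree]}, 0
    left_set, left_cost = fitch_score(tree[0], states)
    right_set, right_cost = fitch_score(tree[1], states)
    common = left_set & right_set
    if common:                                 # children agree: no extra change
        return common, left_cost + right_cost
    # children disagree: one more change is required at this node
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the changes over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_score(tree, column)[1]
    return total


alignment = {"a": "AAGT", "b": "AAGC", "c": "ACGC", "d": "ACTC"}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
print(tree_length(tree_1, alignment), tree_length(tree_2, alignment))
```

For this toy alignment the first tree needs fewer changes than the second, so it would be preferred under the parsimony criterion.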
Distance Method
In distance matrix methods, all possible sequence alignments are carried out to
determine the most closely related sequences, and phylogenetic trees are
constructed on the basis of these distance measurements.
The distance method employs the number of changes between each pair
in a group of sequences to produce a phylogenetic tree of the group. The
sequence pairs that have the smallest number of sequence changes between
them are termed ‘neighbors’. On a tree, these sequences share a node or
common ancestor position and are each joined to that node by a branch.
The goal of distance methods is to identify a tree that positions the
neighbors correctly and that also has branch lengths which reproduce the
original data as closely as possible. The success of distance methods depends
on the degree to which the distances among a set of sequences can be made
additive on a predicted evolutionary tree. The most commonly applied
distance based methods are the unweighted pair group method with
arithmetic mean (UPGMA), neighbor-joining (NJ) and methods that optimize
the additivity of a distance tree, including the minimum evolution (ME)
method. Distance analysis programs in PHYLIP are FITCH, KITSCH and
NEIGHBOR.
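As an illustration of the first step of a distance method, the sketch below counts the changes between every pair of aligned sequences and reports the pair with the fewest changes, i.e. the first 'neighbors'. The sequences and the function names are hypothetical, and no correction for multiple substitutions at the same site is attempted.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def distance_table(alignment):
    """Map each unordered pair of names to its number of differences."""
    return {
        (a, b): count_differences(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }


def closest_pair(table):
    """The pair with the fewest changes, i.e. the first neighbors to join."""
    return min(table, key=table.get)


alignment = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTACGTAA",
    "seq3": "ACGAACCTAA",
    "seq4": "TCGAACCTGA",
}
table = distance_table(alignment)
print(table)
print(closest_pair(table))
```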
Maximum Likelihood Method

The maximum likelihood method uses probability calculations to find a tree
that accounts best for the variation in a set of sequences. This method is similar
to the maximum parsimony method in that the analysis is performed on each
column of a multiple sequence alignment. All possible trees are considered.
For each tree, the number of sequence changes or mutations that may
have occurred to give the sequence variation is considered. Because the rate of
appearance of new mutations is very small, the more mutations needed to fit a
tree to the data, the less likely that tree. Trees with the fewest changes
will be the most likely.
Maximum likelihood method incorporates an expected model of
sequence changes and weighs the probability of any residue being converted
into any other. PHYLIP includes two programs, DNAML and
DNAMLK, for maximum likelihood analysis.
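A minimal sketch of the idea for the simplest possible case, two aligned sequences, is given below. It assumes the Jukes-Cantor model, in which every base is equally likely to change into any other; this model is chosen here only for illustration and is not necessarily the one used by DNAML. The sequences are hypothetical, and the distance that maximizes the likelihood is found by a simple grid search.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences separated by distance d
    (expected substitutions per site) under the Jukes-Cantor model."""
    same = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    diff = len(seq_a) - same
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay     # base unchanged after distance d
    p_diff = 0.25 - 0.25 * decay     # base replaced by one specific other base
    # 0.25 is the assumed equal frequency of each base
    return same * math.log(0.25 * p_same) + diff * math.log(0.25 * p_diff)


def best_distance(seq_a, seq_b, grid):
    """Grid value of d with the highest likelihood."""
    return max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))


seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGAACGTACCTACGT"
grid = [i / 1000 for i in range(1, 1001)]
d_hat = best_distance(seq_a, seq_b, grid)
print(d_hat)
```

For a full tree the same calculation is repeated over every column and every branch, which is why the method is computationally demanding.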
Criteria for Phylogenetic Analysis

For phylogenetic analysis, many different criteria can be used, such as
morphological characteristics, biochemical properties and data from nucleic
acid and protein sequences. Nucleic acid and protein sequence data are very
useful for comparison because they provide a large and unbiased data set,
which extends across all known organisms, allowing the comparison of both
closely related and distantly related taxa.
The relatedness between sequences is usually quantified objectively
using sequence alignment algorithms. Macromolecules, especially sequences,
have surpassed morphological and other organismal characters as the most
popular form of data for phylogenetic or cladistic analysis.
Steps in Phylogenetic Analysis

Phylogenetic analysis consists of four steps:
(i) Alignment (both building the data model and extracting a phylogenetic
dataset)
(ii) Determining the substitution model
(iii) Tree building
(iv) Tree evaluation
Bootstrap
Bootstrapping is a resampling tree evaluation method that works with
distance, parsimony, likelihood and with any other tree derivation method.
The result of bootstrap analysis is typically a number associated with a
particular branch in the phylogenetic tree that gives the proportion of
bootstrap replicates that support the monophyly of the clade.
Bootstrapping can be considered a two-step process comprising the
generation of new data sets from the original set and the computation of a
number that gives the proportion of times that particular branch appeared in
the tree. That number is commonly referred to as bootstrap value. Bootstrap
value is considered to be a measure of accuracy. Based on simulation studies it
has been suggested that under favourable conditions (roughly equal rates of
change, symmetric branches), bootstrap values greater than 70% correspond to
a probability of greater than 95% that the true phylogeny has been found.
Jackknife is another resampling technique like bootstrap. Parametric bootstrap
uses simulated, yet actual-sized, replicates. It can be used in conjunction with any tree
building method.
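The two steps of bootstrapping can be sketched as follows. New data sets are generated by drawing alignment columns with replacement, and the bootstrap value is the percentage of replicates in which a chosen grouping reappears. To keep the example short, "the pair of sequences with the fewest differences" is used as a stand-in for a real tree-building step; this simplification, the function names and the sequences are all assumptions made for illustration.

```python
import random


def resample_columns(alignment, rng):
    """One bootstrap replicate: draw columns with replacement, same length."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


def nearest_pair(alignment):
    """Stand-in for tree building: the pair with the fewest differences."""
    names = sorted(alignment)
    best_pair, best_d = None, None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = sum(x != y for x, y in zip(alignment[a], alignment[b]))
            if best_d is None or d < best_d:
                best_pair, best_d = (a, b), d
    return best_pair


def bootstrap_value(alignment, pair, replicates=200, seed=1):
    """Percentage of replicates in which `pair` is again the nearest pair."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        if nearest_pair(resample_columns(alignment, rng)) == pair:
            hits += 1
    return 100.0 * hits / replicates


alignment = {
    "human":  "ACGTACGTACGTACGTACGT",
    "chimp":  "ACGTACGTACGTACGAACGT",
    "rhesus": "ACGAACGTTCGTACGAACCT",
}
print(bootstrap_value(alignment, ("chimp", "human")))
```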
8.2.2 Phylogenetic Trees

Usually phylogenetic relationships are described as trees (dendrograms). The
clearest way to visualize the evolutionary relationships among organisms is to
use a graph. A graph is a simple diagram (abstract structure) used to show
relationships between entities, such as numbers, objects or places. Entities are
represented by nodes and relationships between them are shown as links or
edges (connecting lines). In phylogenetic trees, nodes represent different
organisms and links are used to show lines of descent.
In computer language, a tree is a particular kind of graph. A graph is a
structure containing nodes (abstract points) connected by edges (represented
as lines between the points). A path from one node to another is a consecutive
set of edges beginning at one point and ending at the other. A connected graph
is a graph containing at least one path between any two nodes. A tree is a
connected graph in which there is exactly one path between every two points.
A particular node may be selected as a root. Abstract trees may be rooted
or unrooted. Unrooted trees show the topology of relationship but not the
pattern of descent. A rooted tree in which every internal node has two
descendants is called a binary tree. Another special kind of graph is a directed
graph in which each edge is a one-way street. Rooted phylogenetic trees are
implicitly directed graphs, the ancestor–descendant relationship implying the
direction of each
edge (Fig. 8.3).
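The definition of a tree as a connected graph with exactly one path between every two nodes can be checked mechanically. The sketch below relies on the standard equivalent statement that a graph with n nodes is a tree when it is connected and has exactly n - 1 edges. The node labels and the grouping of human with chimpanzee are assumptions made only for the example.

```python
from collections import defaultdict


def is_tree(nodes, edges):
    """True if the undirected graph is connected and has len(nodes) - 1 edges."""
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:                      # walk along edges from the start node
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == set(nodes)         # every node reached: connected


nodes = ["ancestor", "inner", "human", "chimpanzee", "gorilla"]
edges = [
    ("ancestor", "inner"),
    ("ancestor", "gorilla"),
    ("inner", "human"),
    ("inner", "chimpanzee"),
]
print(is_tree(nodes, edges))
print(is_tree(nodes, edges + [("human", "chimpanzee")]))
```

The second call adds an extra edge, which creates two different paths between human and chimpanzee, so the graph is no longer a tree.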
[Figure: (a) rooted tree joining Human, Chimpanzee and Gorilla to a Primate ancestor; (b) rooted tree joining Eukarya, Archaea and Bacteria to the LCA]
Fig. 8.3 Rooted trees for (a) three great apes with an unspecified primate ancestor, and
(b) the three major forms of life on this planet. Archaea were previously called
archaebacteria. Bacteria were previously called eubacteria and by Eukarya we refer to the
nuclear-cytoplasmic system in eukaryotes (organelles are ignored). LCA is the last common
ancestor of all life on this planet. (Source: D.R. Westhead et al., Instant Notes:
Bioinformatics, Bios Scientific Publishers Ltd., 2003)
It may be possible to assign numbers to the edges of a graph to signify, in
some sense, a ‘distance’ between the nodes connected by the edges. The graph
may then be drawn to scale, with the sizes of the edges proportional to the
assigned lengths. The length of a path through the graph is the sum of the edge
lengths. In phylogenetic trees, edge lengths signify either some measure of the
dissimilarity between two species, or the length of time since their separation
(Figure 8.4).
[Figure: tree of the deuterostomes with tips Echinoderms (Starfish), Urochordates (Tunicate worms), Cephalochordates (Amphioxus), Jawless fish (Lamprey, Hagfish), Cartilaginous fish (Shark), Bony fish (Zebrafish), Amphibians (Frog), Mammals (Human), Reptiles (Lizard) and Birds (Chicken)]
Fig. 8.4 Phylogenetic tree of vertebrates and our closest relatives. Chordates, including
vertebrates, and echinoderms are all deuterostomes (Source: Lesk, A.M., Introduction to
Bioinformatics, Oxford University Press, 2003).
Special Features of Trees
Trees have some special features:
(i) Nodes are of two types – ancestral and terminal (leaves, tips). Ancestral
nodes may or may not correspond to a known species. Ancestral nodes
give rise to branches. They may link to other ancestral nodes, or they
may link to terminal nodes which represent known species. Terminal
nodes mark the end of the evolutionary pathway.
(ii) Trees may be rooted or unrooted. When the position of the ancestor is
indicated, it is called rooted tree. When the position of the ancestor is
not indicated, it is called unrooted tree.
(iii) Each tree is binary. Evolution of species is represented as a series of
bifurcations.
(iv) The length of the branches may or may not be significant.
8.2.3 Tree-building Methods

Tree-building methods can be grouped into distance-based and
character-based methods.
Distance-based methods
Distance-based methods compute pairwise distances according to some
measure and then discard the actual data, using only the fixed distances to
derive trees. Character-based methods, by contrast, derive trees that optimize
the distribution of the actual data patterns for each character; here the
pairwise distances are not fixed but are determined by the tree topology.
Distance-based methods use the amount of dissimilarity (distance)
between two aligned sequences to derive trees. A distance method would
reconstruct the true tree if all genetic divergence events were accurately
recorded in the sequence. However, divergence encounters an upper limit as
sequences become mutationally saturated. Unweighted Pair Group Method
with Arithmetic Mean (UPGMA) is a clustering or phenetic algorithm. It joins
tree branches based on the criterion of greatest similarity among pairs and
averages of joined pairs.
The Neighbor Joining (NJ) algorithm is commonly applied in distance tree
building, regardless of the optimization criterion. The Fitch-Margoliash (FM)
method seeks to maximize the fit of the observed pairwise distances to a tree by
minimizing the squared deviation of all possible observed distances relative to
all possible path lengths on the tree. Minimum Evolution (ME) method seeks to
find the shortest tree that is consistent with the path lengths measured in a
manner similar to FM.
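The fit between observed distances and path lengths on a tree can be illustrated with a short sketch. It sums the squared deviations between each observed pairwise distance and the corresponding tip-to-tip path length, where a path length is the sum of the branch lengths along the path. The plain, unweighted sum is an assumption made for simplicity and is not a full Fitch-Margoliash implementation; the trees, branch lengths and distances are hypothetical.

```python
def path_lengths(tree_edges, tips):
    """Tip-to-tip path lengths in a tree given as {(node, node): length}."""
    adjacent = {}
    for (a, b), length in tree_edges.items():
        adjacent.setdefault(a, []).append((b, length))
        adjacent.setdefault(b, []).append((a, length))

    def walk(start):
        dist, stack = {start: 0.0}, [start]
        while stack:
            node = stack.pop()
            for nxt, length in adjacent[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + length
                    stack.append(nxt)
        return dist

    result = {}
    for i, a in enumerate(tips):
        from_a = walk(a)
        for b in tips[i + 1:]:
            result[(a, b)] = from_a[b]
    return result


def squared_deviation(observed, tree_edges, tips):
    """Sum of squared differences between observed and tree distances."""
    fitted = path_lengths(tree_edges, tips)
    return sum((observed[pair] - fitted[pair]) ** 2 for pair in fitted)


tips = ["a", "b", "c", "d"]
observed = {("a", "b"): 3, ("a", "c"): 5, ("a", "d"): 6,
            ("b", "c"): 6, ("b", "d"): 7, ("c", "d"): 3}
# x and y are unnamed ancestral nodes
tree_1 = {("a", "x"): 1.0, ("b", "x"): 2.0, ("x", "y"): 3.0,
          ("c", "y"): 1.0, ("d", "y"): 2.0}
tree_2 = {("a", "x"): 1.0, ("c", "x"): 1.0, ("x", "y"): 3.0,
          ("b", "y"): 2.0, ("d", "y"): 2.0}
print(squared_deviation(observed, tree_1, tips))
print(squared_deviation(observed, tree_2, tips))
```

The first tree reproduces the observed distances exactly (the distances are additive on it), whereas the second tree leaves a large deviation and would be rejected.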
Character-based Methods
The character-based methods use character data at all steps in the analysis.
This allows the assessment of the reliability of each base position in an
alignment on the basis of all other base positions. The principle of maximum
parsimony (MP) method is to search for a tree that requires the smallest
number of changes to explain the differences observed among the taxa under
study. The MP method defines an optimal tree as the one that postulates the
fewest mutations.
The maximum likelihood (ML) method evaluates trees under an explicit
model of sequence change; in the simplest case, changes between all
nucleotides (or amino acids) are assumed to be equally probable. The ML method
assigns quantitative probabilities
to mutational events rather than merely counting them. For each possible tree
topology, the assumed substitution rates are varied to find the parameters that
give the highest likelihood of producing the observed sequences. The optimal
tree is the one with the highest likelihood of generating the observed data.
Models
Phylogenetic tree-building methods presume particular evolutionary models.
Models inherent in phylogenetic methods have some important assumptions:
1. The sequence is correct and originates from a specified source.
2. The sequences are homologous (i.e. all are descended in some way from
a shared ancestral sequence).
3. Each position in a sequence alignment is homologous with every other
in that alignment.
4. Each of the multiple sequences included in a common analysis has a
common phylogenetic history with the others (e.g. there are no mixtures
of nuclear and organellar sequences).
5. The sampling of taxa is adequate to resolve the problem of interest.
6. Sequence variation among the samples is representative of the broader
group of interest.
7. The sequence variability in the sample contains phylogenetic signal
adequate to resolve the problem of interest.
Similarity Table and Distance Table

Phylogenetic trees can be constructed from either similarity tables or distance
tables, which show the resemblance among organisms for a given set of
characters (Fig. 8.5). Usually the numbers in a similarity table show the
percentage of matches. Such data form the basis of Adansonian analysis or
numerical taxonomy. The numbers in the distance table show percentage of
differences.
Some of the most commonly used methods for tree building in
phylogenetic analysis involve agglomerative hierarchical clustering based on
distance matrices. The essential basis for this type of algorithm is that the taxa
represented in a distance table are merged, two clusters at each step, until
only one cluster remains. There are several such distance matrix algorithms,
including single linkage, complete linkage, average linkage and the centroid method.
(a) Similarity table
     a    b    c    d    e
a  100   65   50   50   50
b   65  100   50   50   50
c   50   50  100   97   65
d   50   50   97  100   65
e   50   50   65   65  100

(b) Distance table
     a    b    c    d    e
a    0    6   11   11   11
b    6    0   11   11   11
c   11   11    0    2    6
d   11   11    2    0    6
e   11   11    6    6    0
Fig. 8.5 Hypothetical (a) similarity table and (b) distance table for five organisms,
a-e. (Source: D.R. Westhead et al., Instant Notes: Bioinformatics, Bios Scientific Publishers
Ltd., 2003).
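The agglomerative clustering described above can be sketched in Python using the distance table (b) of Fig. 8.5. The sketch uses average linkage, weighted by cluster size as in UPGMA; the function name and the way merged clusters are labelled are assumptions made for illustration, and ties between equally close pairs are broken arbitrarily.

```python
from itertools import combinations


def upgma(names, dist):
    """Average-linkage clustering.

    names -- list of taxon labels
    dist  -- dict mapping frozenset({x, y}) to the distance between x and y
    Returns the list of merges as (cluster_1, cluster_2, distance).
    """
    clusters = {name: (name,) for name in names}   # label -> member taxa
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # pick the closest pair of current clusters
        x, y = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        merges.append((x, y, d[frozenset((x, y))]))
        new = "(" + x + "," + y + ")"
        members = clusters[x] + clusters[y]
        for other in clusters:
            if other in (x, y):
                continue
            # distance to the new cluster: average weighted by cluster sizes
            d[frozenset((new, other))] = (
                len(clusters[x]) * d[frozenset((x, other))]
                + len(clusters[y]) * d[frozenset((y, other))]
            ) / len(members)
        del clusters[x], clusters[y]
        clusters[new] = members
    return merges


names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 6, 11, 11, 11],
    [6, 0, 11, 11, 11],
    [11, 11, 0, 2, 6],
    [11, 11, 2, 0, 6],
    [11, 11, 6, 6, 0],
]
dist = {frozenset((names[i], names[j])): rows[i][j]
        for i in range(5) for j in range(i + 1, 5)}
for step in upgma(names, dist):
    print(step)
```

With these data, c and d are joined first, e then joins the (c,d) cluster, a and b form a second cluster, and the two clusters are merged last.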
Aligning According to Sequence and Structure

As more genomes are sequenced, we are interested to learn more about protein
or gene evolution. Studies of protein and gene evolution involve the
comparison of homologs, i.e., sequences that have common origin but may or
may not have common activity.
The simple principle behind the phylogenetic analysis of sequences is
that the greater the similarity between two sequences, the fewer mutations are
required to convert one sequence into the other, and thus the more recently
they shared a common ancestor.
Phylogenetic sequence data usually consist of multiple sequence
alignments. The individual, aligned-base positions are commonly referred to
as sites. These sites are equivalent to characters in theoretical phylogenetic
discussions and the actual base (or gap) occupying a site is the character state.
Aligned sequence positions subjected to phylogenetic analysis represent
a priori phylogenetic conclusions because the sites themselves (not the actual
bases) are effectively assumed to be genealogically related or homologous.
Steps in building the alignment include selection of the alignment
procedure and extraction of a phylogenetic data set from the alignment. A
typical alignment procedure involves the application of a program such as
CLUSTAL W, followed by manual alignment editing and submission to a tree-
building program.
Aligning according to secondary or tertiary sequence structure is
considered phylogenetically more reliable than sequence-based alignment
because confidence in homology assessment is greater when comparisons are
made to complex structures rather than to simple characters (primary
sequence).
Multiple Sequence Alignment and Phylogenetic Tree Construction using ClustalX
The Clustal series of programs are widely used in molecular biology for
the multiple sequence alignment of both nucleic acid and protein sequences
and for preparing phylogenetic trees. The first Clustal program was written by
Des Higgins in 1988. It was designed specifically to work efficiently on
personal computers. It has now given rise to a number of developments,
including ClustalX.
ClustalX is a windows interface for ClustalW. It provides an
integrated environment for performing multiple sequence and profile
alignments and analyzing the results. The program displays the multiple
alignment in a scrollable window and all parameters are available using pull-
down menus. Within alignments, conserved columns are highlighted using a
customizable color scheme and quality analysis tools are available to highlight
potentially misaligned regions.
ClustalX is easy to install. It is user-friendly. It maintains the portability
of the previous generations through the NCBI Vibrant toolkit
(ftp://ncbi.nlm.nih.gov/toolbox/ncbitools/). Numerous options such as the
realignment of selected sequences or selected blocks of the alignment and the
possibility of building up difficult alignments piecemeal are available. It
includes other features such as NEXUS and FASTA format output, printing
range numbers and faster tree calculation. The accuracy of the results,
robustness, portability and user-friendliness of the program are attractive
features.
ClustalX can be downloaded from PCBLAB Bioinformatics links using
the following URL address:
http://www-igbmc.U-stasbg.fr/BioInfo/ClustalX/Tophtml. After downloading, click on the
ClustalX package, double-click the ClustalX folder, open the blue navigation menu
and click the ClustalX menu item. The program window will then appear.
ClustalX is available for a number of platforms such as SUN Solaris,
IRIX5.3 on Silicon Graphics, Digital UNIX on DECstations, Microsoft
Windows (32 bit) for PCs, Linux ELF for x86 PCs, and Macintosh PowerMac.
Procedure
The following steps can be followed to align sequences and construct the
phylogenetic tree using ClustalX:
1. Open ClustalX
2. Load the sequences saved in the FASTA format (Entrez session) using the
File menu. Click the yellow ClustalX logo, then click File > Load
Sequences. The dialogue box will appear. Give the correct path, open
the sequence file and press Enter.
3. Scroll through the sequences to see the matches before alignment.
4. Go to the Alignment menu and click Do Complete Alignment.
5. Save the alignment files (*.dnd and *.aln).
6. Scroll again and see the matches by noting the symbol code and the
histogram.
7. Go to the Trees menu and select Draw N-J Tree. It will
create a tree file with the .ph extension. This file opens with NJ Plot.
8. Save the resultant tree file (*.ph)
9. Close ClustalX
10. Open NJ Plot
11. Open the tree constructed using ClustalX (*.ph)
12. Observe the phylogenetic relationship between the sequences.
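The tree file can also be inspected outside NJ Plot. The sketch below assumes that the .ph file holds a tree in the parenthesised Newick (PHYLIP) notation with unlabelled internal nodes, and simply lists each sequence name with the length of the branch leading to it. The file content shown is a hypothetical example, not real ClustalX output.

```python
import re


def read_tip_lengths(newick_text):
    """Return {sequence name: branch length} from a Newick tree string.

    Assumes internal nodes carry no labels, so every 'name:length'
    token belongs to a tip of the tree.
    """
    pairs = re.findall(r"([^(),:;\s]+):([-+0-9.eE]+)", newick_text)
    return {name: float(length) for name, length in pairs}


# Hypothetical content of a *.ph file
example_ph = """(
(
human:0.010,
chimp:0.020)
:0.030,
rhesus:0.050,
horse:0.100);"""

tips = read_tip_lengths(example_ph)
for name, length in tips.items():
    print(name, length)
```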
8.3 MOLECULAR APPROACHES TO PHYLOGENY

Molecular approaches to phylogeny developed against a background of
traditional taxonomy. Many molecular properties have been used for
phylogenetic studies. In 1967, based on immunological data, V.M. Sarich and
A.C. Wilson announced that the divergence of humans from chimpanzees took
place 5 million years ago (Fig. 8.6). This was in contrast to paleontologists who
dated the split at 15 million years ago. In 1909, E.T. Reichert and A.P. Brown
published a phylogenetic analysis of fishes based on hemoglobin crystals.
[Figure: upper tree with tips Human beta, Horse beta and Chimp alpha; lower tree with tips Human beta, Chimp beta, Horse beta, Human alpha, Chimp alpha and Horse alpha]
Fig. 8.6 Two trees generated from hemoglobin sequences from human, chimpanzee and
horse. The lower tree is correct, indicating the correct phylogeny for both α and
β hemoglobin chains. The upper tree is confusing because it is formed from human and
horse β chains and the chimpanzee α chain, creating the impression that horse is closer to
human than chimpanzee (Source: D.R. Westhead et al., Instant Notes: Bioinformatics, Bios
Scientific Publishers Ltd., 2003)
Today, DNA sequences provide the best measures of similarities among
species for phylogenetic analysis. The data are digital. It is even possible to
distinguish selective from non-selective genetic change, using the third
position in codons, or untranslated regions such as pseudogenes, or the ratio of
synonymous to non-synonymous codon substitutions. Many genes are
available for comparison. Given a set of species to be studied, it is necessary to
find genes that vary at an appropriate rate. Genes that remain almost constant
among the species of interest provide no discrimination of degrees of
similarity. Genes that vary too much cannot be aligned.
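The distinction between synonymous and non-synonymous codon substitutions can be illustrated with a short sketch. It assumes the standard genetic code and two aligned, in-frame coding sequences without gaps, and it only counts differing codons; it is not a full estimate of substitution rates. The sequences and the function name are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_changes(coding_a, coding_b):
    """Count (synonymous, non_synonymous) codon differences."""
    if len(coding_a) != len(coding_b) or len(coding_a) % 3:
        raise ValueError("need aligned, in-frame sequences of equal length")
    synonymous = non_synonymous = 0
    for i in range(0, len(coding_a), 3):
        codon_a, codon_b = coding_a[i:i + 3], coding_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1          # same amino acid: silent change
        else:
            non_synonymous += 1      # amino acid replaced
    return synonymous, non_synonymous


# GCT -> GCC keeps alanine; AAA -> AGA replaces lysine by arginine
print(classify_codon_changes("ATGGCTAAAGGT", "ATGGCCAGAGGT"))
```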
Molecular phylogenies are very informative compared to those based on
traditional or morphological characters because they are wider in scope (it is
possible to compare flowering plants and mammals using protein sequences,
but not using morphological characters) and because data handling is consistent
and objective.
Macromolecular Sequences
Different macromolecular sequences evolve at different rates, even sequences
in different regions of the same molecule. Residues in an RNA or protein that
have a critical structural or functional role in the molecule can accommodate
mutations less easily than those in other regions. The rate at which a
particular sequence evolves depends largely on the proportion of residues
whose substitution would adversely affect normal structure and function.
Mitochondrial DNA
A useful macromolecular sequence for the study of primates is mitochondrial
DNA (mtDNA). As a consequence of respiratory metabolism, there is a higher
concentration of active oxygen species (such as superoxide and the hydroxyl
radical) in the mitochondria than in the nucleus and consequently a higher
chance of oxidative chemical lesions in mitochondrial DNA. Further, the
mtDNA polymerase is more error-prone than the nuclear enzyme. Therefore,
mtDNA evolves more quickly than nuclear DNA due to an increased intrinsic
mutation rate.
There is a short noncoding region in primate mtDNA where selective
constraints are low, since point mutations tend not to affect mitochondrial
function. This particular sequence evolves at a suitable rate to study primate
phylogeny. The tree in Fig. 8.3 is consistent with the alignment and clustering
of this region, and with such analyses of coding genes in mtDNA.
Ribosomal RNA
Ribosomal RNA (rRNA) is a highly conserved ubiquitous molecule in all
living organisms (animals, plants, fungi, bacteria, parasites, etc.). It has a low
tolerance for mutations and evolves very slowly. The abundant secondary
structure of rRNA ensures that the rate of evolutionary change is slow, since
compensating base changes are required in double helical regions. The tree in
Figure 8.7 is consistent with the alignment and clustering of this molecule and
the conclusions are compatible with those of other macromolecular studies.
[Figure: tree of the three domains.
Bacteria: Green non-sulphur bacteria, Gram-positive bacteria, Purple bacteria, Cyanobacteria, Flavobacteria, Thermotoga, Aquifex.
Archaea: Extreme halophiles, Methanobacterium, Methanococcus, Thermoplasma, Pyrodictium, Thermococcus, Thermoproteus.
Eukarya: Animals, Slime molds, Fungi, Plants, Entamoebae, Ciliates, Flagellates, Trichomonads, Diplomonads.]
Fig. 8.7 Major divisions of living things, derived by C. Woese on the basis of 15S RNA
sequences (Source: Lesk, A.M., Introduction to Bioinformatics, Oxford University Press)
8.4 PHYLOGENETIC ANALYSIS DATABASES

PAUP (Phylogenetic Analysis Using Parsimony) and PHYLIP (Phylogenetic
Inference Package) are versatile programs for phylogenetic analysis. PAUP
provides a phylogenetic program that includes as many functions (including
tree graphics) as possible in a single, platform-independent program with a
menu interface.
PHYLIP consists of about 30 programs that cover most aspects of
phylogenetic analysis. It is a command-line program; it does not have a
point-and-click interface. The interface is straightforward.
PALI (Phylogeny and ALIgnment of homologous protein structures) is
a database containing 3D structure based sequence alignments and structure
based phylogenetic trees of homologous protein domains in protein families.
Two types of dendrograms are used to represent the relationships – one is
based on a structural dissimilarity metric defined for pairwise alignment
(sequence based) and the other is based on similarity of topologically
equivalent residues (structure based). SUPFAM is a database of potential
superfamily relationships derived by comparing sequence-based and
structure-based families. PASS2 is a semi-automated database of Protein
Alignment organized as Structural Superfamilies.
STUDY QUESTIONS
1. What did Charles Darwin observe in the Galapagos finches?
2. How do you distinguish homology and similarity?
3. How do you distinguish ortholog, paralog and xenolog?
4. What are modules?
5. What is phylogeny?
6. What is phenetic approach?
7. What are the special features of cladistics?
8. What is a node?
9. What is a phylogenetic tree?
10. What are rooted and unrooted trees?
11. What are the special features of a phylogenetic tree?
12. What are the presumptions of phylogenetic tree-building?
13. What are the different methods used in phylogenetics?
14. How is molecular phylogenetics superior to traditional phylogenetics?
15. What are the databases used in phylogenetic analysis?
16. What is bootstrapping?
CHAPTER 9
Drug Discovery and Pharmainformatics
A drug is a molecule that interacts with a target biological molecule in the
body and through such interaction triggers a physiological effect. The target
molecules are usually proteins. Drugs can be beneficial or harmful depending
on their effect. The aim of the pharmaceutical industry is to discover drugs with
specific beneficial effects to treat diseases, especially in humans.
To qualify as a drug, a chemical compound should have the following
characteristics: it should be safe, effective, stable (both chemically and
metabolically), deliverable (should be absorbed and make its way to its site of
action), available (by isolation from natural sources or by synthesis) and novel
(patentable).
9.1 DISCOVERING A DRUG

A drug can be discovered by two methods: the empirical and the
rational. The empirical method is a blind, hit-or-miss method; it is also called
the black box method. Thousands of chemical compounds are tested on the
disease without even knowing the target on which the drug acts and the
mechanism of action. Occasionally a serendipitous discovery like the
discovery of penicillin may come up.
Approaches
Usually thousands of chemical compounds are tested for drug action. One out
of 10,000 may hit the target. In this type of approach, no one knows initially
which target the drug attacks and the mechanism involved in the attack.
The rational approach starts from the clear knowledge of the target as well as the
mechanism by which it is to be attacked. Drug discovery involves finding the
target and arriving at the lead. Target refers to the causal agent of the disease
and lead refers to the active molecule which will interact with the causal agent.
When diseases are treated with drugs they interact with targets that
contribute to the disease and try to control their contribution thus producing
positive effects. The disease target may be endogenous (a protein synthesized
by the individual to whom the drug is administered) or, in the case of
infectious diseases, may be produced by a pathogenic organism. Drugs act
either by stimulating or blocking the activity of the target protein.
9.1.1 Target Identification and Validation

Developing a drug is not easy. It is a complex, lengthy and expensive
process. Drug development begins with the identification of a potentially
suitable disease target. This process is called target identification. One has to
study what is known about the disease, its possible causes, its symptoms, its
genetics, its epidemiology, its relationship to other diseases (human and
animal) and all known treatments.
The biology of the disease (cause of illness, the spread of the disease in
the population, the development of the disease inside the patient, the
biochemical and physiological changes in the patients, etc.) has to be
ascertained. In the past, target identification was based largely on medical
need. Presently, target identification depends not only on medical need but
also on factors such as the success of existing therapies, the activity of
competing drug companies and commercial opportunities.
Types of Targets
The targets for drugs are usually biomolecules, such as enzymes,
receptors or ion channels. The validity of an enzyme as a target depends upon
how important it is for the survival of the pathogen. If it is less
significant, then the target has no value. If the drug target is located inside the
human system, the fluctuation of the target activity must correspond to the
fluctuation of the disease severity. Only when we are able to establish a high
level of significance in the regulation of the target for effective disease control,
the target will have relevance to the disease. Once the target is confirmed, we
can identify the modulators of the target. There are positive modulators and
negative modulators (Table 9.1).
Table 9.1: List of positive and negative modulators
Biomolecules    Positive modulators    Negative modulators
Enzymes         Activators             Inhibitors
Receptors       Agonists               Antagonists
Ion channels    Openers                Blockers
Validation
Once the target is identified, it has to be validated. This process is called target
validation. It involves extensive testing of the target molecule’s therapeutic
potential. Validation may include the creation of animal disease models, and
the analysis of gene and protein expression data. By comparing the levels of
gene expression in normal and disease states, novel drug targets can be
identified in silico. Microarray techniques can be used for this.
Once the gene which is ‘up or down regulated’ (expressed in higher or
lower level than in normal tissue) in a disease state is identified, its nature can
be identified using bioinformatic tools. Similar genes or proteins can be traced
using BLAST from the sequence database. Similar genes and proteins will help
to deduce the function of the up or down regulated gene. If the target happens
to be one of a highly tractable structure class (such as receptors, enzymes or
ion channels), the drug designing will be easier.
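The comparison of expression levels in normal and disease states can be sketched as follows. Each gene is labelled by the base-2 logarithm of the ratio of its disease level to its normal level. The twofold threshold, the gene names and the expression values are hypothetical choices made for illustration only; real microarray data would also need normalization and statistical testing.

```python
import math


def classify_expression(normal, disease, threshold=1.0):
    """Label each gene 'up', 'down' or 'unchanged' by its log2 fold change.

    threshold=1.0 corresponds to a twofold change (an assumed cut-off).
    """
    labels = {}
    for gene in normal:
        log2_fold_change = math.log2(disease[gene] / normal[gene])
        if log2_fold_change >= threshold:
            labels[gene] = "up"
        elif log2_fold_change <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical expression levels (arbitrary units)
normal = {"geneA": 100.0, "geneB": 80.0, "geneC": 50.0}
disease = {"geneA": 400.0, "geneB": 20.0, "geneC": 55.0}
print(classify_expression(normal, disease))
```

The sequences of the genes labelled 'up' or 'down' are the ones that would then be compared against the sequence database.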
A valid target must have a high therapeutic index, that is, a significant
therapeutic gain must be predicted through the use of such a drug. If a known
protein is the target, binding can be measured directly. A potential anti-
bacterial drug can be tested by its effect on growth of the pathogen. Some
compounds might be tested for effects on eukaryotic cells grown in tissue
culture. If a laboratory animal is susceptible to the disease, compounds can be
tested on animals.
Characters
If the target happens to be an enzyme, the following characters are studied: the
active site, the amino acids associated in the formation of active site, presence
or absence of metal component, number of hydrogen donors and acceptors
present in the active site, the topology of the active site, and the details about
hydrophobic and hydrophilic amino acids present in the active site.
If the target happens to be a biochemical substance or a substrate of an
enzyme, the following details are collected: size of the molecule, chemical
nature, groups that show hydrogen donor or acceptor capacity, its metabolic
byproducts and how this compound can be modified chemically.
9.1.2 Identifying the Lead Compound

Once a target has been validated, the search begins for drugs that interact with
the target. This process is called lead discovery, and involves the search for
lead compounds, that is, substances with some of the desired biological
activity of the ideal drug.
Qualities
A lead molecule should have the following desirable qualities: (a) potency
(able to modulate the target effectively), (b) solubility (it should be easily
soluble in water for quicker action), (c) moderate lipophilicity (ability to
penetrate the plasma membrane), (d) metabolic stability (should not get destroyed
quickly inside the body; a longer half-life is desirable), (e) bioavailability
(quicker absorption into the body and at the same time retained for longer time
for sustained activity), (f) specific protein binding, (g) low or no
toxicity.
Finding Compounds
Lead compounds can be found using some of the following ways:
(i) Serendipity – through chance observations (discovery of penicillin by
Alexander Fleming).
(ii) Survey of natural sources – from traditional medicines (quinine from
Cinchona bark).
(iii) Study of what is known about substrates or ligands or inhibitors and
the mechanism of action of the target protein, and select potentially
active compounds from these properties.
(iv) Trying drugs effective against similar diseases.
(v) Large-scale screening of related compounds.
(vi) Occasionally from side effects of existing drugs.
(vii) Screening of thousands of compounds.
(viii) Computer screening and ab initio computer design.
9.1.3 Optimization of Lead Compound

Once a lead compound is found, it must be optimized. Lead optimization
involves the modification of lead compounds to produce derivatives which are
called candidate drugs with better therapeutic profiles. For example,
deliverability of a drug to a target within the body requires the capacity to be
absorbed and transmitted. It requires metabolic stability. It requires the proper
solubility profile: a drug must be sufficiently water-soluble to be absorbed,
but not so soluble that it is excreted immediately; it must be sufficiently
lipid-soluble to get across membranes, but not so lipid-soluble that it is merely
taken up by fat stores.
Once this is done the candidate drugs are assessed for quality, taking
into account factors such as the ease of synthesis and formulation. After this,
they are registered as an investigational new drug and submitted for clinical
trials. This is the lengthiest and most expensive part of the drug development
process. Due to this most projects are abandoned before this stage. Clinical
trials are designed to determine safety and tolerance levels in humans, and to
discover how the drug is metabolized. Trials are divided into several stages.
Stages
Pre-clinical phase: Studies using animals
Phase I: Normal (healthy) human volunteers
Phase II: Evaluation of safety and efficacy in patients, and selection of dose regimen
Phase III: Large patient number study with placebo or comparator; at this stage regulatory approval is sought
and a commercial launch decision is taken
Phase IV: Long-term monitoring for adverse reactions reported by pharmacists and doctors.
Other Inputs
Drug development has benefited greatly from genomics, proteomics,
combinatorial chemistry and high-throughput screening. Genomics and
proteomics have revolutionized the way target molecules are identified and
validated. Traditionally, drug targets have been characterized on an
individual basis and lead compounds have been sought with specific clinical
effects.
With the advent of genomics, particularly the availability of the entire
human genome sequence and its annotations, thousands of potential new
targets can now be identified by sequence, structure and function.
Bioinformatics is important not only because of its role in the analysis of
sequences and structures, but also in the development of algorithms for the
modeling of target protein interactions with drug molecules. This allows
rational drug design, in which protein structural data is used to predict the
type of ligands that will interact with a given target, and thus form the basis of
lead discovery.
Of late, systematic methods have been used to identify lead compounds. These
methods are based on high throughput screening in which lead discovery is
accelerated through the use of highly parallel assay formats, such as 96-well
plates. In turn, this requires the assembly of large chemical libraries for testing.
This has been made possible by combinatorial chemistry approaches, in
which large numbers of different compounds can be made by pooling and
dividing materials between reaction steps.
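The effect of pooling and dividing between reaction steps can be illustrated with a short sketch. It assumes three reaction steps with three interchangeable building blocks each; the block labels (A1, B1, and so on) are hypothetical placeholders, not real reagents. The library size is simply the product of the number of choices at each step.

```python
from itertools import product

# Hypothetical building blocks for three successive reaction steps.
# The labels are placeholders chosen for illustration only.
STEP_BLOCKS = [
    ["A1", "A2", "A3"],
    ["B1", "B2", "B3"],
    ["C1", "C2", "C3"],
]


def enumerate_library(step_blocks):
    """Return every compound obtained by choosing one block per step."""
    return ["-".join(combo) for combo in product(*step_blocks)]


library = enumerate_library(STEP_BLOCKS)
print(len(library))   # 3 * 3 * 3 = 27 compounds from only 9 building blocks
print(library[:3])    # ['A1-B1-C1', 'A1-B1-C2', 'A1-B1-C3']
```

Adding a fourth step with three blocks would raise the count to 81, which is why library size grows so quickly with the number of steps.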
9.2 PHARMAINFORMATICS
The term pharmainformatics is often used to describe the mix of biology,
chemistry, mathematics and information technology required for data
processing and analysis in the pharmaceutical industry. The scope of
pharmainformatics is summarized in Table 9.2.
Table 9.2: Areas of biology and chemistry where informatics plays a vital role in the drug discovery pipeline.
Application | Role of Bioinformatics
Biology |
Genomics, proteomics (human genome project): characterization of human genes and proteins | Target identification and validation in the human genome; cataloguing single nucleotide polymorphisms and their association with drug response patterns (pharmacogenomics)
Genomics, proteomics (human pathogen genome projects): characterization of the genes and proteins of organisms that are pathogenic to humans | Target identification and validation in pathogens
Functional genomics (protein structure): analysis of protein structures (human and their pathogens) | Prediction of drug/target interactions; rational drug design
Functional genomics (expression profiling): determining gene expression patterns in disease and health | Gene classification based on drug responses; pathway reconstruction
Functional genomics (genome-wide mutagenesis): determining the mutant phenotypes for all genes in the genome | Databases of animal models; target identification and validation
Functional genomics (protein interactions): determining interactions among all proteins | Characterization of protein interactions; reconstruction of pathways; prediction of binding sites
Chemistry |
High throughput screening: highly parallel assay formats for lead identification | Storing, tracking and analyzing data
Combinatorial chemistry: synthesis of large numbers of chemical compounds | Cataloguing chemical libraries; assessing library quality and diversity; predicting drug/target interactions
9.2.1 Chemical Libraries and Search Programs

High throughput screening in drug discovery depends on the availability of
diverse chemical libraries, such as those generated by combinatorial
chemistry, since these maximize the chances of finding molecules that interact
with a particular target protein. It is not easy to quantify chemical diversity.
Attempts have been made to understand this based on the concept of ‘chemical
space’. In essence, chemical space encompasses molecules with all possible
chemical properties in all possible molecular positions. A diverse library
would have broad coverage of chemical space, leaving no gaps and having no
clusters of similar molecules.
Tanimoto Coefficient
Usually library diversity is quantified using measures that compare the
properties of different molecules based on descriptors such as atomic position,
charge and potential to form different types of chemical bond. We can compare
two molecules using the Tanimoto coefficient (Tc), which evaluates the
similarity of fragments of each molecule.
The coefficient is calculated by the formula Tc = c/(a + b - c), where a is
the number of fragment-based descriptors in compound A, b is the number of
fragment-based descriptors in compound B, and c is the number of shared
fragment-based descriptors. Hence, for identical molecules, Tc = 1, while for
molecules with no descriptors in common, Tc = 0. In a chemical library of ideal
diversity, most pairwise comparisons would generate a Tanimoto coefficient
close to zero.
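The calculation can be sketched in a few lines of Python. This is a minimal illustration that treats each molecule as a set of fragment descriptors; the fragment labels and the three example molecules are hypothetical, and returning 0 for two empty descriptor sets is a convention assumed here rather than part of the definition.

```python
from itertools import combinations


def tanimoto(frags_a, frags_b):
    """Tc = c / (a + b - c) for two collections of fragment descriptors."""
    set_a, set_b = set(frags_a), set(frags_b)
    shared = len(set_a & set_b)                  # c
    denom = len(set_a) + len(set_b) - shared     # a + b - c
    # Assumed convention: two empty descriptor sets give Tc = 0.
    return shared / denom if denom else 0.0


def mean_pairwise_tc(library):
    """Average Tc over all pairs; values near zero suggest a diverse library."""
    pairs = list(combinations(library.values(), 2))
    return sum(tanimoto(x, y) for x, y in pairs) / len(pairs)


# Hypothetical fragment descriptors, for illustration only.
library = {
    "mol_a": {"C=O", "OH", "NH2", "ring6"},
    "mol_b": {"C=O", "OH", "Cl"},
    "mol_c": {"SH", "ring5"},
}

print(tanimoto(library["mol_a"], library["mol_b"]))   # 2 / (4 + 3 - 2) = 0.4
print(tanimoto(library["mol_a"], library["mol_a"]))   # identical molecules: 1.0
print(round(mean_pairwise_tc(library), 3))            # (0.4 + 0 + 0) / 3 = 0.133
```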
Pharmacophore
When we do not know much about the binding specificity of the target protein,
diverse libraries will be useful for lead discovery. When only some form of
sequence or structural information is available for the target, this can be used
to design focused libraries that concentrate on one region of chemical space.
For example, if the sequence of a particular target protein is known, then
database homology searching will often find a related protein whose structure
has been solved and whose interactions with small molecules have been
characterized. In these cases, it is possible to design a chemical library based
on a particular molecular scaffold, which preserves a framework of sites present
in a known ligand, but which can be modified with diverse functional groups.
Some of these groups may have previously been shown to be important for
drug binding. Such sites are known as pharmacophores.
Tools
Many tools and resources are available for the design of combinatorial
libraries and the assessment of chemical diversity. A program called Selectors,
available from Tripos, allows the user to design very diverse libraries or
libraries focused on a particular molecular skeleton. Chem-x, developed by the
Oxford Molecular Group, allows the chemical diversity in a collection of
compounds to be measured and identifies all the pharmacophores.
CombiLibMaker, another Tripos program, allows a virtual target.
9.3 SEARCH PROGRAMS

Before starting laboratory-based screening experiments, it is always better to
generate as much information as possible about potential drug/ target
interactions. The computational screening of chemical databases, using a
target molecule of known structure, is one way in which such information can
be obtained. Alternatively, the solved structure of a close homologue may be
used, or the structure may be predicted using a threading algorithm.
Algorithms can be used to identify potential interacting ligands based on
goodness of fit, if the structure of a target protein is known, thus allowing
rational drug design.
Already many docking algorithms have been developed which attempt to
fit small molecules into binding sites using information on steric constraints
and bond energies (Table 9.3).
Table 9.3: Chemical docking software freely available over the internet
URL | R/F | Description | Availability
http://www.scripts.edu/pub/olson-web/dock.autodock/index.html | F | AutoDock | Download for UNIX/LINUX
http://swift.embl-heidelberg.de/lignin/ | R | LIGIN, a robust ligand-protein interaction prediction program limited to small ligands | Download for UNIX or as a part of the WHATIF package
http://www.bmm.icnet.uk/docking/ | R | FTDock and the associated programs RPScore and MultiDock; can deal with protein-protein interactions. Relies on a Fourier transform library | Download for UNIX/LINUX
http://reco3.musc.edu/gramm/ | R | GRAMM (Global Range Molecular Matching), an empirical method based on tables of inter-bond angles. GRAMM has the merit of coping with low-quality structures | Download for UNIX or Windows
http://cartan.gmd.de/flex-bin/FlexX | F | FlexX, which calculates favorable molecular complexes consisting of the ligand bound to the active site of the protein, and ranks the output | Apply on-line for FlexX Workspace on the server
Note: R means rigid; F means flexible. They indicate whether the program regards the ligand as a rigid or
flexible molecule.
Docking Algorithms
One of the most established docking programs is AutoDock. Another widely
used program is DOCK, and a related program is CombiDOCK. In DOCK, the
arrangement of atoms at the binding site is converted into a set of spheres
called site points. The distances between the spheres are used to calculate the
exact dimensions of the binding site, and this is compared to a database of
chemical compounds. Matches between the binding site and a potential ligand
are given a confidence score, and ligands are then ranked according to their
total scores.
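The idea of comparing inter-sphere distances with candidate ligands can be sketched as follows. This is a deliberately simplified illustration and not the actual DOCK algorithm: the scoring rule, the 0.5 tolerance, the coordinates and the ligand names are all assumptions made for the example. Each ligand is scored by how many site-point distances it can reproduce, and the ligands are then ranked by score.

```python
from itertools import combinations
from math import dist


def pair_distances(points):
    """Sorted list of all inter-point distances."""
    return sorted(dist(p, q) for p, q in combinations(points, 2))


def match_score(site_points, ligand_atoms, tol=0.5):
    """Count site distances matched by an unused ligand distance within tol."""
    ligand = pair_distances(ligand_atoms)
    score = 0
    for d in pair_distances(site_points):
        for i, e in enumerate(ligand):
            if abs(d - e) <= tol:
                score += 1
                del ligand[i]      # each ligand distance is used at most once
                break
    return score


def rank_ligands(site_points, ligands, tol=0.5):
    """Return (name, score) pairs, best-scoring ligand first."""
    scores = {name: match_score(site_points, atoms, tol)
              for name, atoms in ligands.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical site points (sphere centres) and ligand atom coordinates.
site = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
ligands = {
    "lig_x": [(0, 0, 0), (3.2, 0, 0), (0, 4.1, 0)],   # similar geometry
    "lig_y": [(0, 0, 0), (1, 0, 0), (0, 1, 0)],       # much smaller
}

print(rank_ligands(site, ligands))   # [('lig_x', 3), ('lig_y', 0)]
```

Real docking programs also take account of orientation, chemistry and energy terms; the sketch only shows how a geometric description of the site can be turned into a ranked list.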
In CombiDOCK, each potential ligand is considered as a scaffold
decorated with functional groups. Only spheres on the scaffold are initially
used in the docking prediction, and then individual functional groups are
tested using a variety of bond torsions. Finally, the structure is checked for
steric clashes (bumped) before a final score is presented.
Chemical databases can be screened not only with a binding site
(searching for complementary molecular interactions) but also with another
ligand (searching for identical molecular interactions). Several available
algorithms can compare two-dimensional or three-dimensional structures and
build a profile of similar molecules.
The three-dimensional (3D) structure of the target, determined for example by
X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, is a
prerequisite for designing a compound that can bind to or act on it. The
compound is chosen from an existing chemical compound library by combinatorial
structure docking. The lead compounds from the library are docked, that is,
tested for complementary fit on the active site of the target molecule. This
initial in silico fitting reduces the number of compounds that have to be
synthesized and tested in vitro, since the databases contain the chemical
properties and methods of synthesis of the compounds.
In addition, there are a few other commercial docking and molecular
modeling software packages, which are described below:
Schroedinger
Schroedinger Software is a suite of computational tools specializing in
research for computational chemistry, docking, homology modeling, protein
X-ray crystallography refinement, bioinformatics, ADME prediction,
cheminformatics, enterprise informatics, pharmacophore searching, molecular
simulation, and quantum mechanics to solve real-world problems in life
science and molecular chemistry research. Maestro is the unified interface for
all Schroedinger software. Impressive rendering capabilities, a powerful
selection of analysis tools, and an easy-to-use design combine to make Maestro
a versatile modeling environment for all researchers. It can be used to build,
edit, run and analyse molecules.
The main components are OPLS-AA, MMFF, the GBSA solvent model,
conformational sampling, minimization and MD, together with the Maestro GUI,
which provides visualization, molecule building, calculation setup, job
launching and monitoring, project-level organization of results and access to a
suite of other modeling programs (http://www.schrodinger.com/).
Molsoft
Molsoft is a leading provider of tools, databases and consulting services in the
area of structure prediction, structural proteomics, bioinformatics,
cheminformatics, molecular visualization and animation, and rational drug
design. Molsoft offers complete solutions customized for a biotechnology or
pharmaceutical company in the areas of computational biology and chemistry.
Molsoft is committed to continuous innovation, scientific excellence, the
development of cutting-edge technologies and original ideas. Its software offers a
powerful global optimizer in an arbitrary subset of internal variables, NOEs,
protein docking, ligand docking, peptide docking, and EM and density placement
(http://www.molsoft.com/).
Discovery Studio
Discovery Studio is a well-known suite of software for simulating small
molecule and macromolecule systems. It is developed and distributed by
Accelrys, a company that specializes in scientific software products covering
computational chemistry, computational biology, cheminformatics, molecular
simulations and quantum mechanics. It is typically used in the development
of novel therapeutic medicines, including small molecule drugs, therapeutic
antibodies, vaccines and synthetic enzymes, and even in areas such as consumer
products. It is used regularly in a range of academic and commercial entities,
but is most relevant to the pharmaceutical, biotech and consumer goods
industries.
The product suite has a strong academic collaboration programme
supporting scientific research, and makes use of a number of software
algorithms developed originally in the scientific community, including
CHARMM, MODELLER, DELPHI, ZDOCK, DMol3 and more
(http://accelrys.com/products/discovery-studio/).
GOLD - Protein-Ligand Docking

GOLD is a program for calculating the docking modes of small molecules in
protein binding sites and is provided as part of the GOLD Suite, a package of
programs for structure visualization and manipulation (Hermes), for protein-
ligand docking (GOLD) and for post-processing (GoldMine) and visualization
of docking results.
The product of collaboration between the University of Sheffield,
GlaxoSmithKline plc and CCDC, GOLD is very highly regarded within the
molecular modeling community for its accuracy and reliability. Its main
features are: calculation of docking modes of small molecules in protein binding
sites; a genetic algorithm for protein-ligand docking; full ligand and partial
protein flexibility; energy functions partly based on conformational and
non-bonded contact information from the CSD; a choice of scoring functions
(GoldScore, ChemScore and user-defined score); and virtual library screening
(http://www.ccdc.cam.ac.uk/products/life_sciences/gold/).
VLifeMDS
VLifeMDS is a comprehensive and integrated software package for
computer-aided drug design and the molecular drug discovery process. This integrated suite
provides a complete toolkit for scientists to perform all scientific functions with
its flexible architecture. VLifeMDS is ready to meet demands from a structure
based design approach as well as ligand based design approach while a
seamless integration between various modules within VLifeMDS allows a
hybrid approach for discovery projects.
With VLifeMDS, users can access intuitive features for multiple activities
within a discovery project. The main functions are active site analysis,
homology modeling, pharmacophore identification, conformer generation,
combinatorial libraries, property visualization, docking, QSAR analysis,
database querying and virtual screening
(http://www.vlifesciences.com/products/VLifeMDS/Product_VLifeMDS.php).
Active Site Analysis
By studying the active site of the target molecule carefully, the lead compound
can be built piece by piece using computer software. The surface of the target
molecule with which the lead interacts may present various chemical environments,
such as hydrophobic regions, hydrogen-bonding regions or a catalytic zone.
Fragments of a hypothetical compound are placed in this field. The orientation of the
fragments provides a clue about the final form of the lead compound.
GRID, GREEN, HISTE, HINT and BUCKTS are some of the software packages
used for this kind of active site analysis. Sometimes the entire molecule is fitted
into the receptor site or active site. DOCK is a program that uses a ‘shape fitting’
approach (Fig. 9.1A-9.1D). It searches all possible ways of fitting a ligand into
the receptor site. The binding site of the receptor or enzyme molecule contains
hydrogen-bonding regions and hydrophobic regions.
Fig. 9.1A. Wire frame view of the docking molecules RmID (Rv3266c) (enzyme) and 11za
(ligand) before docking as observed in the Hex window.
Fig. 9.1B. Wire frame view of the very close contact between RmID (Rv3266c) (enzyme)
and 11za (ligand) before docking as observed in the Hex window.
Fig. 9.1C. Harmonic surface view of the RmID (Rv3266c) (enzyme) and 11za (ligand)
after docking process is completed as observed in the Hex window.
Fig. 9.1D. The cartoon model of the RmID (Rv3266c) (enzyme) and 11za (ligand) complex
as observed in the Hex window.
Initially, a prototype molecule is positioned inside the active site to satisfy
a few of the bonding energies. Additional building blocks are fitted in a stepwise
manner until all the bonding energies are satisfied. CLIX is a program that
creates the active site points and then searches a chemical structure database
for compounds that would satisfy the active site.
QSAR
In drug development, lead compounds are optimized by decorating the
molecular skeleton with different functional groups and testing each
derivative for its biological activity. If there are several open positions on the
lead molecule that can be substituted, the total number of molecules that need
to be tested in a comprehensive screen would be very large.
The synthesis and screening of all these molecules would be time-
consuming and laborious, especially since most would have no useful activity.
In order to select those molecules most likely to have a useful activity, and thus
to guide chemical synthesis, QSAR can be used. QSAR is Quantitative
Structure–Activity Relationship, a mathematical relationship used to
determine how the structural features of a molecule are related to biological
activity.
Here, essentially, the molecules are treated as groups of molecular
properties (descriptors), which are arranged in a table. The QSAR mines these
data and attempts to find consistent relationships between particular
descriptors and biological activities, thus identifying a set of rules that can be
used to score new molecules for potential activity. A QSAR is usually
expressed in the form of a linear equation:
$$\text{Biological activity} = \text{constant} + \sum_{i=1}^{n} C_i P_i$$
where $P_1$ to $P_n$ are parameters (molecular properties) established for each molecule
in the series and $C_1$ to $C_n$ are coefficients calculated by fitting variations in the
parameters to their biological activities.
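The fitting step can be illustrated with a small least-squares sketch in plain Python. It is only one possible way of obtaining the coefficients, assumed here for illustration: the constant and the coefficients are found by solving the normal equations. The descriptor table and activities are synthetic values, generated from activity = 1.0 + 2.0 P1 - 0.5 P2 so that the result can be checked by eye, and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]       # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_qsar(descriptors, activities):
    """Least-squares fit; returns [constant, C1, ..., Cn]."""
    rows = [[1.0] + list(p) for p in descriptors]         # leading 1 for the constant
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, activities)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_activity(coeffs, params):
    """Biological activity = constant + sum of Ci * Pi."""
    return coeffs[0] + sum(c * p for c, p in zip(coeffs[1:], params))


# Synthetic table: two descriptors (P1, P2) per molecule, for illustration only.
descriptors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
activities = [2.0, 4.5, 5.0, 7.5, 8.5]

coeffs = fit_qsar(descriptors, activities)
print([round(c, 3) for c in coeffs])                      # close to [1.0, 2.0, -0.5]
print(round(predict_activity(coeffs, (6, 2)), 3))         # score for a new molecule
```

Once the coefficients are known, scoring an untested derivative needs only its descriptor values, which is what makes the approach useful for prioritizing synthesis.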
Once the lead molecules are identified, they have to be optimized for
potency, selectivity and pharmacokinetic properties. Four properties are
recommended for high bioavailability (intestinal absorption): hydrogen bond
donors <5, hydrogen bond acceptors <10, relative molecular weight <500 and
lipophilicity <5. Drugs targeting the central nervous system must also be able
to cross the blood-brain barrier (BBB).
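These four thresholds lend themselves to a simple automated filter. The sketch below uses the cut-offs exactly as stated in the text; the `Molecule` record, its field names and the two example entries are hypothetical, and lipophilicity is assumed to be supplied as a log P value.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    """Hypothetical record holding the four properties used by the filter."""
    name: str
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float
    log_p: float            # lipophilicity, assumed here to be a log P value


def bioavailability_violations(mol):
    """Return the list of criteria that the molecule fails."""
    checks = {
        "H-bond donors < 5": mol.h_bond_donors < 5,
        "H-bond acceptors < 10": mol.h_bond_acceptors < 10,
        "molecular weight < 500": mol.molecular_weight < 500,
        "lipophilicity < 5": mol.log_p < 5,
    }
    return [rule for rule, passed in checks.items() if not passed]


# Illustrative values only; these are not measured data for real compounds.
small = Molecule("small_example", 2, 5, 320.0, 2.1)
large = Molecule("large_example", 7, 12, 650.0, 6.3)

print(bioavailability_violations(small))        # []
print(len(bioavailability_violations(large)))   # 4
```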
Some useful websites are given below:
http://www.netsci.org/science/compchem/feature19.html
http://clogp.pomona.edu/medchem/chem./master/search.html
http://chemfinder.cambridgesoft.com
http://www.cas.org/casdb.html
http://www.mdli.com
http://www.daylight.com/dayhtml/smiles/smiles-into.html#TOC
STUDY QUESTIONS
1. What is a drug?
2. What are the methods of discovering a drug?
3. What is a drug target?
4. How is a drug target identified?
5. How is the target validated?
6. How do we identify a lead compound?
7. What are the desirable qualities of a lead compound?
8. How do we optimize the lead compound?
9. What are the stages involved in drug trials?
10. What is pharmainformatics?
11. What is the scope of pharmainformatics?
12. What are chemical libraries?
13. Give the names of some search programs.
APPENDIX A
List of Important Websites and Web Addresses
Sequence and Structure based Alignments:

http://molbiol-tools.ca/Alignments.htm
http://www.ebi.ac.uk/Tools/
http://en.wikipedia.org/wiki/Structural_alignment_software
http://en.wikipedia.org/wiki/Sequence_alignment_software
http://www.cgl.ucsf.edu/home/meng/grpmt/structalign-content.html
http://expasy.org/tools/#align
DNA and RNA Databases and Analysis websites:

http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/genbank/
http://molbiol-tools.ca/DNA_Motifs.htm
http://molbiol-tools.ca/Others.htm
http://www.dna.gov/
http://www.ncbi.nlm.nih.gov/guide/dna-rna/
http://rdp.cme.msu.edu/
http://www.bioexplorer.net/Databases/RNA_Databases/
http://research.imb.uq.edu.au/rnadb/
http://rna-mdb.cas.albany.edu/RNAmods/
http://www.oxfordjournals.org/nar/database/cat/2
http://www.ncrna.org/frnadb/
http://www.rna.icmb.utexas.edu/
http://molbiol-tools.ca/RNA_analysis.htm
http://www.rna.uni-jena.de/rna.php
http://regrna.mbc.nctu.edu.tw/html/about.html
Phylogenetic Relationship Among Organisms:

http://www.ncbi.nlm.nih.gov/taxonomy
http://rdp.cme.msu.edu/
http://tolweb.org/tree/
http://molbiol-tools.ca/Phylogeny.htm
http://www.life.umd.edu/labs/delwiche/bsci348s/lec/Phylogenetics1.html
http://www.mathworks.in/help/bioinfo/phylogenetic-analysis.html
Web sources for performing database searches with a simple query sequence:
http://blast.ncbi.nlm.nih.gov/
http://www.ebi.ac.uk/Tools/sss/
http://www.bioinformaticsonline.org/links/ch_06_t_2.html
http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker
http://smart.embl-heidelberg.de/smart/set_mode.cgi?GENOMIC=1
Examples of guest web sites for performing a database search based on the Smith-Waterman dynamic programming algorithm:
http://www.ebi.ac.uk/Tools/sss/
http://opal.przyjaznycms.pl/en/
http://baba.sourceforge.net/
http://jaligner.sourceforge.net/
http://www.clcbio.com/index.php?id=1046
http://www.ebi.ac.uk/Tools/sss/psisearch/
http://www.compbio.dundee.ac.uk/www-scanps
Programs and websites for database similarity searches with a regular expression, motif, block or profile:
http://expasy.org/resources/search/keywords:similarity%20search
http://www.ebi.ac.uk/Tools/sss/fasta/
http://www.embl.de/~chenna/elm_2.html
http://scansite.mit.edu/dbsequence_reg.html
http://prosite.expasy.org/scanprosite/
http://myhits.isb-sib.ch/cgi-bin/motif_scan
http://molbiol-tools.ca/Motifs.htm
http://blocks.fhcrc.org/help/
http://www.genome.jp/tools/motif/blocks-blimps.htm
http://mbcf.dfci.harvard.edu/cmsmbr/biotools/biotools1.html
Programs and Web pages for sequence translation and related information:
http://www.ebi.ac.uk/Tools/st/
http://www.ebi.ac.uk/Tools/sequence.html
http://www.ch.embnet.org/pages/services.html
http://web.expasy.org/translate/
http://www.fr33.net/translator.php
http://cgap.nci.nih.gov/Genes/GeneFinder
http://www.kazusa.or.jp/codon/
http://molbiol-tools.ca/Translation.htm
Promoter prediction programs, Web pages, and related information:
http://www.cbs.dtu.dk/services/Promoter/
http://molbiol-tools.ca/Promoters.htm
http://en.wikipedia.org/wiki/List_of_gene_prediction_software
http://www.shodhaka.com/cgi-bin/startbioinfo/simpleresources.pl?tn=Promoter%20prediction
http://www.genetools.us/genomics/Promoter%20databases%20and%20prediction%20tools.htm
http://www.protocol-online.org/prot/Research_Tools/Online_Tools/Sequence_Analysis/Promoter_and_CpG_island_Prediction/index.html
http://bip.weizmann.ac.il/toolbox/seq_analysis/promoters.html
http://www.fruitfly.org/seq_tools/promoter.html
Web sites for protein structural analysis:

http://www.science.co.il/biomedical/protein-tools.asp
http://molbiol-tools.ca/Protein_tertiary_structure.htm
http://www.geneinfinity.org/sp/sp_structanalysis.html
http://tw.expasy.org/tools/
https://prosa.services.came.sbg.ac.at/prosa.php
http://iris.physics.iisc.ernet.in/psap/
http://www.rcsb.org/pdb/static.do?p=general_information/web_links/structure_classification.html
http://bioinfo3d.cs.tau.ac.il/wk/index.php/Servers_%26_Software
Programs for viewing protein molecules:

http://www.rcsb.org/pdb/static.do?p=software/software_links/molecular_graphics.html
http://www.umass.edu/microbio/rasmol/
http://www.cgl.ucsf.edu/chimera/
http://www.pymol.org/
http://jmol.sourceforge.net/
Databases of patterns and sequences of protein families:

http://expasy.org/tools/#pattern
http://www.geneinfinity.org/sp/sp_proteinmotifs.html
http://www.ebi.ac.uk/2can/databases/protein7.html
http://www.hsls.pitt.edu/obrc/index.php?page=sequences_motifs_functional_sites_annotaions
http://www.cbs.dtu.dk/services/SignalP/
Protein secondary structure prediction:

http://en.wikipedia.org/wiki/List_of_protein_structure_prediction_software#Secondary_structure_prediction
http://www.cbcb.umd.edu/~salzberg/appendixa.html#StructurePrediction
http://expasy.org/tools/#secondary
Homology modeling and threading/fold recognition servers:

List_of_protein_structure_prediction_software#Threading.2Ffold_recognition
http://www.umass.edu/microbio/chime/pe2.76/pe/protexpl/psbiores.htm
http://zhanglab.ccmb.med.umich.edu/I-TASSER/
http://expasy.org/tools/#tertiary
Threading_%28protein_sequence%29#Protein_threading_software
Genome information and analysis:

http://www.broadinstitute.org/scientific-community/science/programs/genome-sequencing-and-analysis/genome-sequencing-and-analysis-
http://www.scfbio-iitd.res.in/research/genome.htm
http://www.helmholtz-muenchen.de/en/mips/services/genomes/index.html
http://www.yeastgenome.org/
http://www.biobase-international.com/product/genome-trax?gclid=CIGkuMehsrICFYt66wodNEQAFg
http://molbiol-tools.ca/
http://www.completegenomics.com/analysis-tools/cgatools/
http://mbgd.genome.ad.jp/CGAT/
http://bmerc-www.bu.edu/bioinformatics/
http://expasy.org/genomics
Genomic and Proteomic databases of model organisms:

http://www.genome.gov/10001837
http://gmod.org/wiki/Main_Page
http://www.informatics.jax.org/
http://www.hsls.pitt.edu/obrc/index.php?page=non_human_vertebrates
List_of_biological_databases#Genome_databases
http://www.plantcyc.org/external_links/mods.faces
http://www.mtdb.igp.uu.se/
http://www.davidkfaux.org/SHETLANDmtDNADATABASES.htm
http://www.mitomap.org/MITOMAP
http://www.genome.jp/kegg/pathway.html
http://gobase.bcm.umontreal.ca/
http://www.ebi.ac.uk/Databases/proteomic.html
http://ppdb.tc.cornell.edu/
http://www.proteomicworld.org/DatabasePage.html
Human and mouse genome comparison:

http://www.evolutionpages.com/Mouse%20genome%20home.htm
http://www.ornl.gov/sci/techresources/Human_Genome/faq/compgen.shtml
http://www.genome.gov/page.cfm?pageID=10005831
http://www.knowledgene.com/
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
Gene and genome relationships and Proteome analysis:

http://genome.crg.es/courses/laCaixa05/Genomes/index.html
http://www.complextraitgenomics.com/software/gcta/
http://biology.jbpub.com/book/genetics/
http://www.biodatabases.com/whitepaper03.html
http://www.arabidopsis.org/portals/expression/microarray/microarraySoftwareV2.jsp
http://mbgd.genome.ad.jp/
http://www.genomesonline.org/cgi-bin/GOLD/index.cgi
http://microbialgenomics.energy.gov/databases.shtml
http://expasy.org/proteomics
https://wiki.nbic.nl/index.php/Proteomics_Tools
https://www.labkey.org/Project/home/CPAS/begin.view
http://www.genome.jp/kegg/pathway.html
http://www.ncbi.nlm.nih.gov/COG/
http://www.ncbi.nlm.nih.gov/unigene
http://pedant.gsf.de/
http://www.ebi.ac.uk/embl/
http://www.ebi.ac.uk/s4/summary/molecular?term=STRING&classification=7227&tid=gSynFBgn0003525
Metabolism and regulation functional genomics:

List_of_biological_databases#Metabolic_pathway_databases
List_of_biological_databases#Microarray_databases
http://www.ebi.ac.uk/fg/tools.html
http://euratools.rns4u.com/
http://world-2dpage.expasy.org/melanie/
http://en.wikipedia.org/wiki/2D_gel_analysis_software
http://bonsai.hgc.jp/~mdehoon/software/cluster/
http://www.stanford.edu/group/sherlocklab/cluster.html
http://www.geneontology.org/GO.tools.microarray.shtml
http://nf.nci.org.au/facilities/software/software.php?software=Gene+Cluster&all_sites=yes
http://www.genome.jp/kegg/brite.html
http://world-2dpage.expasy.org/swiss-2dpage/
http://www.yeastgenome.org/
http://smart.embl-heidelberg.de/
http://www.ebi.ac.uk/interpro/databases.html
Gene nomenclature, functional characterization and genome database development:
http://www.informatics.jax.org/mgihome/nomen/
http://www.gramene.org/documentation/nomenclature/
http://www.genenames.org/
http://www.ssr.org/NomenBullets.html
Gene_nomenclature#Nomenclature_guidelines
http://www.informatics.jax.org/mgihome/GXD/aboutGXD.shtml
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
http://biocyc.org/
http://en.wikipedia.org/wiki/Genome_browser
http://wfleabase.org/
http://www.hiv.lanl.gov/content/index
http://www.geneinfinity.org/sp/sp_nucdatabases.html
General Links:
http://www.oxfordjournals.org/nar/database/a/
http://biosharing.org/biodbcore
http://www.ufrgs.br/favet/bioquimica/bioinf/bioinf_links.htm
http://www.colorado.edu/chemistry/bioinfo/BioinformaticsLinks.htm
http://pbil.univ-lyon1.fr/bookmarks.html
http://mbcf.dfci.harvard.edu/cmsmbr/biotools/biotools1.html
http://www.imb-jena.de/~rake/Bioinformatics_WEB/proteins_purification.html
Databank Information
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+databanks+-id+19son1NJrIT
Glossary
Ab Initio: Describing an analysis method carried out from first principles.

Accession number: A unique number or code (identifier) given to mark the
entry of a sequence (protein or nucleic acid) or pattern (regular
expression, finger-print, profile) to a primary or secondary database.
Adenine: A purine base found in DNA and RNA.
Affine gap penalty: A gap penalty score that is a linear function of gap
length, consisting of a gap opening penalty and a gap extension
penalty multiplied by the length of the gap.
Algorithm: A set of rules with logical sequence of steps using which a task
can be performed.
Alignment: Arrangement of two or more nucleotide or protein sequences to
maximize the number of matching monomers.
Alignment score: An algorithmically computed score based on the number
of matches, substitutions, insertions and deletions (gaps) within an
alignment. Alignment scores are in log odds units, often bit units (log
to the base 2).
Alphabet: The total number of symbols in a sequence - 4 for DNA sequences
and 20 for protein sequences.
Amino acid: The fundamental building block of proteins. There are 20
naturally occurring amino acids in animals and around 100 more
found only in plants.
Analogs: In phylogenetics, non-homologous proteins that have similar
folding architectures, or similar functional sites, which are believed to
have arisen through convergent evolution.
Annotation: A combination of comments, notations, references and citations,
either in free format or utilizing a controlled vocabulary that together
describes all the experimental and inferred information about a gene
or protein.
Applet: Small software applications loaded from a server via HTML pages.
Archive: A collection of files.
ASCII: The American Standard Code for Information Interchange. ASCII
specifies 128 characters that are mapped to the values 0-127.
Assembly: The process of aligning overlapping sequence fragments into a
contig or series of contigs.
Attachment: A file that is sent appended to an email.
Basepair (bp): Any possible pairing between bases in opposing strands of
DNA or RNA. Adenine pairs with thymine in DNA or with uracil in
RNA; and guanine pairs with cytosine.
Bit units: A bit denotes the amount of information required to distinguish
between two equally likely possibilities (from information theory).
The number of bits of information, N, required to convey a message
that has M possibilities is log2 M = N bits.
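As a small worked example of bit units, the number of bits needed to pick one of M equally likely possibilities is the base-2 logarithm of M. The function name below is an assumption made for illustration.

```python
import math


def bits_required(num_possibilities):
    """Bits needed to single out one of M equally likely possibilities."""
    return math.log2(num_possibilities)


print(bits_required(2))   # a yes/no choice
print(bits_required(4))   # one DNA base (A, C, G or T)
print(bits_required(20))  # one amino acid residue, if all 20 were equally likely
```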
Bit: A binary digit.
BLAST: A program for sequence database similarity searching.
Block: An ungapped, aligned motif consisting of sequence segments that are
clustered to reduce multiple contributions from groups of highly
similar or identical sequences.
BLOSUM matrix: It is a matrix derived using local multiple alignments of
more distantly related sequences. It is used to assess similarity of
sequences when performing alignments.
Branch length: In sequence analysis, the number of sequence changes along a
particular branch of a phylogenetic tree.
Browser: A computer program (commonly known as a web client) that
permits information retrieval from the Internet and the World Wide Web (WWW).
cDNA library: A gene library composed of cDNA inserts synthesized from
mRNA using reverse transcriptase.
cDNA: (complementary DNA) A DNA strand copied from mRNA using
reverse transcriptase.
Cell: The basic unit of any living organism.
Central dogma: A fundamental principle of molecular biology, first
expounded by Francis Crick in 1958, essentially stating that the
transfer of information from nucleic acid to nucleic acid or from
nucleic acid to protein is possible, while transfer from protein to
nucleic acid or from protein to protein is impossible.
Chaperone: A protein that assists the correct non-covalent assembly of
folding proteins in vivo; chaperones do not themselves form part of
the structures they help to assemble.
Chromosomes: The paired, self-replicating genetic structures of cells that
contain the cellular DNA; the nucleotide sequence of the DNA
encodes the linear array of genes.
Cladogram: A dendrogram in which each node has two branches, representing
evolutionary history as speciation by bifurcation of the evolutionary
lineage.
Client: Any program that interacts with a server (LYNX, Mosaic and Netscape
are examples of client software).
Clone: A copied fragment of DNA, maintained in circular form, identical to the
template from which it is derived; also a population of genetically
identical cells derived from a single ancestor.
Cloning vector: A DNA molecule originating from a virus, a plasmid, or the
cell of a higher organism into which another DNA fragment can be
integrated without compromising the vector's capacity for self-
replication.
Cloning: The process of generating identical copies of a DNA fragment (that
may encode a complete gene) from a single template DNA or
producing identical copies of cells from a single ancestor.
Cluster analysis: A method for grouping together the objects that are
most similar from a larger group of related objects. The relationships
are based on some criterion of similarity or difference.
Cluster: The grouping of similar objects in a multidimensional space.
Coding sequence (CDS): A region of DNA or RNA whose sequence
determines the sequence of amino acids in a protein.
Codon: A sequence of three adjacent nucleotides that designates a specific
amino acid or a start/stop signal for translation.
Command language: The language used for giving instructions to a
computer operating system.
Command line: The basic level at which a computer prompts the user for
input.
Communication protocol: An agreed set of rules for structuring
communication between programs (allowing, for example, data
exchange between nodes on the Internet).
Comparative genomics: A comparison of gene numbers, gene locations, and
biological functions of genes in the genomes of diverse organisms, one
objective being to identify groups of genes that play a unique
biological role in a particular organism.
Comparative modeling: The process of predicting protein structure based on
a related sequence of known structure.
Composite database: A database that amalgamates a number of primary
sources, using a set of defined criteria that determine the priority of
inclusion of the different sources and the level of redundancy
retained.
Conceptual translation: The computational process of interpreting the
sequence of nucleotides in mRNA via the genetic code to a sequence
of amino acids, which may or may not correspond to an actual protein.
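Conceptual translation can be sketched as a table lookup over successive codons. The sketch below assumes the standard genetic code, written in the usual compact form with the bases ordered U, C, A, G; the function name and the use of "X" for unrecognized codons are illustrative choices.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "X")  # "X" marks an unknown codon
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(translate("AUGGCCUAA"))
```

The result is only a conceptual product: whether the cell actually makes such a protein is a separate, experimental question.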
Conformation: The precise three-dimensional arrangement of atoms and
bonds in a molecule describing its geometry and hence its molecular
function.
Consensus sequence: A pseudo-sequence that summarises the residue
information contained in a multiple alignment.
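One simple way to derive a consensus sequence is to take the most frequent residue in each column of the alignment. This majority rule is an assumption made for illustration; real tools may use thresholds or ambiguity codes instead.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Most frequent character in each column of equal-length aligned sequences."""
    columns = zip(*aligned_sequences)
    return "".join(Counter(column).most_common(1)[0][0] for column in columns)


alignment = ["ACGT", "ACGA", "TCGA"]
print(consensus(alignment))
```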
Conserved sequence: A sequence of bases in a DNA molecule (or an amino
acid sequence in a protein) that has remained essentially unchanged
during evolution.
Contig: Sequences of clones, representing overlapping regions of a gene
presented as an assembly or multiple alignment.
CORBA: The Common Object Request Broker Architecture (CORBA) is an
open industry standard for working with distributed objects,
developed by the Object Management Group. CORBA allows the
interconnection of objects and applications regardless of computer
language, machine architecture, or geographic location of the
computers.
Cytosine: A pyrimidine base found in DNA and RNA.
Database: A collection of data records either in a single file or as multiple
files.
Dendrogram: A branching graph used to represent phylogenetic
relationships.
Descriptor: Information about a sequence or set of sequences whose scope
depends on its placement in a record.
Discriminator: A mathematical abstraction of a conserved motif, or set of
motifs (e.g. a regular expression pattern, a profile or a fingerprint),
used to search either an individual query sequence or a full database
for the occurrence of that same or similar motif(s).
DNA (deoxyribonucleic acid): The molecule that encodes genetic
information. DNA is a double-stranded molecule held together by
weak bonds between basepairs of nucleotides. The four nucleotides in
DNA contain the bases: adenine (A), guanine (G), cytosine (C), and
thymine (T). In nature, basepairs form only between A and T and
between G and C; thus the base sequence of each single strand can be
deduced from that of its partner.
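Because A pairs only with T and G only with C, the partner strand can be computed directly. The sketch below returns the reverse complement, that is, the partner strand read in its own 5' to 3' direction; the function name is illustrative.

```python
# Watson-Crick pairing: A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Deduce the partner strand, reported 5' to 3'."""
    return strand.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```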
DNA sequence: The linear sequence of base pairs, whether in a fragment of
DNA, a gene, a chromosome or an entire genome.
DNAse: (deoxyribonuclease) One of a series of enzymes that can digest
DNA.
Domain: A compact, local, semi-independent folding unit, presumed to have
arisen via gene fusion and gene duplication events. Domains need not
be formed from contiguous regions of an amino acid sequence: They
may be discrete entities joined only by a flexible linking region of the
chain; they may have extensive interfaces, sharing many close contacts;
and they may exchange chains with domain neighbours. The
combination of domains within a protein determines its overall
structure and function.
Dot matrix: A dot matrix diagram provides a graphical method for comparing
two sequences.
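A dot matrix can be sketched as a grid with one row per residue of the first sequence and one column per residue of the second, marked wherever the two residues are identical. The example sequences and the characters used for marking are assumptions; practical tools usually add a sliding window and a match threshold to reduce noise.

```python
def dot_matrix(seq_a, seq_b):
    """Return one text row per residue of seq_a; '*' marks identity with seq_b."""
    return [
        "".join("*" if a == b else "." for b in seq_b)
        for a in seq_a
    ]


for row in dot_matrix("GATTACA", "GATCACA"):
    print(row)
```

Runs of marks along a diagonal indicate regions where the two sequences match.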
Download: Transferring files from a computer network to a local computer.
Drug: An agent that affects a biological process.
Dumb: A dumb terminal is a desktop display device that is not capable of
local processing, this being entirely carried out by the central
computer. Such terminals do not support windowing applications.
Dynamic programming: A method for the comparison and alignment of
strings or sequences in a way that allows the computationally efficient
incorporation of gaps.
Edman degradation: A method used in sequencing polypeptides, whereby
amino acid residues are removed sequentially from N-terminus by
reaction with phenyl-isothiocyanate, to form phenylthiocarbamyl-
peptide (PTC-peptide). This is cleaved in anhydrous acid, releasing a
thiazolinone intermediate and the remainder of the peptide.
E-mail: (Electronic mail) Message composed in a computer and transmitted
via the Internet to a remote location within seconds.
Enzyme: A protein that acts as a catalyst, speeding the rate at which a
biochemical reaction proceeds but not altering the direction or nature
of the reaction.
Exons: The protein-coding DNA sequences of a gene.
Expressed Sequence Tag (EST): A partial sequence of a clone, randomly
selected from a cDNA library and used to identify genes expressed in
a particular tissue.
Expression profile: The characteristic range of genes expressed at different
stages of a cell's development and functioning.
Expression vector: A cloning vector that is engineered to allow the
expression of protein from a cDNA.
False-negative: A true match that incorrectly fails to be recognized by a
discriminator.
False-positive: A false match incorrectly recognized by a discriminator.
Feature: Annotation on a specific location on a given sequence.
File Transfer Protocol (FTP): A method of transferring files to remote
computers.
File: A discrete collection of bytes that can be manipulated as a single entity.
Fingerprint: A group of ungapped motifs excised from a sequence alignment
and used to build a characteristic signature of family membership by
means of iterative searching of a primary (or composite) database.
Flat-file: A human-readable data-file in a convenient form for interchange of
database information. Flat-files may be created as output from
relational databases, in a format suitable for loading into other
databases.
Fold: The basic tertiary structure of a protein, including the secondary
structure elements, their sequential connections and relative spatial
positions.
Folding problem: The problem of determining how a protein folds into its
final 3D form given only the information encoded in its primary
structure.
Frameshift: An alteration in the reading sense of DNA resulting from an
inserted or deleted base, such that the reading frame for all
subsequent codons is shifted with respect to the number of changes
made (e.g. if a sequence should read UCU-CAA-AGG-UUA, and a
single U is added to the beginning, the new sequence would read
UUC-UCA-AAG-GUU, etc.). Frameshift may arise through random
mutations, or via errors in reading sequencing output.
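The frameshift example above can be reproduced by splitting the sequence into codons before and after the extra base is added. The helper name is an assumption made for illustration.

```python
def codons(sequence):
    """Split a sequence into complete triplets, ignoring any trailing bases."""
    return [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]


original = "UCUCAAAGGUUA"
shifted = "U" + original  # a single U inserted at the beginning

print("-".join(codons(original)))
print("-".join(codons(shifted)))
```

Every codon downstream of the insertion changes, which is why a single inserted or deleted base is usually far more disruptive than a substitution.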
Functional genomics: Assessment of the function of genes identified by
genome comparisons. The function of a newly identified gene is tested
by introducing mutations into the gene and then examining the
resultant mutant organism for an altered phenotype.
Gap penalty: A penalty subtracted from a sequence similarity score to
account for gaps in a sequence alignment.
Gap: A part of a sequence alignment where one sequence contains no aligned
monomer.
Gene duplication: A genetic alteration in which a segment of DNA is
repeated. Duplications may appear anywhere, but where the duplicated
segment is adjacent to the original one, this is termed as tandem
duplication.
Gene Expression: The process by which a gene's coded information is
converted into the structures present and operating in the cell.
Expressed genes include those that are transcribed into mRNA and then
translated into protein and those that are transcribed into RNA but not
translated into protein (e.g. transfer and ribosomal RNAs).
Gene families: Groups of closely related genes that encode similar protein
products.
Gene product: The protein resulting from the expression of a gene. In some
cases, the gene product may be an RNA molecule that is never
translated.
Gene: The fundamental physical and functional unit of heredity. A gene is an
ordered sequence of nucleotides located in a particular position on a
particular chromosome that encodes a specific functional product (i.e.,
a protein or RNA molecule).
Genetic Algorithm: A kind of search algorithm that was inspired by the
principles of evolution. A population of initial solutions is encoded
and the algorithm searches through these by applying a pre-defined
fitness measurement to each solution, selecting those with the highest
fitness for reproduction.
Genetic Code: The rules that relate the four DNA or RNA bases to the 20
amino acids. There are 64 possible three-base (triplet) sequences,
which are known as codons. A single triplet uniquely defines one
amino acid, but an amino acid may be coded by as many as six
codons. The code is thus said to be degenerate.
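The degeneracy of the code can be checked by counting how many codons map to each amino acid. The sketch assumes the standard genetic code in its compact form (bases ordered U, C, A, G, with "*" standing for a stop codon); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons assigned to each amino acid (and to stop, written "*").
codons_per_residue = Counter(codon_table.values())

print(len(codon_table))            # total number of triplets
print(codons_per_residue["L"])     # leucine is one of the six-codon amino acids
print(codons_per_residue["W"])     # tryptophan has a single codon
```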
Genetic Map: The relative positions of known genes or markers.
Genome: All the genetic material in the chromosomes of a particular
organism; its size is generally given as its total number of basepairs.
Global alignment: Attempts to match as many characters as possible, from
end to end, in a set of two or more sequences.
Guanine: One of the nitrogenous purine bases found in DNA and RNA.
Heuristic algorithm: An economical strategy for deriving a solution to a
problem for which an exact solution is computationally impractical.
Hidden Markov Model (HMM): A probabilistic model consisting of a number
of interconnecting states. Like profiles, HMMs encode full domain
alignments. They are essentially linear chains of match, delete or insert
states; a match state denotes a conserved column in an alignment; an
insert state allows insertions relative to match states; and delete states
allow match positions to be skipped.
High throughput screening: The technique of using automated assays to
search through large numbers of compounds for desired activity.
Home page: The HTML document that acts as the first contact point between
a browser and a server.
Homology: Being related by the evolutionary process of divergence from a
common ancestor. Homology is not a synonym for similarity.
Hybridization: The process of joining two complementary strands of DNA
or one each of DNA and RNA to form a double-stranded molecule.
Hydropathy profile: A graph in which hydropathy values are calculated
within a sliding window and plotted for each residue in a protein
sequence. Such graphs show characteristic peaks and troughs,
corresponding to the most hydrophobic and hydrophilic regions of the
sequence respectively.
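A hydropathy profile can be sketched as a moving average over per-residue hydropathy values. The sketch assumes the Kyte-Doolittle scale and a window of 7 residues; both are illustrative choices, and the example sequence is invented.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Average hydropathy in each sliding window along the sequence."""
    values = [KYTE_DOOLITTLE[residue] for residue in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Invented example: a hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("KKDEILLVVIALLKKDE")
print([round(value, 2) for value in profile])
```

Each value is normally plotted at the centre of its window, so peaks line up with the hydrophobic stretches.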
Hydropathy: Having the property of hydrophobicity, a low affinity for
water.
Hyperlink: An active HTTP cross-reference that links one web document to
another document on the Internet.
Hypermedia: Formatted Web documents containing a variety of information
types, including text, image, movie and audio.
Hypertext Markup Language (HTML): The syntax governing the way
documents are created so that they can be interpreted and rendered by
a web browser.
Hypertext Transfer Protocol (HTTP): The communication protocol used by
web servers.
Hypertext: Text that contains embedded links (hyperlinks) to other
documents.
Idiotype: The numbers and size of chromosomes in a cell of an organism.
In silico: The use of computers to simulate, process, or analyse a biological
experiment.
INDEL: An INsertion/DELetion in a DNA or protein sequence.
Information theory: A branch of mathematics that measures information in
terms of bits, the minimal amount of structural complexity needed to
encode a given piece of information.
Insertion: Part of a sequence alignment where one sequence appears to have
extra monomers compared with another sequence.
Internet Inter-ORB Protocol (IIOP): The communication protocol used by
object-request brokers to communicate over the Internet.
Internet: The international network of computer networks that connect
government, academic and business institutions.
Intranet: Computer network isolated from the Internet by means of a firewall
but that offers similar facilities to the local community (e.g. Web
servers, mail, etc.).
Intron: Non-coding region of DNA.
Introns: The sequence of DNA bases that interrupts the protein-coding
sequence of a gene; these sequences are transcribed into RNA but are
edited out of the message before it is translated into protein.
IP: Internet Protocol
IP Address: Internet Protocol address, a unique identifying number assigned
to each computer on the Internet to allow communication between
them.
Iterative: A sequence of operations in a procedure that is performed
repeatedly.
Java: An object-oriented, network programming language that permits
creation of either stand-alone programs, or applets that are launched
via links on web pages. In theory, Java programs run on any machine
that supports the Java run-time environment (including PCs and UNIX
workstations).
JavaScript: A scripting language designed for web-based applications.
Karyotype: The number and size of chromosomes in a cell of an organism.
Kilobase (kb): Unit of length for DNA fragments equal to 1000 nucleotides.
k-tuple: Identical short stretches of sequences, also called words.
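Finding the k-tuples (words) that two sequences share is a common first step in fast similarity searches. The function names and the word length of 3 are assumptions made for illustration.

```python
def k_tuples(sequence, k):
    """All overlapping words of length k in the sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def shared_words(seq_a, seq_b, k):
    """Words of length k that occur in both sequences, sorted alphabetically."""
    return sorted(set(k_tuples(seq_a, k)) & set(k_tuples(seq_b, k)))


print(k_tuples("GATTACA", 3))
print(shared_words("GATTACA", "ATTAC", 3))
```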
Lead compound: A substance that has many of the characteristics of an ideal
drug and which interacts with a specific target.
Library: An unordered collection of clones (i.e., cloned DNA from a
particular organism), generated from genomic DNA or cDNA.
Ligand: Any small molecule that binds to a protein or receptor.
Local alignment: Attempts to align the regions with the highest density of
matches within two sequences, rather than aligning them end to end.
Log odds score: The logarithm of an odds score.
Machine code: The binary code interpreted by a computer's processor.
Megabase (Mb): Unit of length for DNA fragments equal to 1 million
nucleotides.
Microarray: A miniature device, also known as a chip, containing hundreds
or thousands of different molecules immobilized in a regular pattern.
Mirror: Identical web sites, hosted on different computers, such that the data
might be acquired more quickly by users in specific countries.
Monte Carlo: A method that samples possible solutions to a complex
problem as a way to estimate a more general solution.
Mosaic: A mosaic protein is a modular protein that, rather than including
multiple tandem repeats of the same module, is composed of a
number of different modules, each conferring different aspects of the
parent protein's overall functionality (e.g. the calcium independent
latrotoxin receptor, a mosaic of EGF-like and laminin G-like modules).
Motif: A consecutive string of amino acids in a protein sequence whose
general character is repeated, or conserved, in all sequences in a
multiple alignment at a particular position. Motifs are of interest
because they may correspond to structural or functional elements
within the sequence they characterize.
mRNA: (messenger RNA) complementary RNA copy of DNA formed from a
single stranded DNA template during transcription that migrates from
the nucleus to the cytoplasm.
Mutation: Any change in DNA sequence.
Needleman-Wunsch algorithm: Uses dynamic programming to find global
alignments between sequences.
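A minimal sketch of the Needleman-Wunsch recurrence is given below. It computes only the optimal global alignment score, without the traceback that recovers the alignment itself, and it assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -1) chosen for illustration.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score by dynamic programming (no traceback)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Because the first row and column carry gap penalties, the alignment is forced to run from end to end of both sequences.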
Neighbor-joining method: Clusters together alike pairs within a group of
related objects to create a tree whose branches reflect the degrees of
difference among the objects (genes with similar sequences).
Neural network: From artificial intelligence algorithms, techniques that
involve a set of many simple units that hold symbolic data, which are
interconnected by links with numeric weights. Units operate only on
their symbolic data and on the inputs that they receive through their
connections.
Normalized Library: cDNA library generated such that all the genes in the
library are represented at the same frequency.
Northern blotting: A technique to identify RNA molecules by hybridization.
Nucleotide: A molecule consisting of a nitrogenous base (A, G, T and C in
DNA; A, G, U or C in RNA), a phosphate moiety and a sugar group
(deoxyribose in DNA and ribose in RNA). Thousands of nucleotides
are linked to form a DNA or RNA molecule.
Object-oriented database: A database in which data are stored as abstract
objects, with abstract relationships between them. The data
representations are potentially very varied, including, for example,
character strings, digitized images, tables, etc. An object may subsume
many other objects, and the database allows retrieval of the objects as
a whole. The flexibility of data representation, and the ability to group
objects together, renders object-oriented databases potentially very
powerful systems.
Odds score: The ratio of the likelihoods of two events or outcomes. In
sequence alignments and scoring matrices, the odds score for
matching two sequence characters is the ratio of the frequency with
which the characters are aligned in related sequences divided by the
frequency with which those same two characters align by chance
alone, given the frequency of occurrence of each in the sequences. Odds
scores for a set of individually aligned positions are obtained by
multiplying the odds scores for each position. Odds scores are often
converted to logarithms to create log odds scores that can be added to
obtain the log odds score of a sequence alignment.
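The relationship between odds scores and log odds scores can be checked numerically. The frequencies below are invented purely for illustration, and the function names are assumptions.

```python
import math


def odds_score(aligned_freq, freq_a, freq_b):
    """Observed frequency of the aligned pair divided by the chance frequency."""
    return aligned_freq / (freq_a * freq_b)


def log_odds_bits(aligned_freq, freq_a, freq_b):
    """Log odds score in bit units (logarithm to the base 2)."""
    return math.log2(odds_score(aligned_freq, freq_a, freq_b))


# Hypothetical (pair frequency, frequency of a, frequency of b) per position.
positions = [(0.25, 0.25, 0.25), (0.125, 0.25, 0.25), (0.03125, 0.25, 0.25)]

product_of_odds = math.prod(odds_score(*p) for p in positions)
sum_of_log_odds = sum(log_odds_bits(*p) for p in positions)

# Multiplying odds scores is equivalent to adding log odds scores.
print(product_of_odds, sum_of_log_odds, math.log2(product_of_odds))
```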
Ontology: Relationship between objects, especially in artificial intelligence
systems.
Open reading frame (ORF): A series of DNA codons, including a 5' initiation
codon and a termination codon that encodes a putative or known
gene.
Operating system: A program, or suite of programs, that controls the entire
operation of the computer, handling input/output operations,
interrupts, user requests, etc. (e.g. UNIX, VMS, Windows NT).
Optimal alignment: The highest-scoring alignment found by an algorithm
capable of producing multiple solutions. This is the best possible
alignment that can be found, given the parameters supplied by the
user to the sequence alignment program.
Operon: A unit of transcription consisting of one or more structural genes, an
operator and a promoter.
Orthologs: Homologous proteins that perform the same function in different
species.
Packet: A self-contained message, or component of a message, comprising
address, control and data signals, which may be transferred as a single
entity within a communication network.
Pairwise alignment: An alignment performed between two sequences.
PAM scoring matrix: Percent Accepted Mutation (PAM) matrix describes the
probability that one base or amino acid has changed during the course
of evolution. Amino acid PAM matrix is derived from families of
closely related sequences and is used to assess the similarity of
sequences when performing alignments.
Paralogs: Homologous proteins that perform different but related functions
within one organism.
Parametric sequence alignment: An algorithm that finds a range of possible
alignments based on varying the parameters of the scoring system for
matches, mismatches, and gap penalties.
Pattern: Molecular biological patterns usually occur at the level of the
characters making up the gene or protein sequence.
Penalties: Scores, or weights, used by programs in the computation of
sequence alignments; such scores are normally supplied as parameters
to the programs and thus may be modified by the user.
Peptide: A short stretch of amino acids, each covalently coupled to the next
by a peptide bond.
Percent similarity: An alignment score used for amino acid sequences in
which a substitution matrix is used to rank the substitution scores of
different amino acids.
Phantom INDELs: Spurious insertions or deletions that arise when physical
irregularities in a sequencing gel cause the reading software either to
call a base too soon or to miss a base altogether.
Pharmainformatics: The branch of information science that deals with
handling biological and chemical data in the pharmaceutical industry.
Phylogenetic analysis: Study of the evolutionary relationships between a
species and its predecessors (e.g. using phylogenetic trees).
Phylogenetic tree: A graphical representation of the putative evolutionary
relationships between groups of organisms, e.g. as calculated from
multiple protein or nucleic acid sequence alignments.
Polymerase Chain Reaction (PCR): A method for amplifying a DNA base
sequence using a heat-stable polymerase and two primers, one
complementary to the (+) strand at one end of the sequence to be
amplified and the other complementary to the (-) strand at the other
end. The faithfulness of reproduction of the sequence is related to the
fidelity of the polymerase.
Position-specific scoring matrix: Represents the variation found in the
columns of an alignment of a set of related sequences. Each
subsequent matrix column corresponds to the next column in the
alignment and each row corresponds to a particular sequence
character.
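The layout of a position-specific scoring matrix can be sketched by tabulating each column of an alignment, with one row per sequence character and one column per alignment position. The sketch stores raw counts only; converting them to scores (for example log odds with pseudocounts) is a further step not shown here. The function name and the toy alignment are assumptions.

```python
def count_matrix(aligned_sequences, alphabet="ACGT"):
    """Rows are sequence characters; columns follow the alignment columns."""
    return {
        symbol: [column.count(symbol) for column in zip(*aligned_sequences)]
        for symbol in alphabet
    }


matrix = count_matrix(["ACG", "ACT", "AGG"])
for symbol, counts in matrix.items():
    print(symbol, counts)
```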
Post-translational modification: An enzyme-catalyzed alteration to a protein
made after its translation from mRNA (e.g. glycosylation,
phosphorylation, myristoylation, methylation).
Primary database: A database that stores biomolecular sequences (protein or
nucleic acid) and associated annotation information (organism,
species, function, mutations linked to particular diseases, functional/
structural patterns, bibliographic references, etc.).
Primary structure: The linear sequence of amino acids in a protein molecule.
Primer: A short polynucleotide chain to which new deoxyribonucleotides can
be added by DNA polymerase.
Probe: A DNA or protein sequence used as a query in a database search.
Profile: A position-specific scoring table that encapsulates the sequence
information within complete alignments. Profiles define which
residues are allowed at given positions; which positions are conserved
and which degenerate; and which positions, or regions, can tolerate
insertions. In addition to data implicit in the alignment, the scoring
system may include evolutionary weights and results from structural
studies. Variable penalties are specified to weight against insertions
and deletions occurring in secondary structure elements.
Prokaryote: An organism lacking a membrane-bound, structurally discrete
nucleus and other subcellular compartments. Bacteria are prokaryotes.
Promoter: A site on DNA to which RNA polymerase will bind and initiate
transcription.
Protein: A molecule composed of one or more chains of amino acids in a
specific order; the order is determined by the base sequence of
nucleotides in the gene coding for the protein. Proteins are required
for the structure, function and regulation of cells, tissues and organs,
each protein having a specific role (e.g. hormones, enzymes and
antibodies).
Proteome: The entire complement of proteins produced by a particular
genome, including variants of the same basic protein generated by
post-translational modifications, etc.
QSAR: (Quantitative Structure-Activity Relationship). A mathematical
function used to relate the structural features of a molecule to its
biological function.
Quaternary structure: The arrangement of separate protein chains in a
protein molecule with more than one subunit.
Query sequence: A DNA, RNA or protein sequence used to search a
sequence database in order to identify close or remote family members
of known function.
Regular expression: A single consensus expression derived from a conserved
region of a sequence alignment, and used as a characteristic signature
of family membership. Synonymous terms: rule, pattern.
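As an illustration of a regular expression used as a sequence signature, the sketch below scans a protein for the N-glycosylation pattern written in PROSITE style as N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline). The translation into Python syntax, the function name and the test sequence are assumptions made for illustration.

```python
import re

# PROSITE-style N-{P}-[ST]-{P} rewritten as a Python regular expression.
N_GLYCOSYLATION = re.compile(r"N[^P][ST][^P]")


def find_sites(protein):
    """Return (start position, matched residues) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in N_GLYCOSYLATION.finditer(protein)]


print(find_sites("MNGTAANPSA"))
```

Such short patterns match by chance quite often, so a hit suggests a possible site rather than proving one.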
Regulatory regions or sequences: A DNA base sequence that controls gene
expression.
Relational database: A database that uses a relational data model, in which
data are stored in two-dimensional tables. The tables embody different
aspects or properties of the data, but contain overlapping information.
R-factor: In X-ray crystallography, this parameter is used to express the
extent of agreement between theoretical calculations and the
measured data; the lower the R-factor, the better the fit (R means
either residual or reliability).
RNA (ribonucleic acid): A molecule chemically similar to DNA that plays a
central role in protein synthesis. The structure of RNA is similar to
that of DNA but it is inherently less stable. There are several classes of
RNA molecule, including messenger RNA (mRNA), transfer RNA
(tRNA), ribosomal RNA (rRNA), and other small RNAs, each serving
a different purpose.
Rooted tree: A phylogenetic tree in which the least common ancestor of all
the species in the tree is present as an ancestral outgroup.
Rule: A short regular expression (typically 4-6 residues in length) used to
identify generic (non-family specific) patterns in protein sequences.
Rules tend to be used to encode particular functional sites: e.g. sugar
attachment sites, phosphorylation, hydroxylation, sulphation sites, etc.
However, their small size means that the patterns do not provide good
discrimination, and can only give a guide as to whether a certain
functional site might exist in a sequence.
Secondary database: A database that contains information derived from
primary sequence data, typically in the form of regular expressions
(patterns), fingerprints, blocks, profiles or Hidden Markov Models.
These abstractions represent distillations of the most conserved
features of multiple alignments, such that they are able to provide
potent discriminators of family membership for newly determined
sequences.
Secondary structure: Regions of local regularity within a protein fold (e.g.
α-helices, β-turns, β-strands).
Sequence alignment: A linear comparison of amino (or nucleic) acid
sequences in which insertions are made in order to bring equivalent
positions in adjacent sequences into the correct register. Alignments
are the basis of sequence analysis methods, and are used to pinpoint
the occurrence of conserved motifs.
Sequence Tagged Site (STS): Short (200-500 basepairs) DNA sequence that
has a single occurrence in the human genome and whose location and
base sequence are known. Detectable by polymerase chain reaction
(PCR), STSs are useful for localizing and orienting the mapping and
sequence data reported from many different laboratories and serve as
landmarks on the developing physical map of the human genome.
Expressed sequence tags (ESTs) are STSs derived from cDNA.
Sequencing: Determination of the order of nucleotides (base sequences) in a
DNA or RNA molecule, or the order of amino acids in a protein.
Server: A computer or software system that communicates information via
the Internet to a client.
Shotgun method: Cloning of DNA fragments randomly generated from a
genome.
Silent mutation: A nucleotide substitution that does not result in an amino
acid substitution in the translation product, because of the redundancy
of the genetic code.
Six frame translation: Translations of a stretch of DNA taking into account
three forward translations and three reverse translations, arising from
the three possible reading frames of an uncharacterized stretch of
DNA.
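The six reading frames can be enumerated before any translation is attempted: three offsets on the given strand and three on its reverse complement. The sketch below returns only the nucleotide frames, trimmed to whole codons; translating each frame into amino acids is a separate step not shown. The function name and the frame labels are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frames(dna):
    """Return the six reading frames, each trimmed to a whole number of codons."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement strand
    frames = {}
    for label, strand in (("+", forward), ("-", reverse)):
        for offset in range(3):
            # Drop trailing bases that do not complete a codon.
            usable = len(strand) - (len(strand) - offset) % 3
            frames[f"{label}{offset + 1}"] = strand[offset:usable]
    return frames


for name, frame in six_frames("ATGAAAC").items():
    print(name, frame)
```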
Smith-Waterman algorithm: Uses dynamic programming to find local
alignments between sequences. The key feature is that all negative
scores calculated in the dynamic programming matrix are changed to
zero in order to avoid extending poorly scoring alignments and to
assist in identifying local alignments starting and stopping anywhere
within the matrix.
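The key feature described above, resetting negative scores to zero, is visible in a minimal sketch of the Smith-Waterman recurrence. Only the best local score is returned, without the traceback, and the scoring scheme (match +2, mismatch -1, linear gap penalty -1) is an assumption chosen for illustration.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score by dynamic programming (no traceback)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay zero
    best = 0

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # never let a cell go negative
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

The best score can sit anywhere in the matrix, not just in the last cell, which is what lets the local alignment start and stop at any position.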
SNP: (Single Nucleotide Polymorphism) A change in DNA sequence at a
single residue.
Splice variants: Proteins of different length that arise through translation of
mRNAs that have not included all available exons in the template
DNA.
SRS: (Sequence Retrieval System) A data retrieval tool.
Structure prediction: Algorithms that predict the secondary, tertiary and
even quaternary structure of proteins from their sequences.
Subunit: A distinct polypeptide chain within a protein that may be separated
from other chains (whether identical or different) without breaking
covalent bonds.
Super-secondary structure: The arrangement of α-helices and/or β-strands in
a protein sequence into discrete folded structures (e.g., β-barrels,
β-α-β units, Greek keys, etc.).
Target: A molecule that is critical to a disease that may be targeted with a
potential therapeutic agent.
Telnet protocol: A method of communication between remote computers
that allows users to log on and use the distant machines as if
physically present at the remote location.
Tertiary database: A database derived from information housed in
secondary (pattern) databases (e.g. the BLOCKS and eMOTIF
databases, which draw on data stored within PROSITE and PRINTS).
The value of such resources is in providing a different scoring
perspective on the same underlying data, allowing the possibility to
diagnose relationships that might be missed using the original
implementation.
Tertiary structure: The overall fold of a protein sequence, formed by the
packing of its secondary and/or super secondary structure elements.
Threading: In protein structure prediction, the aligning of the sequence of a
protein of unknown structure with a known 3D structure to determine
whether the amino acid sequence is spatially and chemically
compatible with the structure.
Transcript: The single stranded mRNA chain that is assembled from a gene
template.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a
gene); the first step in gene expression.
Translation: The process in which the genetic code carried by mRNA directs
the synthesis of proteins from amino acids.
Transmembrane domain: A region of a protein sequence that traverses a
membrane; for α-helical structures, this requires a span of 20-25
residues.
Transmission Control Protocol/Internet Protocol (TCP/IP): The rules that
govern data transmission between two computers over the Internet.
True-negative: A false match that correctly fails to be recognized by a
discriminator.
True-positive: A true match correctly recognized by a discriminator.
Uniform Resource Locator (URL): The address of a source of information.
The URL comprises four parts: the protocol, the host name, the
directory path and the file name (e.g.
http://www.biochem.url.ac.uk/bsm/dbbrowser/prefacefrm.html).
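As a rough illustration, the four parts can be pulled out of the example address with Python's standard library. The helper name url_parts and the dictionary keys are hypothetical labels chosen to mirror the definition above.

```python
import posixpath
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into protocol, host name, directory path and file name."""
    parts = urlsplit(url)
    # The path component uses '/' separators, so posixpath splits it cleanly.
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "directory": directory,
        "file": filename,
    }


if __name__ == "__main__":
    example = "http://www.biochem.url.ac.uk/bsm/dbbrowser/prefacefrm.html"
    for name, value in url_parts(example).items():
        print(f"{name}: {value}")
```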
Upstream: Further back in the sequence of a DNA molecule, with respect to
the direction in which the sequence is being read.
Western blot: Technique in which specific antibodies are used to identify
their antigens from a mixture of proteins.
Widow: Amino acid residues isolated from neighboring residues by spurious
gaps, usually the result of over-zealous gap insertion by automatic
alignment programs.
World Wide Web (WWW): The information system or network on the
Internet that uses HTTP as its primary communication protocol.
X-ray Crystallography: A technique to determine the three dimensional
structure of a protein.