Basic Bioinformatics
Second Edition
S. Ignacimuthu, s.j.
Alpha Science International Ltd.
Oxford, U.K.
242 pgs. | 54 figs. | 16 tbls.
S. Ignacimuthu, s.j.
Director
Entomology Research Institute
Loyola College, Chennai
Copyright © 2013
ALPHA SCIENCE INTERNATIONAL LTD.
7200 The Quorum, Oxford Business Park North
Garsington Road, Oxford OX4 2JZ, U.K.
www.alphasci.com
ISBN 978-1-84265-804-8
E-ISBN 978-1-84265-978-6
Printed in India
Dedicated to
Rev. Fr. Adolfo Nicolas, S.J.
the Superior General
of the Society of Jesus
Preface to the Second Edition
As I thank the readers for their tremendous support for my book 'Basic
Bioinformatics', I am happy to bring out the second edition of this book for the
benefit of the readers. In recent years bioinformatics has been gaining
importance. Being an interface between modern biology and informatics, it
involves the discovery, development and use of computational algorithms
and software tools that facilitate an understanding of the biological processes
with a goal to serve healthcare and other sectors of human endeavours.
From the time Paulien Hogeweg and Ben Hesper coined the word
bioinformatics in 1978 to refer to the study of information processes in biotic
systems, rapid developments have taken place in mapping and analyzing
DNA and protein sequences, developing new databases, aligning different
sequences, comparing them, viewing 3-D models of protein structures,
studying the molecular interaction and carrying out drug discovery analyses.
I am immensely happy to present the revised edition which includes all
the up-to-date basic information relating to different areas of bioinformatics
along with some procedures to gain hands-on experience. I am sure the
students and teachers will greatly benefit from this book.
S. Ignacimuthu, s.j.
1
History, Scope and
Importance
Definitions
Bioinformatics is defined in various ways. Some of the definitions are as
follows:
(i) Bioinformatics is the use of computers in solving information problems
in life sciences; mainly it involves the creation of extensive electronic
databases on genomes and protein sequences. Secondarily it involves
techniques such as the three-dimensional modeling of biomolecules
and biological systems.
(ii) Bioinformatics is the computational management of all kinds of biological
information, including genes and their products, whole organisms or
even ecological systems.
(iii) Bioinformatics is an integration of mathematical, statistical and
computational methods to analyse biological, biochemical and
biophysical data. It deals with methods of storing, retrieving and
analyzing biological data, such as nucleic acid and protein sequences,
structures, functions, pathways and genetic interactions.
(iv) Bioinformatics is the storage, manipulation and analysis of biological
information via computer science. Bioinformatics is an essential
infrastructure underpinning biological research.
1980 Mark Skolnick, Ray White, David Botstein and Ronald Davis created
RFLP marker map of human genome.
• The first complete gene sequence for an organism (ΦX174) was
published.
• Wuthrich et al. published a paper detailing the use of
multidimensional NMR for protein structure determination.
• IntelliGenetics Inc. was founded in California. Their primary
product was the IntelliGenetics Suite of programs for DNA and
protein sequence analysis.
• The Smith-Waterman algorithm for sequence alignment was
published.
• The US Supreme Court held that genetically modified bacteria are
patentable.
1981 IBM introduced its personal computer to the market
• Human mitochondrial DNA was sequenced
• D. Benson, D. Lipman and colleagues developed a menu-driven
program called GENINFO to access sequence database.
• Maizel and Lenk developed various filtering and color display
schemes that greatly increased the usefulness of the dot matrix
method.
1982 The first recombinant DNA-based drug was marketed
• Genetics Computer Group (GCG) was created as a part of the
University of Wisconsin at Wisconsin Biotechnology Center.
1983 The Compact Disk (CD) was launched
• Name servers were developed at the University of Wisconsin
1984 Jon Postel’s Domain Name System (DNS) was placed on-line. Apple
computer announced the Macintosh.
1985 Kary Mullis invented PCR
• FASTP algorithm was published
• Robert Sinsheimer made the first proposal for Human Genome
Project
1986 Thomas Roderick coined the term Genomics to describe the scientific
discipline of mapping, sequencing and analyzing genes.
• Amoco Technology Corporation acquired IntelliGenetics. The
Swiss-PROT database was created by the Department of Medical
Biochemistry of the University of Geneva and the European
Molecular Biology Laboratory (EMBL)
• Leroy Hood and Lloyd Smith automated DNA sequencing.
• Charles DeLisi convened a meeting to discuss the possibility of
determining the nucleotide sequence of human genome.
• NSFnet debuts
1987 United States Department of Energy (US DoE) officially began
the human genome project.
• The physical map of E. coli is published by Y. Kohara et al.
1988 The use of yeast artificial chromosomes (YACs) was described by David T.
Burke et al.
• Pearson and Lipman published the FASTA algorithm
• The National Centre for Biotechnology Information (NCBI) was
established at the National Library of Medicine in the US.
• PERL (Practical Extraction and Report Language) was released by
Larry Wall
• United States National Institutes of Health (US NIH) took over
the genome project with James Watson at the helm.
• The Human Genome Initiative was started
• Des Higgins and Paul Sharp announced the development of
CLUSTAL
• A new program, an internet computer virus designed by a
student, infected 6000 military computers in the USA
1989 NIH established National Centre for Human Genome Research.
• The Genetics Computer group became a private company
• Oxford Molecular Group Ltd (OMG) founded in Oxford, UK,
created products such as Anaconda, Asp, Cameleon and other
(molecular modeling, drug design, and protein design) products.
1990 The BLAST programme to align DNA sequences was developed by
Altschul et al.
• Michael Levitt and Chris Lee founded Molecular Applications
Group in California.
• InforMax was founded in Bethesda, MD
• The HTTP 1.0 specification was published. Tim Berners-Lee
published the first HTML document.
1991 CERN, Geneva announced the creation of the protocols which make
up the World Wide Web.
• Craig Venter invented expressed sequence tag (EST) technology
• Incyte Pharmaceuticals, a genomics company was formed in
California.
• Myriad Genetics Inc. was founded in Utah with a goal of
discovering major common disease genes and their related
pathways.
• Linus Torvalds announced a Unix-like operating system which
later became Linux.
1992 Human Genome Sciences, Maryland, was formed by William Haseltine
• Craig Venter established The Institute for Genomic Research (TIGR).
• Mel Simon and coworkers (Cal Tech) invented BACs, crucial for
clone by clone gene assembly.
• Wellcome Trust joined human genome project
1993 Francis Collins took over the Human Genome Project. The Sanger Centre was
opened in the UK. Other nations joined in the effort. 2005 was projected as
the completion year.
• CuraGen Corporation was formed in New Haven, CT.
1994 Netscape Communications Corporation was founded and it released
Navigator.
• Attwood and Beck published the PRINTS database of protein
motifs.
• Gene Logic was formed in Maryland
1995 Researchers at The Institute for Genomic Research published the first
genome sequence of a free-living organism: Haemophilus influenzae.
• Patrick Brown and Stanford University colleagues invented DNA
microarray technology.
• Microsoft released version 1.0 of Internet Explorer
• Sun released version 1.0 of Java and Netscape released version 1.0
of JavaScript; version 1.07 of Apache was released
• The Mycoplasma genitalium genome was sequenced
1996 The genome of Saccharomyces cerevisiae was sequenced.
• International Human Genome project consortium established
‘Bermuda rules’ for public data release.
• Prosite database was reported by Bairoch et al.
• Affymetrix produced the first commercial DNA chips.
• The working draft for XML was released by W3C
• Structural Bioinformatics, Inc. was founded in San Diego, USA
1997 The genome for E. coli was published
• Oxford Molecular Group acquired the Genetics Computer Group.
• LION bioscience AG was founded.
• Paradigm Genetics Inc. was founded in North Carolina, USA
• deCODE genetics mapped the gene linked to pre-eclampsia
1998 The genomes for Caenorhabditis elegans and baker’s yeast were
published
• Craig Venter formed Celera in Maryland
• Inpharmatica, a new genomics and bioinformatics company, was
established by University College London.
• GeneFormatics, a company dedicated to the analysis and
prediction of protein structure and function, was formed in San
Diego.
• The Swiss Institute of Bioinformatics was established as a non-
profit foundation
• NIH began SNP project to reveal human genetic variation.
• Celera Genomics proposed to sequence human genome faster and
cheaper than consortium.
1999 Wellcome Trust formed SNP consortium
• First Human Chromosome sequence was published.
2000 The genomes of Pseudomonas aeruginosa, Arabidopsis thaliana and
Drosophila melanogaster were sequenced.
• Pharmacopeia acquired Oxford Molecular Group.
2001 Science and Nature published annotations and analysis of human
genome by mid February.
2002 More genome sequences of other organisms were published.
• Structural Bioinformatics and GeneFormatics merged
• Full genome sequence of the common house mouse was published
2004 The Rat Genome Sequencing Project Consortium completed the genome
sequence of the Brown Norway laboratory rat.
2005 420,000 VariantSEQr human resequencing sequences were published
on the new NCBI Probe database
2007 A set of 12 closely related Drosophilidae were sequenced
• Craig Venter published his full diploid genome sequence
2008 Leiden University Medical Center deciphered the complete DNA
sequence of a woman
• G.P.S. Raghava from IMTECH, India developed software and
databases for protein structure prediction, genome annotation
and functional annotation of proteins.
All the above mentioned developments have contributed significantly to the
growth of bioinformatics in one way or another.
Initial Attempts
Initially a majority of protein sequences were obtained by the manual process
of sequential Edman degradation-dansylation. A very important step
towards the rapid increase in the number of sequenced proteins was the
development of automated sequencers which, by 1980, offered a 10⁴-fold
increase in sensitivity compared to the procedure implemented by Edman
and Begg in 1967.
The first complete protein sequence assignment using mass spectrometry
was achieved in 1979. This technique played a vital role in the discovery of the
amino acid γ-carboxyglutamic acid, and its location in the
N-terminal region of prothrombin.
During the 1960s and 1970s scientists found it difficult to develop
methods to sequence nucleic acids. When techniques did become available, the
first to emerge were applicable only to RNA (ribonucleic acid),
especially transfer RNAs (tRNAs). tRNAs were ideal materials for this early
work, because they were short (typically 74-95 nucleotides in length), and
because it was possible to purify individual molecules.
Advanced Techniques
DNA (deoxyribonucleic acid) consists of thousands of nucleotides and
assembling the complete nucleotide sequence of an entire chromosomal DNA
molecule is a very big task. With the advent of gene cloning and PCR, it
became possible to purify defined fragments of chromosomal DNA. This
paved the way for the development of fast and efficient DNA sequencing
techniques.
By 1977, two sequencing methods had emerged, using chain
termination and chemical degradation approaches. These techniques with
some minor modifications laid the foundation for the sequence revolution of
the 1980s and 1990s and the subsequent birth of bioinformatics.
The polymerase chain reaction (PCR), due to its sensitivity, specificity
and potential for automation, is considered the front-line analytical method
for analyzing genomic DNA samples and constructing genetic maps. Over
the years, incremental improvements in basic PCR technology have enhanced
the power and practice of the technique.
Since the introduction of the first semi-automated sequencer in 1987,
coupled with the development of PCR in the mid-1980s and fluorescent labeling of
DNA fragments generated by the Sanger dideoxy chain termination method,
there have been large-scale sequencing efforts which have contributed
greatly to the growth of sequence data. Technologies for capturing sequence
information have also advanced over time.
In the early 1980s, researchers could use digitizer pens to manually read
DNA sequences from gels. Then came image-capture devices, which were
cameras that digitized the information on gels. In 1987, Steven Krawetz helped
to develop the first DNA sequencing software for automated film readers.
In the early 1990s, J. Craig Venter and his colleagues devised a new
method to find genes. Rather than sequencing chromosomal DNA base by base,
Venter’s group isolated messenger RNA molecules, copied these mRNA
molecules into DNA molecules and then sequenced a part of each DNA
molecule to create expressed sequence tags or ESTs. These ESTs could be used
as handles to isolate the entire gene.
The EST approach also has generated enormous databases of nucleotide
sequences and the development of the EST technique is considered to have
demonstrated the feasibility of high-throughput gene discovery, as well as
provided a key impetus for the growth of the genomics industry.
Sequence Deposits
At the start of 1998, more than 300,000 protein sequences had been
deposited in publicly available non-redundant databases, and the number of
partial sequences in public and proprietary Expressed Sequence Tag (EST)
databases was expected to run into millions. By contrast, the number of 3D
structures in the Protein Data Bank (PDB) was still fewer than 20,000.
The United States Department of Energy (DoE) initiated a number of
projects in the 1980s to construct detailed genetic and physical maps of the
human genome. Their aim was to determine the complete nucleotide
sequence of the human genome and to localize the estimated 30,000 genes.
Work of such a great dimension required the development of new
computational methods for analyzing genetic map and DNA sequence data,
and demanded the design of new techniques and instrumentation for
detecting and analyzing DNA.
To benefit the public most effectively, the projects also necessitated the
use of advanced means of information dissemination in order to make the
results available as rapidly as possible to scientists and physicians. The
international effort arising from this vast initiative became known as the
Human Genome Project (HGP).
Useful Websites
A very useful guide can be found on the website:
http://www.genome.gov/Education/
An overview of the role, history and achievements of the US Department
of Energy in the HGP can be found on the website:
http://genomics.energy.gov/
The Genome Annotation Consortium (GAC) provides comprehensive
sequence-based views of a variety of genomes in the form of an illustrated
guide, with progress charts, etc., and it can be found on the website:
http://www.geneontology.org/GO.refgenome.shtml
Mapping and sequencing of the genomes of a variety of organisms have
been taken up and this can be found on the website:
http://www.ornl.gov/sci/techresources/Human_Genome/publicat/primer/prim2.html
Aims
The aims of bioinformatics are as follows:
(i) To organize data in a way that allows researchers to access existing
information and to submit new entries as they are produced.
(ii) To develop tools and resources that aid in the analysis of data.
(iii) To use these tools to analyze the data and interpret the results in a
biologically meaningful manner.
Tasks
The tasks in bioinformatics involve the analysis of sequence information. This
process involves:
• Identifying the genes in the DNA sequences from various organisms.
• Developing methods to study the structure and/or function of newly
identified sequences and corresponding structural RNA sequences.
• Identifying families of related sequences and the development of
models.
• Aligning similar sequences and generating phylogenetic trees to
examine evolutionary relationships.
Besides these, one important dimension of bioinformatics is identifying
drug targets and pointing out lead compounds.
Areas
Bioinformatics deals with the following areas:
(i) Handling and management of biological data including its
organization, control, linkages, analysis and so on.
(ii) Communication among people, projects, and institutions engaged in
biological research and applications. The communication may
include e-mail, file transfer, remote login, computer conferencing,
electronic bulletin boards, or establishment of web-based information
resources.
(iii) Organization, access, search and retrieval of biological information,
documents, and literature.
(iv) Analysis and interpretation of the biological data through the
computational approaches including visualization, mathematical
modeling, and development of algorithms for highly parallel
processing of complex biological structures.
the probable amino acid sequence of the encoded protein can be determined
using translation software.
Sequence search techniques could then be used to find homologues in
model organisms; and based on sequence similarity it is possible to model the
structure of the specific protein on experimentally characterized structures.
Finally, docking algorithms could design molecules that could bind to the
model structure, leading the way for biochemical assays to test their
biological activity on the actual protein.
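The translation step mentioned above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a coding strand that is already in frame; the function name translate and the example sequence are illustrative and do not come from any particular translation package.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with amino acids listed in TCAG order for each
# codon position; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA coding strand up to the first stop codon.

    Ambiguous bases such as N are not handled and raise KeyError.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate("ATGGCCTGGTAA")  # hypothetical 12-base sequence
```

A real gene-finding workflow would also have to choose the reading frame and strand, which this sketch leaves to the caller.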
Innovations
The majority of bioinformatics innovations involve applications of computer-
implemented protocols or software in collecting and/or processing biological
data. These inventions fall within the general category of computer-related
inventions called inventions implemented in a computer and inventions
employing computer-readable media. These inventions have two aspects: (a)
software and (b) hardware.
For example, a computer-based system for identifying new nucleotide
sequence clusters from a given set of nucleotide sequences based on sequence
similarity may comprise an input device, a memory and a processor as
hardware components of the system and a data set or method of operating
instructions stored in the memory and operable by the processor as a
software for the system. Patent protection would be invaluable in protecting
methods which use computational power, such as sequence alignments,
homology searches and metabolic pathway modeling.
Computers are now an integral part of the biological world and without
them advancements in biology and medicine would undoubtedly be
hindered greatly. Computers are essential for the management of ever-
growing biological data.
The Internet is a communication revolution, and the Web has been instrumental
in making it a success. It allows the user to move freely anywhere on this
single largest information highway. Computers handle large
quantities of data and help in probing the complex dynamics observed in
nature.
The data can be organized in flat files and spreadsheets. They can also be
stored in hierarchical files and relational files.
Programming Languages
There are many programming, scripting and markup languages which are
popular with bioinformaticists. HTML is a language used to specify the
appearance of a hypertext document, including the positions of hyperlinks.
HTML is not a programming language.
JavaScript is a popular scripting language that adds to the functionality
of hypertext documents, allowing web pages to include such features as pop-up
windows, animations and objects that change in appearance when the mouse
cursor moves over them.
Java is a versatile and portable programming language that is designed
to generate applications that can run on all hardware platforms. The syntax
of Java is derived from C and C++. Java is different from JavaScript. Java
applets are used in hypertext documents. PERL (Practical Extraction and
Report Language) is a versatile scripting language which is widely used in
the analysis of sequence data. XML (Extensible Markup Language) allows files
to be described in terms of the type of data they contain.
PERL and PYTHON are the most suitable languages for the work of
bioinformatics due to their efficiency and ability to meet diverse functional
requirements of the field. PERL was created by Larry Wall, drawing on languages
like sed, awk, the UNIX shell and C.
PERL does excellent pattern matching, has a flexible syntax or
grammar and requires less code for programming. It is good at string
processing, i.e. doing things like sequence analysis and database management.
It takes care of memory allocation. It integrates smoothly with UNIX-based
systems. It is freely available on the Internet to copy, compile and print.
PERL can be downloaded from its home page: http://www.perl.org/.
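Pattern matching on sequence strings, the strength described above, can be sketched briefly. The example is written in Python, the other language named in this section, and uses its standard re module. The choice of the EcoRI recognition site GAATTC as the motif, the function name find_sites and the test sequence are assumptions made only for illustration.

```python
import re

# EcoRI recognition site, used here only as a familiar example motif.
ECORI_SITE = re.compile(r"GAATTC")


def find_sites(sequence: str, pattern: re.Pattern = ECORI_SITE) -> list[int]:
    """Return the 0-based start position of every match of pattern."""
    return [match.start() for match in pattern.finditer(sequence.upper())]


positions = find_sites("AAGAATTCGGGAATTCTT")  # hypothetical sequence
```

Because the pattern is an ordinary regular expression, the same function also accepts motifs with alternatives or wildcards, for example re.compile(r"GA[AT]TC").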
PYTHON is a complete object-oriented scripting language developed by
Guido van Rossum and first released in 1991. It has tools for quick and easy
generation of graphical user interfaces, a library of functions for structural
biology and a mature library for numerical methods.
Bioinformatic Sequence Markup Language (BSML) graphically describes
genetic sequences and methods for storing and transmitting encoded sequence
and graphic information. Biopolymer Markup Language (BIOML) is a data
type definition for the annotation of molecular biopolymer sequence
information and structure data.
Operating Systems
The operating system is a master program that manages all peripheral
hardware and allows other software applications to run. BIOS (Basic Input-
Output System) is a low-level operating system which is largely or entirely in
firmware (i.e. software stored in read-only memory).
BIOS handles activities such as deciding what to do when the computer
is switched on after a cold start, reading and writing to disks, responding to
input, displaying readable characters on the monitor and producing
diagnostics. The higher-level operating system then takes over, and the
computer acquires a typical graphical user interface (GUI) such as Windows.
Files that contain instructions for the operating system are called batch files
in Windows and shell scripts in UNIX systems.
Windows, owned by Microsoft Corporation, is the most familiar operating
system on home and office PCs. Most commercial workstations and servers
run under variations of an operating system called UNIX. GNU and LINUX
conform to the UNIX standard.
The operating system allows one to have access to the available files
and programs. UNIX is a powerful operating system for multi-user
computing environments. The software that powers the web was invented on
UNIX. UNIX is rich in commands and possibilities, which include
everything from networking software to word processing software and from
e-mail to newsreaders. It also provides free access to download
programs installed on UNIX systems. UNIX has many varieties and
versions.
LINUX is regarded as an open source version of UNIX, as it can be
downloaded and installed free of cost. Under LINUX, PCs prove to be
highly flexible and useful workstations. It also comes with important
packages for computational biology. IBION is a recent, complete and self-
contained bioinformatics system. It is a ground-breaking server, an appliance
for bioinformatics that has an Apache web server, a PostgreSQL relational
database and the R statistical language on an Intel-based hardware system with
preinstalled LINUX and a comprehensive suite of bioinformatics tools and
databases.
Usually computer software is obtained on floppy disks or compact
disks (CDs). A file is downloaded when it is copied from a remote source
onto a local computer. A file is uploaded when it is copied from a computer’s
hard drive to a remote source. Downloading from the internet is achieved in
the following three ways: (i) directly from a hypertext document, (ii) from an
FTP server or (iii) by e-mail.
2.2 INTERNET
The interplay between the Internet, the World Wide Web, and the global
network of biological information and service providers has made the
bioinformatics revolution possible. The Internet is a global network of
computers and computer networks that links government, academic and
business institutions. This allows computers to talk to each other in their own
electronic languages. Biological information is stored on many different
computers around the world. The easiest way to access this information is to
join all those computers in a network.
Computers are connected in a variety of ways, most commonly by
telephone cables and satellite links, thus allowing data to be exchanged
between remote users. In order to function effectively, the networks share a
communication protocol called Transmission Control Protocol/Internet
Protocol, better known as TCP/IP. TCP determines how data are broken into
packets and reassembled. IP determines how the packets of information are
addressed and routed over the network. Such a shared pattern of
communication means that different types of machines are able to speak to
each other in a common way.
Computers within the network are referred to as nodes, and these
communicate with each other by transferring data packets. For transfer, data
are first broken into small packets (units of information), which are sent
independently and reassembled when they arrive at their destination. But
packets do not necessarily travel directly from one machine to another; they
may pass through several computers en route to their final destination. Even
if any of the nodes on the way are down, the network protocols are designed
to find an alternative route, since several routes are usually available.
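The idea of breaking data into packets and reassembling them at the destination can be shown with a toy Python sketch. It is not an implementation of TCP/IP: the sequence numbers, the 8-byte packet size and the function names are assumptions chosen for illustration, and real protocols add headers, checksums and retransmission.

```python
import random


def to_packets(data: bytes, size: int) -> list[tuple[int, bytes]]:
    """Split data into (sequence_number, payload) packets of at most size bytes."""
    return [
        (seq, data[start:start + size])
        for seq, start in enumerate(range(0, len(data), size))
    ]


def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the original data from packets that may arrive in any order."""
    return b"".join(payload for _, payload in sorted(packets))


message = b"ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
packets = to_packets(message, size=8)
random.shuffle(packets)  # simulate packets taking different routes
restored = reassemble(packets)
```

Sorting by sequence number is what lets the receiver ignore the order of arrival, which is why independent routing of packets does not corrupt the message.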
Access
The Internet provides a means to distribute software and enables researchers
to perform sophisticated analysis on remote servers. Until the late 1980s, there
were mainly three ways of accessing databases over the Internet: electronic
mail servers, File Transfer Protocol (FTP) and TELNET servers. E-mail serves
as a means of communicating text messages from one’s computer to some
other computer. FTP is a means of transferring computer files such as
programs from remote machines. TELNET is an Internet protocol that allows
the user to connect to computers at remote locations and use these computers
as if they were physically operating the remote hardware.
Electronic mail services allow researchers to send an electronic mail
query to the mail server’s Internet address. The researcher’s query will then
be processed by the server, and the result will be sent back to the sender’s
mailbox. However, this approach had its own disadvantages, such as error-prone
querying and long turnaround times. With File Transfer Protocol, the researcher
could download entire databases and search them locally. This too has its own
drawback: the researcher has to download each database again after
each update.
TELNET allows a user to remotely log onto a computer and access its
facilities. This method is useful for occasional queries. It has its own
disadvantages, such as extensive management of user identifications and
overloading of the remote computer’s processing power.
Origin
The true origins of the Internet lie with a research project on networking at
the Advanced Research Projects Agency (ARPA) of the US Department of
Defense in 1969, named ARPAnet. The original ARPAnet connected for the
first time four nodes at different places on the US West Coast, with the
immediate goal of rapid exchange of scientific data on defense-related research
between laboratories.
In 1981, BITnet (Because It’s Time) was introduced, providing point-to-
point connections between universities for the transfer of electronic mails and
files. In 1982, ARPA introduced TCP/IP, allowing different networks to
be connected to and communicate with one another.
Address
Once the machines on a network have been connected to one another, there
must be an unambiguous way to specify a single computer so that messages
and files actually find their intended recipient. To facilitate communication
between nodes, each computer on the Internet is given a unique, identifying
number (its IP address). An IP address is unique, identifying only one machine.
It is encoded in a dotted decimal format. For example, one node on the
internet might have the IP address: 130.14.25.1. These numbers represent the
particular machine, the site where the machine is located, and the domain
(and sub domain) to which the site belongs. These numbers help computers
in directing data.
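The dotted decimal format can be made concrete with Python's standard ipaddress module. The sketch below uses the example address from the text and shows that the four decimal fields are simply the four bytes of one 32-bit number; the variable names are illustrative.

```python
import ipaddress

addr = ipaddress.IPv4Address("130.14.25.1")

# Each dotted field is one byte (0-255) of a 32-bit number.
octets = [int(part) for part in str(addr).split(".")]
as_int = int(addr)

# Rebuilding the number by hand from the four fields gives the same value.
rebuilt = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]
```

Routers work with the 32-bit value; the dotted form exists only to make the address readable for people.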
An alternative, hierarchical domain-name system has also been
implemented, which makes Internet addresses easier to decipher. For
example, ncbi.nlm.nih.gov represents the above numbers and denotes the National
Centre for Biotechnology Information (NCBI), at the National Library of
Medicine (NLM), at the National Institutes of Health (NIH), at a government
site (gov).
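The hierarchy in a domain name reads from right to left, which a short Python sketch can make explicit. The function name domain_levels is hypothetical; the host name is the example used in the text.

```python
def domain_levels(hostname: str) -> list[str]:
    """List the labels of a host name from the top-level domain downwards."""
    return list(reversed(hostname.lower().rstrip(".").split(".")))


levels = domain_levels("ncbi.nlm.nih.gov")
# levels is ["gov", "nih", "nlm", "ncbi"]: government site, then NIH,
# then NLM within NIH, then NCBI within NLM.
```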
A complete list of domain suffixes, including country codes, can be found
at http://www.christcenteredstore.com/international_domain_extensions_and_suffixes.htm
and http://iwantmyname.com/domains/domain-name-registration-list-of-extensions.
Connectivity
Normally we can get connected to the Internet through a modem which uses
the existing twisted-pair copper cables carrying telephone signals to transmit
data. Data transfer rates using a modem are relatively slow (28.8 to 56 kilobits
per second [kbps]). A number of new technologies are available for faster
transfer of data. Integrated services digital network (ISDN) is one such
technology but it is costly.
Other cost-effective alternatives use television coaxial cables, whose
capacity not used to transmit television signals is free to transmit
data at high speed (4.0 megabits per second [Mbps]). Later, digital subscriber
line (DSL) with high speed (up to 7 Mbps) and asymmetric DSL (ADSL)
became available. Some of the newer technologies involve wireless and satellite
connections to the Internet.
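The practical difference between these transfer rates is easy to work out. The Python sketch below uses the nominal rates quoted above and a hypothetical 10-megabyte file; it ignores protocol overhead and line noise, so the results are best-case figures.

```python
def transfer_seconds(size_bytes: int, rate_bits_per_second: float) -> float:
    """Idealised transfer time: number of bits divided by the line rate."""
    return size_bytes * 8 / rate_bits_per_second


FILE_SIZE = 10 * 1_000_000  # hypothetical 10-megabyte sequence file

RATES = {
    "modem (56 kbps)": 56_000,
    "cable (4 Mbps)": 4_000_000,
    "DSL (7 Mbps)": 7_000_000,
}

for name, rate in RATES.items():
    print(f"{name}: {transfer_seconds(FILE_SIZE, rate):.0f} s")
```

Under these assumptions the file needs roughly 24 minutes over the modem, 20 seconds over cable and a little over 11 seconds over DSL.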
Most people commonly use the Internet for electronic mail (e-mail),
newsgroups, file transfer and remote computing. E-mail deals with
communication between individuals; newsgroups support discussion among
groups of users; file transfer and remote computing involve the use, for
example, of the File Transfer Protocol (FTP) to
transfer files between machines, and the Telnet protocol, by which users may
connect to computers at different sites and use the machines as if physically
present at the remote location.
The most exciting use of the Internet is communication between users in
real-time. These include the UNIX talk protocol (or VMS phone), which is
analogous to holding a telephone conversation, but users speak to each other
by typing into a shared screen. An extension of this concept is conferencing,
whereby groups of people meet and ‘talk’ to each other, again by typing into
a shared interface.
Object Web
Object web is designed to support highly functional and interactive systems.
It is a multi-tier architecture that contains two objects and a communication
layer. One object may represent the user interface, and the other may provide
some computation. To communicate between the two objects, it is necessary
to define the messages they might receive.
The messages between two or more objects are mediated by a special
piece of code, an Object Request Broker (ORB), on each machine, capable of
understanding the message definitions and able to translate them into the
specific language of each object. With the object web a system can be broken
down into its constituent components written in different languages and
running on different hardware systems.
The Common Object Request Broker Architecture (CORBA) provides the
standards that make this communication possible. It provides a language to
define the structure of the messages, the Interface Definition Language (IDL),
and the architecture for the mediators, the ORBs. ORBs transparently hide all
of the communication between distributed objects, and form the backbone
(wiring) of the object web.
Hyperlinks
Hyperlinks are usually characterized by being highlighted in some way,
either by using a different color from the main body of the text or by being
boxed etc. Selecting a highlighted link calls up the linked document,
regardless of its location, whether on the same server, or on a server in a
different country. Communication between hyperlinks is transparent.
Each hypertext document has a unique address known as a uniform
resource locator (URL). URLs take the format http://restofaddress. Here http
stands for Hypertext Transfer Protocol, the communication protocol used by
web servers, and restofaddress provides the location of the hypertext
document on the Internet.
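The two parts of a URL can be separated programmatically. The sketch below uses the urlsplit function from Python's standard library; the function name describe_url and the dictionary keys are illustrative choices, not terms from the text.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into its protocol, server name and document location."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,  # e.g. "http"
        "server": parts.netloc,    # the machine that holds the document
        "document": parts.path,    # where the document sits on that server
    }


if __name__ == "__main__":
    print(describe_url("http://www.ncbi.nlm.nih.gov/Entrez/"))
```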
HTML
Hypertext documents are written in a standard markup language known as
Hyper Text Markup Language or HTML. HTML code is strictly text-based, and
any associated graphics or sounds for that document exist as separate files in
a common format. Markup instructions permit the web author to render text in
bold type (the <B> symbol), to insert horizontal rules (<HR>), images
(<IMG>), and so on; a mode such as bold type is switched off with the
corresponding closing symbol (e.g. </B>).
Another technology to support the creation of a functional genetic data
warehouse is XML. XML stands for Extensible Markup Language. Like
HTML, XML can be used to build web pages, but XML tags data in a way that
any application can use. It provides a general language for representing data
in a standard format. It allows files to be described in terms of the types of
data they contain.
XML is more flexible and robust than HTML. It provides a method for
defining the meaning or semantics of the document. It has the advantage of
controlling not only how data are displayed on a web page, but also how
the data are processed by another program or by a database management
system (DBMS).
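A short example may help to show how XML tags describe the type of data they contain, so that any program can process the data. The record below is invented for illustration: the tag names (sequence_record, organism, molecule) and the sequence are assumptions, not part of any real database format. The parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag names and sequence are made up for this sketch.
RECORD = """
<sequence_record>
  <organism>Escherichia coli</organism>
  <molecule type="DNA">ATGACCATGATTACG</molecule>
</sequence_record>
"""


def read_record(xml_text: str) -> dict:
    """Pull typed data out of the XML record by tag name."""
    root = ET.fromstring(xml_text.strip())
    molecule = root.find("molecule")
    return {
        "organism": root.findtext("organism"),
        "type": molecule.get("type"),
        "length": len(molecule.text),
    }


if __name__ == "__main__":
    print(read_record(RECORD))
```

Because the tags name the data rather than its appearance, the same file could be displayed on a web page or loaded into a database without change.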
Entrez
Entrez is an integrated, text-based search and retrieval system. Just as SRS
was developed for EMBnet, Entrez was developed at NCBI to allow retrieval of
molecular biology data and bibliographic citations from NCBI’s integrated
databases. Entrez permits related articles in different databases to be linked
to each other, whether or not they are cross-referenced directly.
Entrez provides access to DNA sequence (from GenBank, EMBL and
DDBJ), protein sequence (from SWISS-PROT, PIR, PRF SEQDB, PDB and
translated protein sequence from the DNA sequence databases), genome and
chromosome mapping data, 3D protein structures from PDB, and the
PubMed bibliographic database.
Links between various databases are a strong point of NCBI’s system.
The starting point for retrieval of sequence and structure is called Entrez. It is
a www-based data retrieval system. It integrates information held in all
NCBI databases. It is the common front-end to all the databases maintained by
the NCBI and it is extremely easy to use. In total, Entrez links to 11 databases
(Table 2.2). Entrez can be accessed via the NCBI web site at the following URL:
http://www.ncbi.nlm.nih.gov/Entrez/
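Entrez can also be queried by programs rather than through a browser. The sketch below only builds a query address for NCBI's E-utilities service (the esearch program with its db, term and retmax parameters), which is not described in this chapter and is introduced here as an assumption; the function name esearch_url is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search program (an addition to the text).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database: str, term: str, max_records: int = 20) -> str:
    """Build (but do not send) a text query against one Entrez database."""
    query = urlencode({"db": database, "term": term, "retmax": max_records})
    return f"{EUTILS_ESEARCH}?{query}"


if __name__ == "__main__":
    print(esearch_url("protein", "insulin", 5))
```

Opening the printed address in a browser would return the matching record identifiers, which can then be used to retrieve the records themselves.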
Data Model
The NCBI introduced the use of a data model for sequence-related information.
This made possible the rapid development of software and the integration of
databases that underlie the popular Entrez retrieval system and on which the
GenBank database is built. The advantage of the model is the ability to move
effortlessly from the published literature to DNA sequences, to the proteins
they encode, to chromosome maps of the genes, and to the three-dimensional
structures of the proteins.
Table 2.2: Databases linked by Entrez
No.  Category                 Databases
1.   Nucleic acid sequences   Entrez nucleotides: sequences obtained from GenBank, RefSeq and PDB
2.   Protein sequences        Entrez protein: sequences obtained from SWISS-PROT, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq
3.   3D structures            Entrez Molecular Modeling Database (MMDB)
4.   Genomes                  Complete genome assemblies from many sources
5.   PopSet                   From GenBank, sets of DNA sequences that have been collected to analyze the evolutionary relatedness of a population
6.   OMIM                     Online Mendelian Inheritance in Man
7.   Taxonomy                 NCBI Taxonomy Database
8.   Books                    Bookshelf
9.   ProbeSet                 Gene Expression Omnibus (GEO)
10.  3D domains               Domains from the Entrez Molecular Modeling Database (MMDB)
11.  Literature               PubMed
The NCBI data model deals directly with a DNA sequence and a
protein sequence. The translation process is represented as a link between the
two sequences rather than an annotation on one with respect to the other.
Protein-related annotations, such as peptide cleavage products, are
represented as features annotated directly on the protein sequence. In this
way, it becomes very natural to analyze the protein sequences derived from
translations of CDS (coding sequence) features by BLAST or any other sequence
search tool without losing the precise linkage back to the gene. A collection
of a DNA sequence and its translation products is called a Nuc-prot set.
The NCBI data model also defines a sequence type called a segmented sequence.
GenBank, EMBL and DDBJ represent constructed assemblies of segmented
sequences as contigs. Entrez shows such an assembly as a line connecting all
its component sequences.
Bioseq
The Bioseq, or biological sequence, is a central element in the NCBI data
model. It comprises a single, continuous molecule of nucleic acid or protein,
thereby defining a linear, integer coordinate system for the sequence. A
Seq-annot is a self-contained package of sequence annotations, or
information that refers to specific locations on specific Bioseqs. Sequence
alignments describe the relationships between biological sequences by
designating portions of sequences that correspond to each other. This
correspondence can reflect evolutionary conservation, structural similarity,
functional similarity or a random event.
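The idea of a sequence with integer coordinates, and of annotations that point to locations on it, can be sketched with a few small classes. This is an illustration only: the class and field names below are simplified assumptions and do not reproduce the actual NCBI specification.

```python
from dataclasses import dataclass, field


@dataclass
class Bioseq:
    """A single continuous molecule; positions are integer coordinates."""
    identifier: str
    molecule: str   # "dna", "rna" or "protein"
    residues: str

    def __len__(self) -> int:
        return len(self.residues)


@dataclass
class Feature:
    """An annotation that refers to a location on a specific Bioseq."""
    bioseq_id: str
    start: int      # 0-based, inclusive (a convention chosen for this sketch)
    stop: int       # exclusive
    label: str


@dataclass
class SeqAnnot:
    """A self-contained package of annotations."""
    features: list = field(default_factory=list)

    def features_on(self, bioseq: Bioseq) -> list:
        return [f for f in self.features if f.bioseq_id == bioseq.identifier]


def extract(bioseq: Bioseq, feature: Feature) -> str:
    """Return the residues covered by a feature."""
    return bioseq.residues[feature.start:feature.stop]


if __name__ == "__main__":
    gene = Bioseq("seq1", "dna", "ATGAAATAG")
    annot = SeqAnnot([Feature("seq1", 0, 3, "start codon")])
    for feat in annot.features_on(gene):
        print(feat.label, extract(gene, feat))
```

Keeping the annotations separate from the sequence means that new annotations can be added without altering the sequence record itself.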
ExPASy
ExPASy (Expert Protein Analysis System) world wide web server
(http://www.expasy.ch) is a service provided since 1993 by a team at the Swiss
Institute of Bioinformatics (SIB). It contains databases and analytical tools
related to proteins and proteomics. The databases include Swiss-PROT,
TrEMBL, SWISS-2DPAGE, PROSITE, ENZYME and SWISS-MODEL. The
analytical tools include similarity searches, pattern and profile searches,
post-translational modification prediction, topology prediction, primary,
secondary and tertiary structure analysis and sequence alignment.
Procedure
Open the internet browser and type the URL address: http://www.expasy.ch.
Pull down the drop-down menu at the search option. Select Swiss-Prot/TrEMBL.
Type the name of the protein in the TEXT box. Note down the details from the
result page, which will show the name of the sequence, the taxonomic
classification, the description of the protein, the literature regarding the
sequence, etc.
Table 2.3: Some basic sites for beginners of bioinformatics on the www
1. http://www.ncbi.nlm.nih.gov/
2. http://www.ebi.ac.uk/
3. http://www.expasy.ch/
4. http://www.embl.de/
5. http://www.izb.fraunhofer.de/en.html
6. http://themecraft.net/www/bmn.com
Apart from these, there are a great number of specialist sites with
biological data which can be accessed. General-purpose search engines such
as Google, Yahoo, Bing, Wikipedia, AltaVista and Hotbot are helpful in this.
STUDY QUESTIONS
1. What is a computer?
2. What is software?
3. Name some languages used in computer programs.
4. What are the advantages of PERL?
5. What is Internet?
6. How does Internet work?
7. What is World Wide Web?
8. What are browsers? Give some examples.
9. How does Netscape Navigator work?
10. Give details about EMBnet.
11. How is sequence retrieval system useful in bioinformatics?
12. What is the role of NCBI in maintaining sequence databases?
13. What is the use of Entrez?
14. Explain Bioseq and ExPASy.
CHAPTER 3
DNA, RNA and Proteins
3.1 BACKGROUND
As early as 1866, Gregor Mendel suggested that factors of inheritance
exist in pea plants. At the beginning of the twentieth century, it became clear
that Mendel’s factors were related to parts of the cell called chromosomes.
Chromosomes are thread-like strands of chemical material located in the cell
nucleus.
Also, during this time, geneticists began using the terms ‘inheritance
unit’ and ‘genetic particle’ to describe the factors occurring on the
chromosomes of Mendel’s pea plants. By the 1920s, these terms had been
discarded and the word gene was used, following the suggestion of Wilhelm
Johannsen. Scientists viewed the gene as a specific and separate entity located
on the cell’s chromosome.
Initial Studies
In 1869, Friedrich Miescher isolated nucleic acid from the nucleus and named
this substance nuclein. Later Phoebus Levene and his coworkers studied the
components of nuclein and gave it a more descriptive and technical name,
deoxyribonucleic acid (DNA). They also identified ribonucleic acid (RNA)
from some organisms.
Their analysis revealed that both nucleic acids contain three basic
components: (i) a five-carbon sugar, which could be either ribose (in RNA) or
deoxyribose (in DNA), (ii) a series of phosphate groups, that is, chemical
groups derived from phosphoric acid molecules, and (iii) four different
compounds containing nitrogen and having the chemical properties of bases.
In DNA the four bases are adenine, thymine, guanine and cytosine; in
RNA, they are adenine, uracil, guanine and cytosine. Adenine and guanine
are double-ring molecules known as purines; cytosine, thymine and uracil
are single-ring molecules called pyrimidines (Fig. 3.1).
Fig. 3.1 The components of nucleic acid. The first component is a phosphate group, a
derivative of phosphoric acid composed of phosphorus, oxygen, and hydrogen atoms. The
second component is a five-carbon sugar, either deoxyribose (in DNA) or ribose (in RNA).
The third is a series of the five nitrogenous bases adenine, guanine, cytosine, thymine, and
uracil. Note the presence of nitrogen. The first two bases are known as purines; the last
three are pyrimidines.
Advanced Studies
In 1949, Erwin Chargaff reported that in DNA the amount of adenine is
always equal to the amount of thymine, regardless of the source of the DNA,
and the amount of cytosine is consistently equal to the amount of guanine.
Chargaff’s observations played an important role in the double helix model of
DNA proposed by James D. Watson and Francis H.C. Crick, as did the
experimental data of Maurice H.F. Wilkins and Rosalind Franklin, which
suggested that the DNA molecule was a helix. (In 1962, Watson, Crick and
Wilkins were awarded the Nobel Prize in Physiology or Medicine.
Unfortunately Franklin had died of cancer in 1958 and, because the Nobel
Prize is not awarded posthumously, she did not share in the
award.)
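Chargaff's observation follows directly from complementary base pairing, and this can be checked by counting bases in both strands of a duplex. The sketch below is illustrative; the function name duplex_base_counts and the test sequence are arbitrary choices.

```python
from collections import Counter

# A pairs with T and G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def duplex_base_counts(strand: str) -> Counter:
    """Count the bases in a strand together with its complementary strand."""
    strand = strand.upper()
    partner = strand.translate(COMPLEMENT)[::-1]  # partner written 5'->3'
    return Counter(strand + partner)


if __name__ == "__main__":
    counts = duplex_base_counts("AAGCT")
    print(counts["A"] == counts["T"], counts["G"] == counts["C"])
```

Whatever single strand is supplied, the duplex always contains as many A as T and as many G as C, although the amount of A + T need not equal the amount of G + C.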
In 1902, Archibald Garrod postulated that a genetic disease is caused by
a change in the ancestor’s genetic material. He also suggested that the disease
alkaptonuria occurs due to the lack of an enzyme that breaks down alkapton.
(Patients with this disease expel urine that rapidly turns black on exposure to
air. The color change takes place because the urine contains alkapton, a
substance that darkens on exposure to oxygen. In normal individuals, alkapton
[known chemically as homogentisic acid] is broken down to simpler substances
in the body, but in persons with alkaptonuria, the body cannot make this
transformation, and alkapton is excreted.)
In the 1940s, Beadle and Tatum postulated the ‘one gene – one enzyme
hypothesis’, which suggested that the genes of a cell influence the production
of cellular enzymes. (An enzyme is a protein that catalyses a chemical reaction
of metabolism while it itself remains unchanged.)
[Fig. 3.2 diagram: a DNA double helix, the RNA transcribed from it, and the amino acids
specified on translation (aspartic acid, alanine, phenylalanine, serine and lysine); for
example, the codon A-A-G translates into lysine.]
Fig. 3.2 Gene expression and protein synthesis. (a) The base code in DNA is used to
formulate a base code in RNA by the process of transcription. The RNA molecule is then
used in translation to encode an amino acid sequence in a protein,
(b) Some selected triplet codes in DNA and RNA and the amino acid specified in the
protein. Note that the RNA code (known as a codon) is the complement of the DNA code
and that certain codons are "start" or "stop" signals.
Genomic DNA -> (Transcription) -> mRNA -> (Translation) -> Protein
Fig. 3.3 The central dogma states that DNA is transcribed into RNA, which is then
translated into protein.
3.2 DNA
DNA is a linear, double-helical structure (Fig. 3.4). The double helix is
composed of two intertwined chains made up of building blocks called
nucleotides (Fig. 3.5). Each nucleotide consists of a phosphate group, a
deoxyribose sugar molecule and one of four different nitrogenous bases:
adenine, guanine, cytosine or thymine. Each of the four nucleotides is usually
designated by the first letter of the base it contains: A, G, C or T.
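[Fig. 3.4 diagram: the DNA double helix, showing the wide and narrow grooves, a width of
1.0 nm for one strand, a diameter of 2 nm, a spacing of 0.34 nm between successive
nucleotides and 3.4 nm per turn.]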
Fig. 3.4 What the X-ray diffraction photographs revealed about DNA. Watson and Crick
postulated that DNA is composed of two ribbon-like "backbones" composed of alternating
deoxyribose and phosphate molecules. They surmised that nucleotides extend out from the
backbone chains and that the 0.34 nm distance represents the space between successive
nucleotides. The data showed a distance of 3.4 nm between turns, so they inferred that ten
nucleotides exist per turn. One strand of DNA would encompass only 1 nm in width, so they
postulated that DNA is composed of two strands to conform to the 2 nm diameter observed
in the X-ray diffraction photographs.
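The reasoning in the caption is simple arithmetic and can be written out as a short calculation. The sketch below uses only the two distances quoted above; the function names and the 1000 base pair example are illustrative assumptions.

```python
RISE_PER_BASE_PAIR_NM = 0.34  # spacing between successive nucleotides
PITCH_NM = 3.4                # length of one complete turn of the helix


def base_pairs_per_turn() -> int:
    """Number of nucleotide pairs that fit into one turn of the helix."""
    return round(PITCH_NM / RISE_PER_BASE_PAIR_NM)


def helix_length_nm(n_base_pairs: int) -> float:
    """Approximate length of a straight double helix of the given size."""
    return n_base_pairs * RISE_PER_BASE_PAIR_NM


if __name__ == "__main__":
    print(base_pairs_per_turn(), "base pairs per turn")
    print(round(helix_length_nm(1000), 1), "nm for a 1000 base pair fragment")
```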
[Fig. 3.5 diagram: four nucleotides (thymine, adenine, cytosine and guanine) joined through
phosphate groups that bridge the 5' and 3' carbon atoms of successive sugars, with a free
-OH group at the 3' end.]
Fig. 3.5 The binding of nucleotides to form a nucleic acid. The phosphate group forms a
bridge between the 5' carbon atom of one nucleotide and the 3' carbon atom of the next
nucleotide. A water molecule (H2O) results from the union of the hydroxyl group (-OH)
formerly at the 3' carbon atom and a hydrogen atom (-H) formerly in the phosphate group.
The linkage between nucleotides is a "3'-5' linkage"; the bond is called a phosphodiester
bond. Note that the 3' carbon of the lowest nucleotide is available for linking to another
nucleotide (this is called the 3' end of the molecule) and that the phosphate group of the
uppermost nucleotide can link to still another nucleotide (this is the 5' end).
Each nucleotide chain is held together by bonds between the sugar and
phosphate backbone of the chain. The two intertwined chains are held
together by weak bonds between bases of opposite chains. There is a lock and
key fit between the bases of the opposite strands, such that adenine pairs only
with thymine and guanine pairs only with cytosine. The bases that form base
pairs are said to be complementary. DNA is replicated by the unwinding of the
two strands of the double helix and the building up of a new complementary
strand on each of the separated strands of original double helix (Fig. 3.6).
[Fig. 3.6 diagram: a parent DNA molecule unwinding into two strands, each of which acquires
a new strand of complementary bases (A with T, G with C).]
Fig. 3.6 The general plan of DNA replication. The double helix unwinds, and the two 'old'
strands serve as templates for the synthesis of 'new' strands having complementary bases.
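The replication plan described above can be imitated with strings. This is a minimal sketch that models only base pairing; it ignores the enzymes involved, and the function names are illustrative.

```python
# A pairs only with T, and G pairs only with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_strand(strand: str) -> str:
    """Return the partner strand, written in the 5'->3' direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]


def replicate(strand: str) -> list:
    """Separate a duplex and build a new partner on each old strand.

    Returns two (old strand, new strand) pairs, one per daughter molecule.
    """
    old_a = strand.upper()
    old_b = complementary_strand(old_a)
    return [
        (old_a, complementary_strand(old_a)),
        (old_b, complementary_strand(old_b)),
    ]


if __name__ == "__main__":
    for old, new in replicate("ATGC"):
        print("old:", old, "new:", new)
```

Each daughter molecule contains one old strand and one newly built strand, and both daughters carry the same base pairs as the parent molecule.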
Centromere
Each chromosome has a constriction called the centromere. Depending on the
position of the centromere, four types of chromosomes are seen. If the
centromere is found in the middle of the chromosome, it is the metacentric type.
If the centromere is slightly away from the middle, it is the submetacentric
type. If the centromere is found at the tip of the chromosome, it is the
telocentric type. If the centromere is very close to the tip, it is the
acrocentric type. The centromeres are the sites of attachment of the spindle
fibres which are formed during cell division.
In many species a separate pair of chromosomes is present for sex
determination and they are referred to as sex chromosomes. All the other
chromosomes are referred to as autosomes. When the complete diploid set of
chromosomes is photographed using a cytological preparation, and the
chromosomes are then cut out and arranged according to size, the result is
called a karyotype. The presentation of the same set in a diagrammatic manner
is referred to as an ideogram. The end portions of chromosomes are called
telomeres, where short multiple repeat sequences of DNA are arranged.
All living beings contain genetic information in the form of DNA within
their cells. A characteristic of all living organisms is that DNA is reproduced
and passed on to the next generation. DNA contains instructions for making
proteins.
Gene
A gene is a sequence of chromosomal DNA that is required for the production
of a functional product: a polypeptide or a functional RNA molecule. A gene
includes not only the actual coding sequences but also adjacent nucleotide
sequences required for the proper expression of genes.
3.3 RNA
RNA is the other major nucleic acid and it is single-stranded, unlike DNA
which is double-stranded. It contains ribose instead of deoxyribose in its
sugar-phosphate backbone, and uracil (U) instead of thymine (T).
There are three types of RNA in the cell for use in protein synthesis:
messenger RNA (mRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA).
mRNA acts as a template for protein synthesis; rRNA and tRNA form a
part of the protein-synthesizing machinery. mRNA is produced inside the
nucleus by transcription of protein-coding genes by RNA polymerase II. In
eukaryotic systems the coding sequence in a gene is not continuous as it is in
prokaryotes (Fig. 3.7). There are a number of noncoding sequences known as
introns interspersed with the coding sequences called exons, the parts of the
gene expressed as protein. Introns do not contain information for a functional
gene product such as a protein, but they contain switches for genes.
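[Fig. 3.7 diagram: a prokaryote gene with a continuous coding region, and a eukaryote gene
whose coding region is interrupted by introns.]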
Fig. 3.7 Generalized gene structure in prokaryotes and eukaryotes. The coding region is
the region that contains the information for the structure of the gene product (usually a
protein). The adjacent regulatory regions (light line) contain sequences that are recognized
and bound by proteins that make the gene's RNA and by proteins that influence the amount
of RNA made. Note that in the eukaryotic gene the coding region is often split into segments
(exons) by one or more noncoding introns. (Source: A.J.F. Griffiths et al., Modern Genetic
Analysis, W.H. Freeman and Company, 2002)
Pre-mRNA
When the RNA polymerase sweeps down the DNA template with introns and
exons, a preliminary mRNA molecule (pre-mRNA) is formed. Processing of the
pre-mRNA is therefore required to remove the non-coding introns from it. The
introns are removed biochemically; the exons are spliced together to form the
functional mRNA molecule. Splicing makes the coding sequence continuous and the
mRNA emerges as an accurate template for building up the protein (Fig. 3.8).
Fig. 3.8 The formation of mRNA. A gene consists of exons, the parts of the gene expressed
as protein, and introns, the intervening sequences between the exons. In the formation of
mRNA, the gene is transcribed to a preliminary mRNA molecule. Then the introns are removed
biochemically and the exons are spliced together. This activity results in the functional
mRNA molecule, which is then ready for translation. This type of processing does not occur
in mRNA production in prokaryotic cells such as bacterial cells; it occurs only in eukaryotic
cells such as plant, animal, and human cells.
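The removal of introns can be illustrated with a string operation. The sketch below assumes that the exon positions are already known; the sequences are invented for illustration, with the intron beginning with GU and ending with AG, and the function name splice is hypothetical.

```python
def splice(pre_mrna: str, exons: list) -> str:
    """Join exon segments given as (start, stop) pairs.

    Positions are 0-based and the stop position is exclusive; everything
    between two exons is treated as an intron and left out.
    """
    return "".join(pre_mrna[start:stop] for start, stop in sorted(exons))


# Made-up example: two exons separated by one intron (GU ... AG).
EXON_1 = "AUGGCC"
INTRON = "GUAAGUCUAACAG"
EXON_2 = "UUUUAA"
PRE_MRNA = EXON_1 + INTRON + EXON_2
EXON_POSITIONS = [
    (0, len(EXON_1)),
    (len(EXON_1) + len(INTRON), len(PRE_MRNA)),
]

if __name__ == "__main__":
    print(splice(PRE_MRNA, EXON_POSITIONS))
```

In the cell the exon boundaries are not supplied in advance; they are recognized from sequence signals at the exon-intron junctions.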
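[Fig. 3.9 diagram: a gene with promoter (P), 5' UTR, exon 1, intron 1, exon 2, intron 2,
exon 3 and 3' UTR, marked with the transcription start site, translation initiation site,
translation termination site, polyadenylation signal (AAUAAA) and transcription termination
site; each intron is marked GU ... A ... AG. Processing of the transcript includes addition of
the cap.]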
Fig. 3.9 Transcriptional and translational landmarks in a eukaryotic gene with two introns
(top line), and the processing of its transcript to make mRNA. Note that since the landmarks
shown are relevant to RNA, U is given in the gene sequence instead of T. (Source: A.J.F.
Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002)
Splicing
Splicing is carried out inside the nucleus by a group of molecules which have
a catalytic function similar to that of enzymes. This group is composed of
small RNA molecules rich in uracil, called U-RNAs or small nuclear RNAs
(snRNAs), in conjunction with proteins; together they form small nuclear
ribonucleoproteins (snRNPs). Several snRNPs, such as U1, U2, U4, U5 and U6,
are involved in splicing reactions. The exon-intron junction has a specific
nucleotide sequence, which is called the signature sequence. This signature
sequence is recognized by the snRNPs: the RNA portion of the snRNP interacts
with the splice junction nucleotides and base-pairs with them.
In vertebrate animals a branch point sequence is also present within the
intron. The U1 snRNP binds to the 5’ splice site and the U2 snRNP binds to the
branch point sequence. The remaining snRNPs, U5 and U4/U6, form a complex
with U1 and U2, causing the intron to loop so that the exons come together.
The combination of the intron and snRNPs is called the spliceosome. The
spliceosome loops out the intron, brings the exon junctions together and joins
the exon ends (Fig. 3.10). In some unicellular organisms, instead of snRNPs,
the RNA itself takes care of splicing, acting as a ribozyme (a catalytic RNA).
[Fig. 3.10 diagram: a pre-mRNA with exon 1, an intron (GU ... A ... AG) and exon 2; the
spliceosome, composed of five different snRNPs, attaches to the pre-mRNA; splicing
reactions 1 and 2 yield the spliced exons and the excised intron as a lariat.]
Fig. 3.10 The structure and function of a spliceosome. The spliceosome is composed of
several snRNPs that attach sequentially to the RNA, taking up positions roughly as shown.
Alignment of the snRNPs results from hydrogen bonding of their snRNA molecules to the
complementary sequences of the intron. In this way the reactants are properly aligned and
the splicing reactions (1 and 2) can occur. The P-shaped loop, or lariat structure, formed by
the excised intron is joined through the central adenine nucleotide. (Source: A.J.F. Griffiths
et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
Capping
Capping is a process by which the 5’ end of mRNA is protected from
exonuclease enzymes. Typically a prokaryotic mRNA remains stable for only a
few minutes. In eukaryotes the half-life of mRNA is around 6 h. A nucleotide
may also be deleted, added or substituted by RNA editing.
tRNA
tRNAs are adapter-like, small, linking molecules. The function of tRNA is to
fetch the correct amino acid to the mRNA molecule and add it to the growing
polypeptide chain during protein synthesis. Every amino acid has its own
tRNA. A tRNA has two ends. One end carries the anticodon, which base-pairs
with the codon of the mRNA. The other end acts as a socket to attach the amino
acid.
According to the sequence of codons in mRNA, the amino acids are
brought in by tRNAs and a specific polypeptide sequence is thus built. tRNA
molecules have between 74 and 95 nucleotides. tRNAs are produced in a
precursor form called pre-tRNA. Several tRNA genes are transcribed together
without interruption by the enzyme RNA polymerase III. A ribonuclease enzyme
then cleaves the precursor molecule into individual tRNAs.
Ribosomes
Ribosomes are macro molecules composed of both RNA and several
polypeptides. Ribosomes provide a firm platform for protein synthesis. Each
ribosome is composed of large and small subunits (Fig. 3.11).
Fig. 3.11. Ribosomes contain a large and a small subunit. Each subunit contains rRNA of
varying lengths and a set of proteins. There are two principal rRNA molecules in all ribosomes
(shown in the column on the left). Ribosomes from prokaryotes also contain one 120-base-long
rRNA that sediments at 5S, whereas eukaryotic ribosomes have two small rRNAs: a 5S
RNA molecule similar to the prokaryotic 5S, and a 5.8S molecule 160 bases long. The
proteins of the large subunit are named L1, L2, etc., and those of the small subunit
S1, S2, etc. (Source: Lodish et al., Molecular Cell Biology, Scientific American Books, Inc.,
1995).
rRNA
The prokaryotic ribosomes are of the 70S type. The subunits have 50S and 30S
values (S stands for the Svedberg unit, a measure of sedimentation rate). The
50S subunit has two rRNAs and 31 polypeptides. The 30S subunit has a single
rRNA and 21 polypeptides. In eukaryotes the ribosomes are of the 80S type. The
subunits have 60S and 40S values. The 60S subunit has three rRNAs and about 49
polypeptides. The 40S subunit has one rRNA and about 33 polypeptides. RNA
polymerase I transcribes the rRNA genes.
In prokaryotes such as E. coli, there are 7 copies of rRNA genes scattered
throughout the genome. Each gene contains one copy each of the 16S, 23S and 5S
rRNA sequences arranged consecutively. The gene is transcribed as a single
pre-rRNA (30S) molecule, which is processed to produce individual rRNAs.
The pre-rRNA folds into a number of stem-loop structures to which
ribosomal proteins bind. During this time some of the nucleotides of rRNA are
methylated. Finally, the ribonuclease RNase III cleaves and releases the 5S,
23S and 16S RNAs. Mature rRNAs are formed by further trimming at the 5’ and 3’
ends by ribonucleases M5, M16, and M23.
In eukaryotes, the sequences of the 28S, 18S and 5.8S rRNAs are present
in a single gene. This gene exists in multiple copies separated by short
non-transcribed regions. In humans, there are about 200 gene copies occurring
in 5 clusters on separate chromosomes. RNA polymerase I transcribes these
genes. Transcription takes place in the nucleolus inside the nucleus. In humans
the pre-rRNA is 45S in size. It is processed in a manner similar to that in
prokaryotes: the pre-rRNA is cleaved by ribonucleases to yield the mature 28S,
18S and 5.8S rRNAs. Small cytoplasmic RNAs (scRNAs) direct protein traffic
within the eukaryotic cell.
Transcription
The first step taken by the cell to make a protein is to copy or transcribe the
nucleotide sequence in one strand of the gene into a complementary
single-stranded molecule called ribonucleic acid (RNA) (Fig. 3.12). Component
nucleotides available in the surrounding region are used for the synthesis, and
an enzyme called RNA polymerase binds the nucleotides together to form the RNA
molecule.
Nontemplate DNA strand  5' CTGCCATTGTCAGACATGTATACCCCGTACGTCTTCCCGAGCGAAAACGATCTGCGCTGC 3'
Template DNA strand     3' GACGGTAACAGTCTGTACATATGGGGCATGCAGAAGGGCTCGCTTTTGCTAGACGCGACG 5'
mRNA                    5' CUGCCAUUGUCAGACAUGUAUACCCCGUACGUCUUCCCGAGCGAAAACGAUCUGCGCUGC 3'
Fig. 3.12. The mRNA sequence is complementary to the DNA template strand from which
it is synthesized and therefore matches the sequence of the nontemplate strand (except that
RNA has U where DNA has T). The sequence shown here is from the gene for the enzyme
β-galactosidase, which is involved in lactose metabolism. (Source: A.J.F. Griffiths et al.,
Modern Genetic Analysis, W.H. Freeman and Company, 2002).
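The relationship between the two DNA strands and the mRNA can be expressed in a few lines of code. This is a minimal sketch of the base-pairing rule only; the function names are illustrative, and the template strand is assumed to be written in the 3' to 5' direction, as in the figure.

```python
# Pairing used during transcription: A->U, C->G, G->C, T->A.
DNA_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe(template_3_to_5: str) -> str:
    """Build the mRNA (5'->3') from a template strand written 3'->5'."""
    return template_3_to_5.upper().translate(DNA_TO_RNA)


def mrna_from_nontemplate(nontemplate_5_to_3: str) -> str:
    """The mRNA matches the nontemplate strand, with U in place of T."""
    return nontemplate_5_to_3.upper().replace("T", "U")


if __name__ == "__main__":
    # The first ten bases of each DNA strand in Fig. 3.12.
    print(transcribe("GACGGTAACA"))
    print(mrna_from_nontemplate("CTGCCATTGT"))
```

Both routes give the same mRNA, which is why sequence databases usually report only the nontemplate strand of a gene.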
The production of RNA is called transcription, a word coined by Crick in
1956. The fragments so constructed are known as RNA transcripts. These
RNA molecules, together with ribosomal proteins and enzymes, constitute a
system that carries out the task of reading the genetic message and producing
the protein that the genetic message specifies.
The transcription process, which occurs in the cell nucleus, is very
similar to the process for replication of DNA because the DNA strand serves as
the template for making the RNA copy, which is called a transcript. The RNA
transcript, (which in many species undergoes some structural modifications)
becomes a working copy of the information in the gene, a kind of message
molecule called messenger RNA (mRNA). The mRNA then enters the
cytoplasm, where it is used by the cellular machinery to direct the manufacture
of a protein.
Translation
The process of producing a chain of amino acids based on the sequence of
nucleotides in the mRNA is called translation. The nucleotide sequence of a
mRNA molecule is read from one end of the mRNA to the other, in groups of
three successive bases. These groups of three are called codons (AUU, CCG,
UAC). Because there are four different nucleotides, there are 4 × 4 × 4 = 64
different possible codons, each one either coding for an amino acid or a signal
to terminate translation (Table 3.1).
Table 3.1: The genetic code. Notice that an amino acid can be coded by several
different codons. A stop codon does not code for an amino acid, but instead signals
to the ribosome that this is the end of the protein and that translation should cease.
Second letter
U C A G
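The counting argument (4 × 4 × 4 = 64 codons) and the reading of mRNA in groups of three can both be shown in code. In the sketch below, the 64-letter string is the standard genetic code written with one-letter amino acid symbols in the order U, C, A, G for each codon position ('*' marks a stop codon); this compact layout and the function name translate are choices made for this example, and the test sequence is invented.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon is met."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ceases
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE), "codons")
    print(translate("AUGUUUAAGUAA"))
```

The table also shows the redundancy noted in the caption of Table 3.1: 61 codons specify only 20 amino acids, so most amino acids have more than one codon.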
Fig. 3.13a The addition of a single amino acid (aa6), carried by the tRNA at the A site, to
the growing polypeptide chain, tethered by the tRNA at the P site, during translation of
mRNA. (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and
Company, 2002).
[Fig. 3.13b diagram: an mRNA with codons 1 to 11 being read by ribosomes.]
Fig. 3.13b The addition of an amino acid (aa) to a growing polypeptide chain in the
translation of mRNA. Multiple copies of the polypeptide are produced by a train of
ribosomes following each other along the mRNA; two such ribosomes are shown. (Source:
A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
[Fig. 3.14 diagram: (a) three amino acids (aa1, aa2, aa3) with side chains R1, R2 and R3 are
joined, from the amino end to the carboxyl end, by two peptide bonds, releasing two
molecules of water (2 H2O); (b) the planar peptide group, with bond distances of 1.24, 1.32,
1.46 and 1.51 angstroms.]
Fig. 3.14 The peptide bond. (a) A polypeptide is formed by the removal of water between
amino acids to form peptide bonds. Each aa indicates an amino acid. R1, R2 and R3
represent R groups (side chains) that differentiate the amino acids. R can be anything from
a hydrogen atom (as in glycine) to a complex ring (as in tryptophan). (Source: A.J.F. Griffiths
et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002). (b) The peptide group
is a rigid planar unit with the R groups projecting out from the C-N backbone. Standard bond
distances are shown in angstroms (Source: Stryer, L., Biochemistry, W.H. Freeman and
Company, 1995).
Table 3.2: The four naturally occurring nucleotides in DNA and RNA and 20
naturally occurring amino acids in proteins
Other classifications of amino acids can also be useful. For instance, histidine, phenylalanine, tyrosine and
tryptophan are aromatic, and are observed to play special structural roles in membrane proteins. Amino acid
names are frequently abbreviated to their first three letters, for instance Gly for glycine, except for isoleucine,
asparagine, glutamine and tryptophan, which are abbreviated to Ile, Asn, Gln and Trp, respectively. The rare
amino acid selenocysteine has the three-letter abbreviation Sec and the one-letter code U. It is conventional to
write nucleotides in lower case and amino acids in upper case. Thus atg = adenine-thymine-guanine and
ATG = Alanine-Threonine-Glycine.
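The case convention can be illustrated with a minimal Python sketch. The dictionaries and the function name spell_out are hypothetical, introduced only for this illustration, and the amino acid table holds just the three letters needed for the example.

```python
# One-letter codes: lower case for nucleotides, upper case for amino acids.
NUCLEOTIDES = {"a": "adenine", "c": "cytosine", "g": "guanine", "t": "thymine"}
# Only the three amino acids needed for this example are listed.
AMINO_ACIDS = {"A": "Alanine", "T": "Threonine", "G": "Glycine"}


def spell_out(symbols: str) -> str:
    """Expand one-letter codes, using the case to choose the alphabet."""
    table = NUCLEOTIDES if symbols.islower() else AMINO_ACIDS
    return "-".join(table[s] for s in symbols)


print(spell_out("atg"))  # adenine-thymine-guanine
print(spell_out("ATG"))  # Alanine-Threonine-Glycine
```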
Structure
The linear sequence of amino acids in a protein molecule is its primary
structure. Regions of local regularity within a protein fold (e.g. α-helices,
β-turns, β-strands) constitute its secondary structure. Proteins show recurrent
patterns of interaction between helices and sheets close together in the
sequence. These arrangements of α-helices and/or β-strands into discrete
folding units (e.g. β-barrels, β-α-β units, Greek keys, etc.) are called
super-secondary structures (Fig. 3.15).
The overall fold of a protein chain, formed by the packing of its
secondary and/or super-secondary structure elements, is its tertiary
structure. The arrangement of separate protein chains in a protein molecule
with more than one subunit is its quaternary structure. The arrangement of
separate molecules, as in protein-protein or protein-nucleic acid
interactions, is referred to as quinternary structure.
Fig. 3.15 Common supersecondary structures: (a) α-helix hairpin, (b) β-hairpin, (c) β-α-β
unit. The chevrons indicate the direction of the chain. (Source: Lesk, A.M., Introduction to
Bioinformatics, Oxford University Press).
Domains
Many proteins contain compact units within the folding pattern of a single
chain that look as if they should have independent stability. These are called
domains. In the hierarchy, domains fall between super-secondary structures
and the tertiary structure of a complete monomer. Modular proteins are
multidomain proteins which often contain many copies of closely related domains.
The most general classification of families of protein structures is based
on the secondary and tertiary structures of the protein (Table 3.3).
Motif
The active site of an enzyme which takes part in catalytic function occupies
only a small portion on the protein molecule. If the protein is stretched into a
polypeptide chain the active site region may be found distributed as discrete
patches on the primary structure. Such conserved small regions which confer
characteristic minor shape to the protein are called motifs. In nucleic acids, motifs are short
strings of base pairs characteristic of sites regulating particular events in gene
expression or chromosome replication, such as 5’ splice sites or origins of
replication.
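Because a motif is a short conserved pattern, it can be searched for with a regular expression. The sketch below uses the N-glycosylation pattern N-{P}-[ST]-{P} (N, then any residue except P, then S or T, then any residue except P) as an example; the protein sequence and the function name find_motif are made up for illustration.

```python
import re

# The pattern N-{P}-[ST]-{P} written as a regular expression.
# The lookahead (?=...) lets overlapping occurrences be reported too.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein: str) -> list[tuple[int, str]]:
    """Return (1-based position, matched residues) for each occurrence."""
    return [(m.start() + 1, m.group(1))
            for m in N_GLYCOSYLATION.finditer(protein)]


# Hypothetical sequence, made up for illustration.
print(find_motif("MKNGSALNPTQNVTR"))  # [(3, 'NGSA'), (12, 'NVTR')]
```

The candidate at position 8 (NPTQ) is rejected because proline follows the asparagine.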
Folding Patterns
Within these broad categories, protein structures show a variety of folding
patterns. Among proteins with similar folding patterns, there are families that
share enough features of structure, sequence and function to suggest
evolutionary relationship. Classification of protein structures occupies a key
position in bioinformatics – as a bridge between sequence and function.
The amino acid sequence of a protein dictates its three dimensional
structure. When placed in a medium of suitable solvent and temperature
conditions, like the one provided by a cell interior, proteins fold spontaneously
to their native active states. If amino acid sequences contain sufficient
information to specify three-dimensional structures of proteins, it should be
possible to devise an algorithm to predict protein structure from amino acid
sequence. But this has proved difficult. Hence scientists have concentrated on
more tractable approaches: secondary structure prediction, fold recognition and homology modeling.
Biochemical Nature
Biochemically, proteins play a variety of roles in life processes: there are
structural proteins (e.g. viral coat proteins, the horny outer layer of human and
animal skin, and proteins of the cytoskeleton); proteins that catalyse chemical
reactions (the enzymes); transport and storage proteins (hemoglobin);
regulatory proteins, including hormones and receptors; signal transduction
proteins; proteins that control genetic transcription; and proteins involved in
recognition, including cell adhesion molecules, and antibodies and other
proteins of the immune system. Proteins are large molecules. In many cases
only a small part of the structure – an active site – is functional, the rest
existing only to create and fix the spatial relationship among the active site
residues.
Chemical Nature
Chemically, protein molecules are long polymers typically containing several
thousand atoms composed of a uniform repetitive backbone (or main chain)
with a particular side chain attached to each residue. The polypeptide chains
of proteins have a main chain of constant structure and side chains that vary
in sequence. The side chains may be chosen, independently, from the set of
20 standard amino acids. It is the sequence of the side chains that gives each
protein its individual structural and functional characteristics.
Chaperones
Some proteins require chaperones to fold, but these catalyze the process rather
than directing it. Molecular chaperones are helper proteins that ensure that
growing protein chains fold correctly. Chaperones are thought to block
incorrect folding pathways that would lead to inactive products, by preventing
incorrect aggregation and precipitation of unassembled subunits. They
probably bind temporarily to interactive surfaces that are exposed only during
the early stages of protein assembly.
Functions
Proteins serve several vital functions: (i) catalyzing various biochemical
reactions (e.g. enzymes), (ii) acting as messengers (e.g. neurotransmitters), (iii) acting as
control elements that regulate cell reproduction, (iv) promoting growth and development
of various tissues (e.g. trophic factors), (v) transporting oxygen in the blood (e.g.
hemoglobin), (vi) defending against diseases (e.g. antibodies), etc. The function of
a protein is determined by its shape.
STUDY QUESTIONS
1. Who coined the word gene?
2. Who isolated nucleic acid first?
3. Who gave the name DNA?
4. What is the contribution of Erwin Chargaff?
5. Who proposed the DNA double helix model?
6. Who proposed one gene-one enzyme hypothesis?
7. What is a chromosome?
8. What is a centromere? Name the different types of centromere.
9. What are the different kinds of RNAs?
10. What is polyadenylation?
11. What is transcription?
12. What is translation?
13. What are the different structures of protein?
14. What is the function of chaperones?
C H A P T E R
Genomics
[Fig. 4.1 diagram labels: high-resolution genetic maps, physical maps, sequence maps, transcript maps, polypeptide maps and interaction maps, together with chromosome conservation and evolution, gene expression and protein expression.]
Fig. 4.1 Genomic analysis: A hierarchical view of genomic analysis (Source: A.J.F. Griffiths
et al., Modern Genetic Analysis: Integrating Genes and Genomes, W.H. Freeman &
Company, New York, 2002).
DNA and Protein Sequencing and Analysis 4.3
Approaches to Genome Sequencing
Determination of the complete genomic DNA sequence of an organism allows
attempts to be made to identify all of an organism’s genes and therefore define
its genotype. Special experimental techniques have been devised to carry out
the difficult task of manipulating and characterizing large numbers of genes
and large amounts of DNA.
One approach to genome sequencing is first to generate high resolution
genetic and physical maps of the genome to define segments of increasing
resolution and then to sequence the segments in an orderly manner. Another
approach, the direct shotgun approach, is to break up the genome into random,
overlapping fragments, then to sequence the fragments and assemble the
sequences using computer algorithms.
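The assembly step of the shotgun approach can be sketched as a greedy merge of the reads with the longest overlaps. This is only a toy illustration under simplifying assumptions (error-free reads, a single strand, no repeats); real assemblers are far more elaborate. The function names and the reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


# Hypothetical overlapping reads, given in random order.
reads = ["CCTTGAGA", "ATGGCGTAC", "CGTACCTTG"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTTGAGA']
```

A greedy strategy is easy to follow but can be misled by repeated sequences, which is one reason why mapping information remains useful.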
Analysis of genomic sequences reveals that each organism has an array
of genes required for basic metabolic processes and genes whose products
determine the specialized function of the organism. Complete genome
sequencing therefore provides a knowledge base on which to build information
about gene and protein expression, but is not sufficient on its own to define the
entire protein component of the organism.
Proteomics
Proteomics is the cataloging and analysis of proteins to determine when a
protein is expressed, how much is made, and with what other proteins it can
interact. The term proteomics indicates proteins expressed by a genome. It is
the systematic analysis of protein profiles of tissues. The word proteome refers
to all proteins produced by a species at a particular time. The proteome varies with
time and is defined as “the proteins present in one sample (tissue, organism,
cell culture) at a certain point in time”.
Proteomics represents the genome at work and it is a dynamic process.
Proteomics can be divided into expression proteomics (the study of global
changes in protein expression) and cell-map proteomics (the systematic study
of protein-protein interactions through the isolation of protein complexes).
There is an increasing interest in proteomics because DNA sequence
information provides only a static snapshot of the various ways in which the
cell might use its proteins whereas the life of the cell is a dynamic process.
Proteins expressed by an organism change during growth, disease and the
death of cells and tissues. Proteomics attempts to catalog and characterize these
proteins, compare variations in their expression levels in healthy and diseased
tissues, study their interactions and identify their functional roles using leading
edge technological capability. Proteomics begins with the functionally modified
protein and works back to the gene responsible for its production.
Goals
The goals of proteomics are: (i) to identify every protein in the proteome, (ii) to
determine the sequence of each protein and enter the data into databases
and (iii) to analyse globally protein levels in different cell types and at different
stages in development.
Uses
Proteomics will contribute greatly to our understanding of gene function in the
post-genomic era. Differential display proteomics for comparison of protein
levels has potential application in a wide range of diseases. Because it is often
difficult to predict the function of a protein based on homology to other proteins
or even their three-dimensional structure, determination of components of a
protein complex or of a cellular structure is central in functional analysis.
Proteomics will also play an important role for drug discovery and
development by characterizing the disease process directly by finding sets of
proteins (pathways or clusters) that together participate in causing the disease.
Proteomics can be seen as a mass-screening approach to molecular
biology, which aims to document the overall distribution of proteins in cells,
identify and characterize individual proteins of interest, and ultimately
elucidate their relationships and functional roles.
Such direct protein-level analysis has become necessary because the
study of genes, by genomics, cannot adequately predict the structure or
dynamics of proteins, since it is at the protein level that most regulatory
processes take place, where disease processes primarily occur and where most
drug targets are to be found.
[Fig. 4.2 diagram: successive levels of resolution, from cytogenetic mapping of a gene, through high-resolution genetic mapping with molecular markers 1 to 3 and physical mapping with cloned fragments, to DNA sequencing (TTAGCTTAACGTACTGGTACCGTACCGTGGCTTAT).]
Fig. 4.2 Overview of the general approaches of whole genome mapping. General scheme
for making a genome map by using analyses at increasing levels of resolution (Source: A.J.F.
Griffiths et al., Modern Genetic Analysis: Integrating Genes and Genomes, W.H. Freeman &
Company, New York, 2002).
4.3 DNA SEQUENCING METHOD
Methods are available to determine the order of nucleotides in DNA. One of the
methods is called chain termination sequencing or dideoxy sequencing or the
Sanger method after its inventor. The basic sequencing reaction consists of a
single-stranded DNA template, a primer to initiate the nascent chain, four
deoxyribonucleoside triphosphates (dATP, dCTP, dGTP and dTTP) and the
enzyme DNA polymerase, which inserts the complementary nucleotides in the
nascent DNA strand using the template as a guide.
Normally four DNA polymerase reactions are set up, each containing a
small amount of one of four dideoxyribonucleoside triphosphates (ddATP,
ddCTP, ddGTP and ddTTP). These act as chain terminating competitive
inhibitors of the reaction. Each of the four reaction mixtures generates a nested
set of DNA fragments, each terminating at a specific base (Fig. 4.3).
(a)
Template
5′ GGATTCTGCTACGGA 3′
                   5′ Primer

Reaction including ddATP
5′ GGATTCTGCTACGGA 3′
          ddATGCCT
       ddACGATGCCT
     ddAGACGATGCCT
    ddAAGACGATGCCT

Reaction including ddCTP
5′ GGATTCTGCTACGGA 3′
              ddCT
             ddCCT
        ddCGATGCCT
  ddCTAAGACGATGCCT
 ddCCTAAGACGATGCCT

Reaction including ddGTP
5′ GGATTCTGCTACGGA 3′
            ddGCCT
         ddGATGCCT
      ddGACGATGCCT

Reaction including ddTTP
5′ GGATTCTGCTACGGA 3′
               ddT
           ddTGCCT
   ddTAAGACGATGCCT

(b) Gel with lanes A, C, G and T; bases listed from the top of the gel (H) to the bottom (L): C C T A A G A C G A T G C C T
Fig. 4.3 Principle of DNA sequencing (a) Four sequencing reactions are set up, each
containing a limiting amount of one of the four dideoxynucleotides. Each reaction generates
a nested set of fragments terminating with a specific base as shown. (b) A polyacrylamide
gel is shown with each reaction running in a separate lane for clarity. In a typical automated
reaction, all reactions would be pooled prior to electrophoresis and the terminal nucleotide
determined by scanning for a specific fluorescent tag. (Source: Twyman, R.M., Advanced
Molecular Biology © BIOS Scientific Publishers Ltd., 1998).
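The principle of Fig. 4.3 can be mimicked in a few lines of Python. The sketch below is idealised: it assumes that every possible termination product is present and it ignores the primer. It builds the nested fragment set of each dideoxy reaction for the template of Fig. 4.3, then reads the sequence by sorting all fragments by length, as a gel does. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def new_strand(template: str) -> str:
    """Strand synthesised on the template, written 5'->3'."""
    return "".join(COMPLEMENT[b] for b in reversed(template))


def terminated_fragments(template: str) -> dict[str, list[str]]:
    """For each ddNTP, the nested set of chains ending in that base."""
    strand = new_strand(template)
    reactions = {base: [] for base in "ACGT"}
    for end in range(1, len(strand) + 1):
        fragment = strand[:end]
        reactions[fragment[-1]].append(fragment)
    return reactions


def read_gel(reactions: dict[str, list[str]]) -> str:
    """Order all fragments by length and read off their terminal bases."""
    lanes = [(len(f), base) for base, frags in reactions.items() for f in frags]
    return "".join(base for _, base in sorted(lanes))


template = "GGATTCTGCTACGGA"  # template of Fig. 4.3, written 5'->3'
reactions = terminated_fragments(template)
print(reactions["G"])       # ['TCCG', 'TCCGTAG', 'TCCGTAGCAG']
print(read_gel(reactions))  # TCCGTAGCAGAATCC
```

The fragments are written here 5'->3', whereas Fig. 4.3 draws them in the opposite direction so that they align under the template.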
Automated Methods
Most DNA sequencing reactions are now automated. Each reaction
mixture is labeled with a different fluorescent tag (on either the primer or on
one of the nucleotide substrates), which allows the terminal base of each
fragment to be identified by a scanner. All four reaction mixtures are then
pooled and the DNA fragments are separated by polyacrylamide gel
electrophoresis (PAGE). Smaller DNA fragments travel faster than the larger
ones.
Thus the nested DNA fragments are separated according to size. The
resolution of PAGE allows polynucleotides differing in length by only one
residue to be separated. Near the bottom of the gel, the scanner scans the
fluorescent tag as each DNA fragment moves past, and this is converted into
trace data, displayed as a graph comprising colored peaks corresponding to
each base (Fig. 4.4).
[Fig. 4.4 trace: base calls A C C A G C G G C T C T]
Fig. 4.4 A sample of a high quality sequence trace, where all peaks are easily called.
Peaks are typically printed in different colors (shown here as different line styles) to aid visual
interpretation. Software such as Phred is used to read the peaks and assign quality values (A
= dark line; C = lighter; G = dotted line; T = dark line with breaks). (Source: Westhead, D.R.
et al., Instant Notes: Bioinformatics, Bios Scientific Publishers Ltd., 2003)
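Phred-style quality values are conventionally related to the estimated probability P that a base call is wrong by Q = -10 log10(P), so that Q = 20 corresponds to one error in 100 calls and Q = 30 to one error in 1000. A minimal sketch of the conversion follows; the function names are hypothetical.

```python
import math


def phred_quality(p_error: float) -> float:
    """Quality value for a given probability that the base call is wrong."""
    return -10 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse conversion: probability of error for a given quality value."""
    return 10 ** (-quality / 10)


for p in (0.1, 0.01, 0.001):
    print(p, round(phred_quality(p)))
```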
[Fig. 4.5 diagram: a gene drawn 5′ to 3′ as 5′ UTR, exon, intron, exon, intron, exon, 3′ UTR; transcription gives the mRNA and translation gives the protein.]
Fig. 4.5 In eukaryotic systems exons form a part of the final coding sequence (CDS),
whereas introns are transcribed, but are then edited out by the cellular machinery before the
mRNA assumes its final form. Here, the gene is made up of three exons and two introns.
Exons, unlike coding sequences, are not simply terminated by stop codons, but rather by
intron-exon boundaries; the untranslated regions (UTRs) occur at either end of the gene; if
transcription begins at the 5' end of the sequence, then the 5' UTR contains promoter sites
(such as the TATA box), and the 3' UTR follows the stop codon. (Source: Attwood, T.K. and
Parry-Smith, D.J., Introduction to Bioinformatics, Pearson Education Ltd., 2001)
Primer Design
The location of the primers on a DNA source will be determined relative to the
start and stop codons of the gene. The default option will find the ‘forward’
primer of a given length that resides within the first 35 basepairs upstream of
the coding sequence. The default option will also find the ‘reverse’ primer that
resides within 35 basepairs immediately following the coding sequence. We
can alter the endpoints of either of these by changing the number in the
‘Distance from the Start’ and ‘Distance from the Stop’ fields. We can also define
the exact 5’ endpoints of the primers by selecting the button marked ‘YES’ on
the line which asks about the exact endpoints.
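The placement rule described above can be sketched as follows. This is not the algorithm of any particular primer design program; it simply takes the outermost primer of a given length from the 35-basepair windows on either side of the coding sequence, and it ignores melting temperature, GC content and self-complementarity. All names and the example sequence are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))


def flanking_primers(dna: str, cds_start: int, cds_end: int,
                     primer_len: int = 20, window: int = 35) -> tuple[str, str]:
    """Pick a forward and a reverse primer flanking a coding sequence.

    cds_start is the 0-based index of the first base of the start codon,
    cds_end the index just past the last base of the stop codon.
    """
    upstream = dna[max(0, cds_start - window):cds_start]
    downstream = dna[cds_end:cds_end + window]
    if len(upstream) < primer_len or len(downstream) < primer_len:
        raise ValueError("not enough flanking sequence for primers of this length")
    forward = upstream[:primer_len]  # same strand as the gene
    # The reverse primer anneals to the gene strand, so it is the
    # reverse complement of the far end of the downstream window.
    reverse = reverse_complement(downstream[-primer_len:])
    return forward, reverse


# Made-up sequence: 40 bp upstream, a tiny CDS, 40 bp downstream.
dna = "ACGT" * 10 + "ATGAAATAA" + "GGCC" * 10
forward, reverse = flanking_primers(dna, cds_start=40, cds_end=49)
print(forward, reverse)
```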
Procedure
Open the Internet browser and type the URL address:
http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi. Paste the sequence in
the text box. Choose the primers to be picked by selecting the left and right primer options. Press the ‘Pick
Primer’ button and the result will be displayed in a new page.
[Fig. 4.6 diagram: a cDNA drawn 5′ to 3′, with the EST, CDS and UTR regions marked.]
Fig. 4.6 When constructing a library, complementary DNA (cDNA) is run off from the
mRNA stage, using reverse transcriptase. ESTs are then generated using a single read of
each clone on an automated sequencing system. In the mRNA, the start codon may be
flanked by a Kozak sequence, which gives additional confidence to the prediction of the
start of the CDS. (Source: Attwood, T.K. and Parry-Smith, D.J., Introduction to
Bioinformatics, Pearson Education Ltd., 2001)
Table 4.1: Percentage use of codons for serine in a variety of model organisms.
There are six possible codons for serine, which in principle could be used with
equal frequency whenever serine is specified in a CDS. In practice, however,
organisms are highly selective in the particular codons they use. The characteristic
differences in usage reflected here can be used to help diagnose regions of DNA
that may code for protein.
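The percentage use of the six serine codons (TCT, TCC, TCA, TCG, AGT and AGC) in a coding sequence can be counted with a short script. The sketch below assumes that the sequence is given in frame, starting at the first base of a codon; the function name and the example sequence are hypothetical.

```python
from collections import Counter

SERINE_CODONS = ("TCT", "TCC", "TCA", "TCG", "AGT", "AGC")


def serine_codon_usage(cds: str) -> dict[str, float]:
    """Percentage use of each serine codon in an in-frame coding sequence."""
    cds = cds.upper()
    # Split into codons, ignoring any incomplete codon at the end.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in SERINE_CODONS)
    total = sum(counts.values())
    return {c: (100 * counts[c] / total if total else 0.0)
            for c in SERINE_CODONS}


# Made-up CDS: ATG TCT TCT AGC TCA TAA
print(serine_codon_usage("atgtcttctagctcataa"))
```

For this made-up sequence TCT accounts for 50% of the serine codons, and AGC and TCA for 25% each.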
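[Fig. 4.7 diagram: (a) a chain growing 5′ to 3′ on the template DNA until ddGTP is incorporated; (b) four chains terminated by ddGTP at different positions, each opposite a C in the template (3′ C C C C 5′).]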
Fig. 4.7 Template DNA sequencing: (a) chain synthesis and termination by incorporation of
ddGTP; (b) the family of chains terminated at different positions by ddGTP. Since G pairs
with C, the template sequence contains C at each of these positions.
[Fig. 4.8 diagram: a scaffold made of sequenced contigs 1, 2 and 3 separated by gaps, with paired-end reads spanning each gap.]
Fig. 4.8 Whole genome shotgun sequencing assembly. First, the unique sequence overlaps
between sequence reads are used to build contigs. Paired-end reads are then used to span
gaps and to order and orient the contigs into larger units called scaffolds. (Source: A.J.F.
Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002)
Typically 36 lanes are run on a gel at once. The output consists of a series
of (colour-coded) peaks, beneath which is a string of base symbols. Sometimes
the software that interprets the chromatogram is unable to determine which base
should be called at a specific position, so a ‘-’ appears. Such ambiguous positions
are replaced by ‘N’ in the resulting sequence file.
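The replacement of uncalled positions is a simple string operation, sketched here with hypothetical function names and a made-up read.

```python
def mask_uncalled(base_calls: str) -> str:
    """Replace uncalled positions ('-') with the ambiguity symbol N."""
    return base_calls.upper().replace("-", "N")


def fraction_ambiguous(sequence: str) -> float:
    """Fraction of positions recorded as N."""
    return sequence.count("N") / len(sequence) if sequence else 0.0


read = mask_uncalled("acc-gcgg-tct")  # made-up base calls
print(read, fraction_ambiguous(read))
```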
Fig. 4.9 Overview of how ESTs are constructed, starting from a cell or tissue sample. (Source: Wolfsberg, T.G. and Landsman, D.,
Expressed Sequence Tags (ESTs), in Bioinformatics: a practical guide to the analysis of
genes and proteins (eds) Baxevanis, A.D. and Francis Ouellette, B.F., John Wiley & Sons,
Inc, 2002)
Fig. 4.10 The alignment of fully sequenced cDNAs and ESTs with genomic DNA. The solid
lines indicate regions of alignment; for the cDNA, these are the exons of the gene. The dots
between segments of cDNA or ESTs indicate regions in the genomic DNA that do not align
with cDNA or EST sequences; these are the locations of the introns. The numbers above the
cDNA line indicate the base coordinates of the cDNA sequence, where base 1 is the 5'-most
base and base 816 is the 3'-most base of the cDNA. For the ESTs, only a short
sequence read from either the 5' or 3' end of the corresponding cDNA is obtained. This
establishes the boundaries of the transcription unit, but it is not informative about the
internal structure of the transcript unless the EST sequences cross an intron (as is true for
the 3' EST depicted here). (Source: A.J.F. Griffiths et al., Modern Genetic Analysis, W.H.
Freeman and Company, 2002).
Determination of Structure
Protein structures can be determined using X-ray crystallography and nuclear
magnetic resonance spectroscopy (NMR). X-ray crystallography involves the
reconstruction of atomic positions based on the diffraction pattern of X-rays
through a precisely orientated protein crystal. Scattered X-rays undergo constructive
and destructive interference, generating an ordered pattern of signals called
reflections.
Structural determination depends on three variables: the amplitude and
phase of the scattering (which depend on the number of electrons in each
atom), and the wavelength of the incident X-rays. The basis of NMR
spectroscopy is that some atoms, including natural isotopes of nitrogen,
phosphorus and hydrogen, behave as tiny magnets and can switch between
magnetic spin states in an applied magnetic field. This is achieved by the
absorbance of low-frequency (radio-frequency) electromagnetic radiation, generating NMR
spectra. Other methods such as magic angle spinning NMR and circular
dichroism spectroscopy are also used.
Prediction
There are three main approaches to secondary structure prediction:
(i) empirical statistical methods that use parameters derived from known 3D
structures; (ii) methods based on physicochemical criteria (e.g., fold
compactness, hydrophobicity, charge, hydrogen bonding potential, etc.) and
(iii) prediction algorithms that use known structures of homologous proteins
to assign secondary structure.
One of the standard empirical statistical methods is that of Chou and
Fasman, which is based on observed amino acid conformational preferences in
non-homologous proteins. But in spite of being a ‘standard’ approach, like all
other methods it has limited accuracy, because the conformational potentials of the
amino acids cannot be derived reliably. By contrast, for prediction algorithms, the
use of multiple sequence data can improve matters and may yield
enhancements of several percent. Tertiary structure prediction (especially
methods that build on secondary structure predictions) is still further beyond reach.
[Fig. 4.11 diagram: poly(A) RNA is primed with biotinylated oligo(dT) for cDNA synthesis; the cDNA is cut with NlaIII (CATG) and bound to streptavidin-coated magnetic beads; the sample is divided into ‘Pool A’ and ‘Pool B’; the products are ligated and PCR amplified, giving a concatenated sequence such as
CATGCCTAGTCAGGCGACTTCACATGCCAAAGTGCTTTCGAGACATGGAAGTCCTACGATCATGGCATG]
Fig. 4.11 Simplified outline method for serial analysis of gene expression. NlaIII is a frequent-cutting
restriction enzyme used initially to generate the 3' cDNA fragments and provide the
overhang for linker ligation, and later to remove the linkers prior to concatemerization of the
ditags. FokI is a type IIS restriction enzyme with a recognition site in the linker that generates the
SAGE tags by cutting the DNA a few bases downstream. (Source: D.R. Westhead et al., Instant
Notes: Bioinformatics, Bios Scientific Publishers Ltd. 2003)
4.8.1 DNA Microarrays
Presently, DNA arrays (DNA chips) are used widely. A DNA microarray or
DNA chip is a dense grid of DNA elements (often called features or cells)
arranged on a miniature support, such as a nylon filter or glass slide. Each
feature represents a different gene. The specificity of nucleic acid hybridization
is such that a particular DNA or RNA molecule can be labeled (with a
radioactive or fluorescent tag) to generate a probe, and can be used to isolate a
complementary molecule from a very complex mixture, such as whole DNA or
whole cellular RNA.
The array is usually hybridized with a complex RNA probe, i.e. a probe
generated by labeling a complex mixture of RNA molecules derived from a
particular cell type. The composition of such a probe reflects the levels of
individual RNA molecules in its source. If non-saturating hybridization is
carried out, the intensity of the signal for each feature on the microarray
represents the level of the corresponding RNA in the probe, thus allowing the
relative expression levels of thousands of genes to be visualized simultaneously.
The most widely used method involves the robotic spotting of individual
DNA clones onto a coated glass slide. Such spotted DNA arrays can have a
density of up to 5000 features per square cm. The features comprise double-
stranded DNA molecules (genomic clones or cDNAs) up to 400 bp in length
and must be denatured prior to hybridization (Fig. 4.12).
[Fig. 4.12 diagram: DNA clones are PCR amplified, purified and robotically printed to form the microarray; test and reference samples are reverse transcribed, labelled with fluor dyes and hybridized to the microarray; the dyes are excited by laser 1 and laser 2, the emission is quantified in the red and green wavelength bands, and relative expression levels are analyzed by computer.]
Fig. 4.12 The process of differential expression measurement using a DNA microarray.
DNA clones are first amplified and printed out to form a microarray. Test and reference RNA
samples are then reverse transcribed and labeled with different fluor dyes (Cy5 and Cy3),
which fluoresce in different (red, green) wavelength bands. These are hybridized to the
microarray. Fluorescence of each dye is then measured for each sample. (Source: Duggan
D.J. et al., Expression profiling using cDNA microarrays. Nature Genet. 21 (suppl 2): pp
10-14, 1999).
Genechips
Another method is on-chip photolithographic synthesis, in which short
oligonucleotides are synthesized in situ during chip manufacture. These arrays
are known as Genechips. They have a density of up to 1,000,000 features per
square cm, each feature comprising up to 10^9 single-stranded oligonucleotides
25 nt in length. Each gene on a Genechip is represented by 20 features (20
overlapping oligos), and 20 mismatching controls are included to normalize
for nonspecific hybridization.
Fluorescent probes are used for spotted DNA arrays, since different
fluorophores can be used to label different RNA populations. These can be
simultaneously hybridized to the same array, allowing differential gene
expression to be monitored directly. In Genechips, hybridization is carried out
with separate probes on two identical chips and the signal intensities are
measured and compared by the accompanying analysis software.
Data Analysis
The raw data from microarray experiments consist of images from hybridized
arrays. The exact nature of the image depends on the array platform (the type
of array used). DNA arrays may contain many thousands of features.
Therefore, data acquisition and analysis must be automated. The software for
initial image processing is normally provided with the scanner. This allows
the boundaries of individual spots to be determined and the total signal
intensity to be measured over the whole spot (signal volume). The signal
intensity should be corrected for background and control measures should be
included to measure nonspecific hybridization and variable hybridization
across arrays.
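For a spotted array hybridized with two differently labelled samples, a common way of expressing the corrected signal of a feature is the log2 ratio of the background-subtracted test and reference intensities. The following is a minimal sketch under that assumption; the function name, the floor value and the intensities are hypothetical.

```python
import math


def log2_ratio(test_signal: float, test_background: float,
               ref_signal: float, ref_background: float,
               floor: float = 1.0) -> float:
    """Background-corrected log2 (test / reference) ratio for one feature.

    Corrected intensities are not allowed to fall below `floor`, so that
    the logarithm stays defined for very weak spots.
    """
    test = max(test_signal - test_background, floor)
    ref = max(ref_signal - ref_background, floor)
    return math.log2(test / ref)


# Made-up intensities: the test sample is four times brighter after correction.
print(log2_ratio(2100, 100, 600, 100))  # 2.0
```

A value of 0 means equal expression in the two samples, positive values mean higher expression in the test sample, and negative values mean lower expression.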
The aim of data processing is to convert the hybridization signals into
numbers which can be used to build a gene expression matrix. The
interpretation of microarray experiments is carried out by grouping the data
according to similar expression profiles. Clustering is a way of simplifying
large data sets by partitioning similar data into specific groups. Many software
applications are available for implementing microarray data analysis methods
(Table 4.2).
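The idea of grouping similar expression profiles can be shown with a very simple scheme: each gene joins the first cluster whose founding gene has a Pearson correlation with it above a threshold. This is only an illustrative sketch, not the method of any of the packages listed; the gene names, the values and the threshold are made up, and profiles with no variation are not handled.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_profiles(matrix: dict[str, list[float]],
                     threshold: float = 0.9) -> list[list[str]]:
    """Group genes whose profiles correlate with a cluster's first gene."""
    clusters: list[list[str]] = []
    for gene, profile in matrix.items():
        for cluster in clusters:
            if pearson(profile, matrix[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])  # no similar cluster: start a new one
    return clusters


# Hypothetical gene expression matrix: rows are genes, columns are conditions.
matrix = {
    "geneA": [0.1, 1.0, 2.1, 3.0],
    "geneB": [0.0, 0.9, 2.0, 3.2],
    "geneC": [3.1, 2.0, 1.1, 0.0],
    "geneD": [2.9, 2.2, 0.9, 0.1],
}
print(cluster_profiles(matrix))  # [['geneA', 'geneB'], ['geneC', 'geneD']]
```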
Applications
DNA microarrays have the following applications:
(i) Investigating cellular states and processes: Patterns of expression that
change with cellular state can give clues to the mechanisms of the
processes such as sporulation, or the change from aerobic to anaerobic
metabolism.
(ii) Diagnosis of disease: Testing for the presence of mutations can confirm
the diagnosis of a suspected genetic disease, detect late-onset
conditions such as Huntington disease, and determine whether
prospective parents are carriers of a gene that could threaten their
children.
Table 4.2: Internet resources for microarray expression analysis. The first two sites
are very comprehensive and contain hundreds of links to databases, software and
other resources. Two web-based suites of analysis programs are also listed as well as
some databases that store microarray and other gene expression data.
(iii) Genetic warning signs: Some diseases are not determined entirely and
irrevocably by genotype, but the probability of their development is
correlated with genes or their expression patterns. A person aware of an
enhanced risk of developing a condition can in some cases improve his
or her prospects by adjustments in lifestyle.
(iv) Drug selection: Detection of genetic factors, and in some cases patterns
of gene expression, that govern responses to drugs.
(v) Classification of disease: Different types of leukemia can be identified by
different patterns of gene expression. Knowing the exact type of disease
is important in selecting optimal treatments.
(vi) Target selection for drug design: Proteins showing enhanced transcription
in particular disease states might be candidates for attempts at
pharmacological intervention (provided that it can be demonstrated, by
other evidence, that enhanced transcription contributes to or is essential for
the maintenance of the disease state).
(vii) Pathogen resistance: Comparisons of genotypes or expression patterns
between bacterial strains susceptible and resistant to an antibiotic can
point to the proteins involved in the mechanism of resistance.
Isoelectric focusing
Isoelectric focusing means allowing proteins to migrate in an electric field until
the pH of the buffer is the same as the pI of the protein. The pI of the protein is
the pH at which it carries no net charge and therefore does not move in the
applied electric field. Next the gel is equilibrated in the detergent sodium
dodecylsulphate (SDS), which binds uniformly to all proteins and confers a
net negative charge. Therefore, separation in the second dimension can be
carried out on the basis of molecular mass.
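The pI of a protein can be estimated from its sequence by finding the pH at which the calculated net charge is zero. The sketch below applies the Henderson-Hasselbalch relation to each ionizable group and locates the zero by bisection. The pKa values are approximate and are assumptions made for this illustration only; published scales differ, and the effect of the protein environment on individual groups is ignored.

```python
# Approximate pKa values (assumed for illustration; published scales differ).
POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "N_term": 9.0}
NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_term": 2.0}


def net_charge(protein: str, ph: float) -> float:
    """Net charge of a protein at a given pH (Henderson-Hasselbalch)."""
    charge = 1 / (1 + 10 ** (ph - POSITIVE["N_term"]))
    charge -= 1 / (1 + 10 ** (NEGATIVE["C_term"] - ph))
    for residue in protein:
        if residue in POSITIVE:
            charge += 1 / (1 + 10 ** (ph - POSITIVE[residue]))
        elif residue in NEGATIVE:
            charge -= 1 / (1 + 10 ** (NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(protein: str, tol: float = 1e-4) -> float:
    """pH at which the net charge is zero, found by bisection."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(protein, mid) > 0:
            lo = mid  # still positively charged: the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2


# Made-up peptides: a basic one and an acidic one.
print(round(isoelectric_point("KKKK"), 1), round(isoelectric_point("DDDD"), 1))
```

A peptide rich in lysine comes out with a pI well above 7 and one rich in aspartate with a pI well below 7, which is why the two would focus at opposite ends of a pH gradient.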
After the second dimension separation, the protein gel is stained with a
universal dye to reveal the position of all protein spots. Reproducible
separations can then be carried out with similar samples to allow comparison
of protein expression levels. It provides a diagnostic protein fingerprint of any
particular sample (Fig. 4.13).
The stained protein gel is scanned to obtain a digital image. Individual
protein spots are then detected and quantified, and the intensity of the signal
for each spot is corrected for local background. Several algorithms are
available based on Gaussian fitting or Laplacian of Gaussian spot detection.
Spots whose morphology deviates from a single Gaussian shape can be
interpreted using a model of overlapping shapes.
Fig. 4.13 A section from a 2D protein gel. The sample has been separated on the basis of
isoelectric pH (horizontal dimension) and molecular mass (vertical dimension). Each spot
should correspond to a single protein.
Other Methods
A simpler approach is line and chain analysis, in which columns of pixels
from the digital image are scanned for peaks in signal density. This process is
repeated for adjacent pixel columns allowing the algorithm to identify the
centers of spots and their overall signal intensity. Another method is known as
watershed transformation. In this method, pixel intensities are viewed as a
topographical map so that hills and valleys can be identified. This is useful for
separating clusters, chains and small spots overlapping with larger ones
(shouldered spots) and also for merging regions of a single spot.
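The column-scanning idea behind line and chain analysis can be sketched as follows. Each pixel column is scanned for local maxima, and maxima found at nearly the same row in adjacent columns are chained together into one spot. This is a simplified illustration rather than a published implementation: the function names, the threshold and the tiny image are made up, and the reported intensity is only the sum of the chained peak values, not the full spot volume.

```python
def column_peaks(image: list[list[int]], min_intensity: int) -> list[tuple[int, int, int]]:
    """Scan each pixel column for local maxima of at least min_intensity."""
    n_rows, n_cols = len(image), len(image[0])
    peaks = []
    for col in range(n_cols):
        column = [image[row][col] for row in range(n_rows)]
        for row in range(1, n_rows - 1):
            v = column[row]
            if v >= min_intensity and v > column[row - 1] and v >= column[row + 1]:
                peaks.append((row, col, v))
    return peaks


def chain_spots(peaks: list[tuple[int, int, int]]) -> list[dict]:
    """Chain peaks in adjacent columns; the strongest peak is the centre."""
    chains: list[list[tuple[int, int, int]]] = []
    for row, col, value in sorted(peaks, key=lambda p: (p[1], p[0])):
        for chain in chains:
            last_row, last_col, _ = chain[-1]
            if col - last_col == 1 and abs(row - last_row) <= 1:
                chain.append((row, col, value))
                break
        else:
            chains.append([(row, col, value)])
    spots = []
    for chain in chains:
        centre = max(chain, key=lambda p: p[2])
        spots.append({"row": centre[0], "col": centre[1],
                      "intensity": sum(p[2] for p in chain)})
    return spots


# Made-up 7 x 7 image containing two spots.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 2, 5, 2, 0, 0, 0],
    [0, 5, 9, 5, 0, 0, 0],
    [0, 2, 5, 2, 0, 3, 0],
    [0, 0, 0, 0, 0, 7, 0],
    [0, 0, 0, 0, 0, 3, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(chain_spots(column_peaks(image, min_intensity=4)))
```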
The output of each method is a spot list. Differential protein expression
can also be analysed using 2D-PAGE. This can be used to look for proteins
that are induced or repressed by particular treatments or drugs, to look for
proteins associated with disease states, or to look at changes in protein
expression during development. Once protein expression data have been
recorded, they are built into a protein expression matrix. The results from 2D-
PAGE experiments are generally stored in 2D-PAGE databases. They can be
found at:
http://www.ucl.ac.uk/ich/services/labservices/mass_spectrometry/proteomics/technologies/2d_page
http://world-2dpage.expasy.org/swiss-2dpage/
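A protein expression matrix of the kind mentioned above can be pictured as a table with one row per spot and one column per sample. The following Python sketch is an assumption-laden illustration: the spot names, intensities, the log2 fold-change measure and the cutoff of 1.0 are all invented for the example and are not prescribed by the text.

```python
import math
from typing import Dict

ExpressionMatrix = Dict[str, Dict[str, float]]

# Hypothetical background-corrected spot intensities (rows: spots, columns: samples).
expression_matrix: ExpressionMatrix = {
    "spot_001": {"control": 120.0, "treated": 480.0},
    "spot_002": {"control": 300.0, "treated": 75.0},
    "spot_003": {"control": 200.0, "treated": 200.0},
}


def log2_fold_changes(matrix: ExpressionMatrix, reference: str, condition: str) -> Dict[str, float]:
    """Return log2(condition / reference) for every spot in the matrix."""
    return {
        spot: math.log2(values[condition] / values[reference])
        for spot, values in matrix.items()
    }


def classify(changes: Dict[str, float], cutoff: float = 1.0) -> Dict[str, str]:
    """Label each spot as induced, repressed or unchanged using an assumed cutoff."""
    labels = {}
    for spot, change in changes.items():
        if change >= cutoff:
            labels[spot] = "induced"
        elif change <= -cutoff:
            labels[spot] = "repressed"
        else:
            labels[spot] = "unchanged"
    return labels


changes = log2_fold_changes(expression_matrix, "control", "treated")
print(classify(changes))
```

With these toy numbers, spot_001 is labelled induced, spot_002 repressed and spot_003 unchanged.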
Approaches
One approach for discovering disease-related genes is the technique of
positional cloning. Here the chromosome linked to the disease in question is
identified by analyzing a population of people, some of whom exhibit the
disease. Once a link to a chromosomal region is established, a large part of the
chromosome in the vicinity of the region (locus) is sequenced, yielding several
megabases of DNA. Such a locus can contain many genes, only one of which is
likely to be involved in some way in the disease process.
Sequence searching and gene prediction techniques can be used to
increase the efficiency of gene identification in the locus, but ultimately several
genes will need to be expressed, and further experimentation (or validation)
will be required to confirm which gene is actually involved in the disease.
Although genes discovered in this way can be very illuminating from an
academic point of view, they do not necessarily represent good drug targets (or
points of therapeutic intervention).
Another approach to gene discovery, requiring much less sequencing
effort and relying more heavily on the powerful search capabilities of current
computer systems, examines the genes that are actually expressed in healthy
and diseased tissues. This allows a comparison to be performed between the
two states, and a process of reasoning applied to arrive at a potential drug
target in a more direct way. This process analyses the mRNAs, which are used
by the cellular machinery as a template for the construction of the proteins
themselves.
Gene Finding
In gene finding, the sequence elements considered generally include splice
sites, start and stop codons, branch points, promoters and terminators of
transcription, polyadenylation sites, ribosome binding sites, topoisomerase II
binding sites, topoisomerase I cleavage sites and various transcription factor
binding sites. Local sites like these are called ‘signals’ and are detected by
‘signal sensors’. In contrast, extended and variable-length sequences such as
exons and introns are called ‘contents’ and are detected by ‘content sensors’.
The most sophisticated signal sensors in use are neural nets. The most
commonly used content sensor is one that predicts coding regions.
Several systems that combine signal and content sensors have been
developed in an attempt to identify complete gene structure. Such systems are
capable of handling more complex interdependencies between gene features.
Genelaug is one of the earliest integrated gene finders; it uses dynamic
programming to combine scored regions and sites into a complete gene
prediction with a maximal total score.
A key feature of these dynamic programming models is a latent or hidden
variable associated with each nucleotide that represents the functional role
or position of that nucleotide. Such models are called hidden Markov models
(HMMs). The most popular statistical methods used for gene finding are
Markov models, as used in the GeneMark program. Some of the important
gene finding HMMs include Ecoparse, Expound, etc. A list of computational
gene finding databases is given in Table 4.3.
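The role of the hidden variable can be illustrated with a very small HMM decoded by dynamic programming (the Viterbi algorithm). The sketch below is a toy under stated assumptions: it has only two hidden states, and every probability in it is invented for illustration. Real gene-finding HMMs use many more states (exons, introns, splice sites and so on) with parameters estimated from annotated sequences.

```python
import math
from typing import Dict, List

STATES = ("coding", "noncoding")

# All probabilities below are assumed values chosen only for this example.
START: Dict[str, float] = {"coding": 0.5, "noncoding": 0.5}
TRANS: Dict[str, Dict[str, float]] = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT: Dict[str, Dict[str, float]] = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(sequence: str) -> List[str]:
    """Return the most probable hidden-state label for every nucleotide."""
    if not sequence:
        return []
    # scores[t][s] is the best log-probability of any path ending in state s at position t.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES}]
    back = []  # back[t][s] is the best predecessor of state s at position t + 1
    for base in sequence[1:]:
        row, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = scores[-1][best_prev] + math.log(TRANS[best_prev][s]) + math.log(EMIT[s][base])
            pointers[s] = best_prev
        scores.append(row)
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("AAAAAGGGGG"))
```

With these assumed parameters the five A residues are labelled noncoding and the five G residues coding, because one state change costs less than emitting five poorly matching nucleotides.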
In prokaryotes, it is still common to locate genes simply by looking for
open reading frames (ORFs). This is certainly not adequate for higher
eukaryotes. To distinguish between coding and noncoding regions in higher
eukaryotes, exon content sensors are used; these rely on statistical models of
the nucleotide frequencies and dependencies that are present in codon structure.
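The ORF approach used for prokaryotes can be sketched as a scan of the three forward reading frames for an ATG followed, in frame, by a stop codon. This is a minimal illustration rather than a production gene finder: the function name, the `min_codons` parameter and the test sequence are assumptions, and the reverse strand and alternative start codons are ignored.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_codons: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for forward-strand ORFs; end is exclusive.

    An ORF here runs from the first ATG in a frame to the next in-frame stop
    codon, and its length in codons includes both the start and the stop.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


dna = "CCATGAAATTTGGGTAACC"  # hypothetical sequence with one ORF in frame 2
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])
```

For the toy sequence the scan reports a single ORF, ATGAAATTTGGGTAA, in frame 2.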
Table 4.3: Computational gene finding databases
1. Genefinding datasets
a) Single genes        http://www.cbcb.umd.edu/research/genefinding.shtml
b) Annotated contigs   ftp://www-hgc.ilb.gov/pub/genesets/
                       http://igs-server.cors-mrs.fr/banbury/index/hyml
c) HMM-based gene finders
   Genie               http://www.fruitfly.org/seq_tools/genie.html
   Genscan             http://genes.mit.edu/GENSCANinfo.html,
                       http://genes.mit.edu/GENSCAN.html
   HMMgene             http://www.cbs.dtu.dk/services/HMMgene/
   GeneMark            http://opal.biology.gatech.edu/GeneMark/
   Pirate              http://www.cbcb.umd.edu/software/pirate/
d) Other gene finders
   AAT                 http://aatpackage.sourceforge.net/
   FGENEH              http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind
   GENEID              http://genome.crg.es/geneid.html
   GeneParser          http://beagle.colorado.edu/~eesnyder/geneparser.html
   Glimmer             http://www.cbcb.umd.edu/software/glimmer/
   Grail               http://grail.lsd.ornl.gov/grailexp/
   Procrusters         http://www-hto.usc.edu/software/procrusters
   GENE FINDING        http://www.molquest.com/molquest.phtml?group=index&topic=gfind
                       http://www.biologie.uni-hamburg.de/b-online/library/genomeweb/GenomeWeb/nuc-geneid.html
Levels of Gene Expression
The human genome is complex, consisting of about 3 billion base pairs (bp) of
DNA. Yet only 3% of the DNA is coding sequence (i.e. that part of the genome
that is transcribed and translated into protein). The rest of the genome consists
of areas necessary for compact storage of the chromosomes, replication at cell
division, the control of transcription, and so on. A large part of the work of
sequence analysis is centered on analyzing the products of the transcription/
translation machinery of the cell, i.e. protein sequences and structures.
Recently much industrial emphasis has been placed on the study of
mRNA; this is partly because a conceptual translation into protein sequence
can be generated readily, but the main reason is that mRNA molecules
represent the part of the genome that is expressed in a particular cell type at a
specific stage in its development.
Thus, in simple terms, we have three levels of genomic information: (i) the
chromosomal genome (genome) – the genetic information common to every cell
in the organism, (ii) the expressed genome (transcriptome) – the part of the
genome that is expressed in a cell at a specific stage in its development and (iii)
the proteome – the protein molecules that interact to give the cell its individual
character.
For each level, different analytical tools and interpretative skills are
required. Cells express a different range of genes at various stages during their
development and functioning. This characteristic range of gene expression is
the expression profile of the cell.
By capturing the cell’s expression profiles we can build up a picture of
what levels of gene expression may be normal or abnormal and what the
relative expression levels are between different genes within the same cell. This
process also provides a rapid approach to gene discovery that complements
full-blown genome sequencing projects.
Fig. 4.14 Content of the human genome (Based on IHGSC April 2003). Labelled segment: mobile DNA, 90 Mb.
• Repeat sequences (those which do not code for proteins) make up about
50% of the genome. (Repeat sequences are thought to maintain
chromosome structure and dynamics. By rearrangement they create
entirely new genes or modify and reshuffle existing genes.)
• About 40% of the human proteins showed similarity with fruit-fly or
worm proteins.
• Genes appear to be spread randomly throughout the genome with vast
expanses of noncoding DNA in between.
• Chromosome 1 (the largest human chromosome) has 2968 genes and
the Y chromosome (smallest human chromosome) has 231 genes.
• Candidate genes were identified for numerous diseases and disorders
including breast cancer, muscle disease, deafness and blindness.
• Single nucleotide polymorphisms can occur at 3 million locations.
• Every 2 kb contains a microsatellite (short tandem repeat).
(Anderson et al. have decoded the entire sequence of the human
mitochondrial genome. The circular, double-stranded genome contains 16569
base pairs and 37 genes. Among them, thirteen genes code for respiratory
complex proteins and the other 24 genes encode RNA molecules needed for the
expression of the mitochondrial genome.)
The ‘Periodic Table of Life’ developed from HGP will be beneficial to
everyone in many ways. James Watson and the joint NIH-DOE genome
advisory panel were against patenting the genes. They were of the view that
the public was paying for deciphering the genome and that the public must
decide what to do with the information.
Also, scientists should have access to all available gene data for the
advancement of the genome research program. In 1997, NIH established GenBank
and made the information accessible to everyone through the Internet. This
encouraged many to refrain from taking out patents on raw sequence data.
Molecular Medicine
• to develop better disease diagnosis
• to detect genetic predispositions to diseases
• to design drugs based on molecular information and individual genetic
profiles
• to improve gene therapy
Microbial Genomics
• to detect and treat pathogens speedily
• to develop new biofuels
• to protect citizens from biological and chemical warfare
• to clean up toxic waste safely and efficiently
Risk Assessment
• to evaluate the level of health risk in individuals who are exposed to
radiation or mutagens
• to detect pollutants and monitor environments
DNA Identification
• to identify criminals whose DNA may match evidence left at crime
scenes
• to exonerate persons wrongly accused of crimes
• to establish paternity and other family relationships
• to identify endangered and protected species
• to detect bacteria and other organisms that may pollute the environment
• to match donors with recipients in organ transplant programs
• to determine pedigree for seed or livestock breeds
STUDY QUESTIONS
1. What does the basic DNA sequencing reaction consist of?
2. Describe how DNA sequencing is done.
3. What is the role of open reading frame?
4. How do you determine the sequence of a clone?
5. What are expressed sequence tags?
6. How is an expressed sequence tag sequenced?
7. What are the methods of protein sequencing?
8. What is DNA microarray?
Today biological data are gathered and stored all over the world. In order to
interpret these data in a biologically meaningful way, we need special tools
and techniques. Databases and programs allow us to access the existing
information and to compare these data to find similarities and differences.
The various Internet based molecular biology databases have their own
unique navigation tools and data storage formats.
Given a sequence, or a fragment of a sequence, how do we find sequences in
the database that are similar to it? Given a protein structure, or a fragment
of one, how do we find protein structures in the database that are similar to
it? Given the sequence of a protein of unknown structure, how do we find
structures in the database that adopt similar 3D structures? Given a protein
structure, how do we find sequences in the database that correspond to similar
structures? Different data retrieval tools help to solve these problems.
Types of Databases
There are many different database types, depending both on the nature of the
information being stored and on the manner of data storage. Databases are
broadly classified into two types, namely, generalized databases and
specialized databases. Examples of generalized databases are DNA, protein,
carbohydrate or similar databases. Examples of specialized databases are
expressed sequence tags (EST), genome survey sequences (GSS), single
nucleotide polymorphism (SNP), sequence tagged sites (STS), or similar
databases. Other specialized databases include Kabat for immunology
proteins and Ligand for enzyme reaction ligands.
Generalized databases are again broadly classified into sequence
databases and structure databases. Sequence databases contain the
individual sequence records of either nucleotides or amino acids or proteins.
Structure databases contain the individual sequence records of biochemically
solved structures of macromolecules (e.g. protein 3D structure).
Two principal types of databases are: (i) relational and (ii) object-oriented.
The relational database organizes the data into tables made up of rows
giving specific items in the database and columns giving the features, or
attributes, of those items. The object-oriented database includes objects such
as genetic maps, genes, or proteins, which have an associated set of utilities
for analysis that help in identifying the relationships among these objects.
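The relational layout described above, with rows for items and columns for their attributes, can be tried out with Python's built-in sqlite3 module. This is only a sketch: the table name, the column names and the accession values are hypothetical and do not correspond to the schema of any real sequence database.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences ("
    "accession TEXT PRIMARY KEY, organism TEXT, molecule TEXT, length INTEGER)"
)

# Each row is one item; each column is one attribute of that item.
rows = [
    ("DEMO0001", "Homo sapiens", "DNA", 1200),
    ("DEMO0002", "Drosophila melanogaster", "DNA", 850),
    ("DEMO0003", "Homo sapiens", "protein", 310),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?)", rows)

human = conn.execute(
    "SELECT accession FROM sequences WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()
print(human)  # [('DEMO0001',), ('DEMO0003',)]
```

The query selects rows by the value of one attribute, which is the basic operation that a relational design makes simple.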
Classification
More specifically databases can be classified into three types based on the
complexity of the data stored: (i) Primary database, (ii) secondary database and
(iii) composite database.
Databases, Tools and their Uses 5.3
A primary database contains data in its original form, taken as such from
the source, e.g. GenBank for genome sequences and SWISS-PROT for protein
sequences. Such databases are also known as archival databanks. A secondary
database is a value-added database which contains specific annotated and
derived information from the primary database, e.g. SCOP, CATH, PROSITE.
These are the derived databanks that contain information collected from the
archival databanks after analysis of their contents. A composite database
amalgamates a variety of different primary database structures into one.
A redundant database is a database where more than one copy of each
sequence may be found. Databases constructed by using subsets of the
original database for reducing sampling bias are often referred to as non-
redundant databases.
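The simplest form of redundancy removal, dropping exact duplicate sequences, can be sketched as follows. The record identifiers and sequences are invented for illustration, and real non-redundant sets are usually built with more elaborate criteria (for example, similarity thresholds) that this toy does not attempt.

```python
from typing import Dict, List, Tuple


def non_redundant(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first identifier seen for each distinct sequence (case-insensitive)."""
    first_seen: Dict[str, str] = {}
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in first_seen:
            first_seen[key] = identifier
    return [(identifier, sequence) for sequence, identifier in first_seen.items()]


# Hypothetical records: seq1 and seq2 hold the same sequence in different case.
records = [("seq1", "MKTAYIAK"), ("seq2", "mktayiak"), ("seq3", "MKTAYLAK")]
print(non_redundant(records))  # [('seq1', 'MKTAYIAK'), ('seq3', 'MKTAYLAK')]
```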
Some databases that form specialized resources are called boutique
databases. They contain either species-specific sequence data or sequences
obtained through a particular technique (e.g. the Saccharomyces genome
database (SGD), the Drosophila genome database, etc.). In addition to
these, bibliographic databanks and databanks of websites are also
available on the net.
Database Entries
Database entries comprise new experimental results, and supplementary
information or annotations. Annotations include information about the
source of data and the methods used to determine them. They identify the
investigators responsible for the discovery and cite relevant publications.
They provide links to connected information in other databanks. Curators in
databanks base their annotations on the analysis of the sequence by computer
programs.
To make sure that all the fundamental data related to DNA and RNA are
freely available, scientific journals require deposition of new nucleotide
sequences in the database as a condition for publication of an article. Similar
conditions apply to amino acid sequences, and to nucleic acid and protein
structures. EMBL (European Molecular Biology Laboratory) nucleotide
sequence database submission procedures are available at
http://www.ebi.ac.uk/embl/submission.
Sequence Formats
Many databases and software applications are designed to work with
sequence data, and this requires a standard format for inputting nucleic acid
and protein sequence information. Three of the most common sequence
formats are NBRF/PIR (National Biomedical Research Foundation/ Protein
Information Resource), FASTA and GDE. Each of these formats has facilities
not only for representing the sequence itself, but also for inserting a unique
code to identify the sequence and for making comments which may include
for example the name of the sequence, the species from which it was derived,
and an accession number for GenBank or another appropriate database.
NBRF/PIR format begins with either >P1; for protein or >N1; for nucleic
acid. FASTA format begins with only ‘>’, and the GDE format begins with ‘%’.
A feature table (lines beginning FT) is a component of the annotation of an
entry that reports properties of specific regions, for instance coding sequences
(CDS). The feature table may indicate regions that perform or affect function,
that interact with other molecules, that affect replication, that are involved in
recombination, that are a repeated unit, that have secondary or tertiary
structure and that are revised or corrected.
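The leading characters described above are enough to tell the three formats apart, and a FASTA file is simple enough to parse by hand. The sketch below is a minimal illustration: the function names and the example records are assumptions, and only the >P1; and >N1; prefixes named in the text are recognised for NBRF/PIR.

```python
from typing import Dict


def detect_format(first_line: str) -> str:
    """Guess the sequence format from the first line of a file."""
    if first_line.startswith(">P1;") or first_line.startswith(">N1;"):
        return "NBRF/PIR"
    if first_line.startswith(">"):
        return "FASTA"
    if first_line.startswith("%"):
        return "GDE"
    return "unknown"


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA identifier (first word after '>') to its sequence."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


# Hypothetical two-record FASTA text; the second sequence spans two lines.
example = ">demo1 hypothetical protein\nMKTAYIAK\n>demo2\nACGT\nTTGA\n"
print(detect_format(example.splitlines()[0]))
print(parse_fasta(example))
```

The text after the identifier on the header line is the comment field, which this sketch discards.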
Database Record
A typical database record contains three sections:
(i) The header includes description of the sequence, its organism of origin,
allied literature references and cross links to related sequences in other
databases. Locus field contains a unique identifier summarizing the
function of the sequence in abbreviation and is followed by an
accession number in the Accession field. The organism field contains
the binomial of the organism and its full taxonomic classification.
(ii) The feature table contains a description of the features in the record like
coding sequences, exons, repeats, promoters, etc., for the nucleotide
sequences and domains, structure elements, binding sites, etc., for
protein sequences. If the feature table includes a coding DNA sequence
(CDS), links to the translated protein sequences are also mentioned in
the feature description.
(iii) The sequence (per se) is often more easily analyzed by the computer.
Types
There are three traditional types of database management systems:
hierarchical, relational and network. Hierarchical and network models are
based on traversing data links to process a database. The data are represented
by a hierarchical structure, and connections are defined and implemented by
physical address pointers within the records. They are typically used for large
mainframe systems.
EMBL
The EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl) is
available at the EMBL European Bioinformatics Institute, UK. It contains a
large and freely accessible collection of nucleotide sequences and
accompanying annotations. Webin is the preferred tool for submission.
EMBL contains sequences from direct author submissions and genome
sequencing groups, and from the scientific literature and patent applications.
The database is produced in collaboration with DDBJ and GenBank; each of
the participating groups collects a portion of the total sequence data reported
worldwide, and all new and updated entries are then exchanged between the
groups. The rate of growth of DNA databases has been following an
exponential trend, with a doubling time now estimated to be about 9-12
months.
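To see what a doubling time of 9-12 months implies, the growth factor after a given period can be computed as 2 raised to the power (elapsed time / doubling time). The short Python sketch below applies this formula; the three-year horizon is an arbitrary choice for illustration.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Factor by which an exponentially growing database expands in the given time."""
    return 2.0 ** (elapsed_months / doubling_months)


# Over three years (36 months), for the two ends of the quoted range.
for doubling in (12, 9):
    print(doubling, growth_factor(36, doubling))
```

With a 12-month doubling time the database grows 8-fold in three years; with a 9-month doubling time it grows 16-fold.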
The format of EMBL entries is consistent with SWISS-PROT format.
Information can be retrieved from EMBL using the SRS (Sequence Retrieval
System); this links the principal DNA and protein sequence databases with
motif, structure, mapping and other specialist databases and includes links to
the MEDLINE facility. EMBL may be searched with query sequences via
EMBL’s web interfaces to the BLAST and FASTA programs.
DDBJ
The DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp) contains
expressed sequence tags (EST) and genome sequence data.
Procedure
Open the internet browser and type the URL: www.ddbj.nig.ac.jp. Open the
drop-down menu at the search option and select protein or nucleotide. Type
the query in the TEXT box. Note down the details from the query page, which
will show the accession number, description of the query, total number of
base pairs, etc.
The DDBJ database is produced, maintained and distributed at the National
Institute of Genetics; sequences may be submitted to it from all corners of the
world by means of a web-based data-submission tool. The web is also used to
provide standard search tools such as FASTA and BLAST.
GenBank
GenBank from NCBI incorporates sequences from publicly available sources,
primarily from direct author submissions and large-scale sequencing projects.
Information can be retrieved from GenBank using the Entrez integrated
retrieval system. GenBank may be searched with user query sequences by
means of the NCBI’s Web interface to the BLAST suite of programs.
The increasing size of the database, coupled with the diversity of the data
sources available, has necessitated splitting the GenBank database into 17
smaller discrete divisions, each with a 3-letter code (Table 5.1).
Table 5.1: The 17 subdivisions of GenBank database
GSDB
The Genome Sequence Data Base (GSDB) is produced by the National Centre
for Genome Resources at Santa Fe, New Mexico. GSDB creates, maintains,
and distributes a complete collection of DNA sequences and related
information to meet the needs of major genome sequencing laboratories. The
format of GSDB entries is consistent with that of GenBank. The database is
accessible either via the web, or using relational database client-server
facilities.
The main sequence databases have a number of subsidiaries for the
storage of particular types of sequence data. dbEST is a division of GenBank
which is used to store expressed sequence tags (ESTs). dbGSS is used to store
single-pass genomic survey sequences (GSSs); dbSTS is used to store sequence
tagged sites (STSs) and HTG (high-throughput genomic) is used to store
unfinished genomic sequence data. OMIM (Online Mendelian Inheritance in
Man) is a comprehensive database of human genes and genetic disorders
maintained by NCBI.
Ensembl
Ensembl (http://asia.ensembl.org/index.html) is intended to be the universal
information source for the human genome. The goals are to collect and
annotate all available information about human DNA sequences, link it to the
master genome sequence and make it accessible to many scientists who will
approach the data with many different points of view and requirements. To
achieve this, in addition to collecting and organizing the information, very
serious effort has gone into developing computational infrastructure. The
program used to generate this resource, eMOTIF, is based on the generation
of consensus expressions from conserved regions of sequence alignments.
Ensembl is a joint project of the European Bioinformatics Institute and
the Sanger Centre. It is organized as an open project; it encourages outside
contributions. Data collected in Ensembl include genes, SNPs, repeats and
homologies. Genes may either be known experimentally, or deduced from the
sequence. Because the experimental support for annotation of the human
genome is so variable, Ensembl presents the supporting evidence for
identification of every gene. Very extensive linking to other databases
containing related information such as OMIM or expression databases is also
possible.
PIR Databases
The PIR is an effective combination of a carefully curated database, information
retrieval access software and a workbench for investigations of sequences. The
PIR also produces the Integrated Environment for Sequence Analysis (IESA).
Its functionality includes browsing, searching and similarity analysis and
links to other databases.
The PIR maintains several databases about proteins:
(a) PIR-PSD: the main protein sequence database.
(b) iProClass: classification of proteins according to structure and function.
(c) ASDB: annotation and similarity database; each entry is linked to a list
of similar sequences.
(d) PIR-NREF: a comprehensive non-redundant collection of over 800,000
protein sequences merged from all available sources.
(e) NRL3D: a database of sequences and annotations of proteins of known
structure deposited in the Protein Data Bank.
(f) ALN: a database of protein sequence alignments.
(g) RESID: a database of covalent protein structure modifications.
The PIR database is split into four distinct sections, designated as PIR1,
PIR2, PIR3 and PIR4. They differ in terms of the quality of data and levels of
annotation provided. PIR1 includes fully classified and annotated entries.
PIR2 contains preliminary entries, which have not been fully reviewed and
which may contain redundancy. PIR3 includes unverified entries, which
have not been reviewed. PIR4 entries fall into one of the following four
categories: (i) conceptual translations of artefactual sequences, (ii) conceptual
translations of sequences that are not transcribed or translated, (iii) protein
sequences or conceptual translations that are extensively genetically
engineered, and (iv) sequences that are not genetically encoded and not
produced on ribosomes. Programs are provided for data retrieval and
sequence searching via the NBRF-PIR database Web Page.
SWISS-PROT
The Swiss Institute of Bioinformatics (SIB) collaborates with the EMBL Data
Library to provide an annotated database of amino acid sequences called
SWISS-PROT. SWISS-PROT is a curated protein sequence database which
strives to provide high-level annotations, including descriptions of the
function of the protein and of the structure of its domains, its post-
translational modifications, variants and so on with a minimal level of
redundancy and high level of integration with other databases. SWISS-PROT
is interlinked to many other resources. The structure of the database and the
quality of its annotations set SWISS-PROT apart from other protein
sequence resources and have made it the database of choice for most research
purposes.
Entries start with an identification (ID) line and finish with a //
terminator. ID codes in SWISS-PROT are designed to be informative and
people-friendly; they take the form PROTEIN_SOURCE, where the
PROTEIN part of the code is an acronym that denotes the type of protein,
and SOURCE indicates the organism name. Since ID codes can sometimes
change, an additional identifier, an accession number, is also provided, which
will remain static between database releases. The accession number is
provided on the AC line, which is computer readable. If several numbers
appear on the same AC line, the first or primary accession number is the
most current.
The DT lines provide information about the date of entry of the
sequence to the database, and details of when it was last modified. The DE
(description) line informs us of the name by which the protein is known. The
following lines give the gene name (GN), the organism species (OS) and
organism classification (OC) within the biological kingdom. The next section
of the database provides a list of supporting references; these can be from the
literature, unpublished information submitted directly from sequencing
projects, data from structural or mutagenesis studies and so on.
Following the references, the comment (CC) lines are found. These are
divided into themes, which tell us about the function of the protein, its post-
translational modifications, its tissue specificity, cellular location and so on.
The CC lines also point out any known similarity or relationship to particular
protein families. Database cross-reference (DR) lines follow the comment field.
These provide links to other biomolecular databases, including primary
sources, secondary databases, specialist databases, etc.
Immediately after the DR lines a list of relevant keywords (KW) is given,
and then a number of FT lines can be found. The FT lines highlight regions of
interest in the sequence, including local secondary structure (such as
transmembrane domains), ligand binding sites, post-translational
modifications and so on. Each line includes a key, the location in the
sequence of the feature, and a comment, which might, for example, indicate
the level of confidence of a particular annotation.
The final section of the database entry contains the sequence itself on
the SQ lines. Only single letter amino acid code is used. The structure of
SWISS-PROT makes computational access to the different information fields
both straightforward and efficient.
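Because every line starts with a two-letter code, an entry of this kind can be split into fields with very little code. The sketch below is an illustration under stated assumptions: the entry is invented (its ID, accession numbers and sequence are hypothetical), the layout is simplified, and the parser assumes that sequence data lines begin with blanks and follow the SQ line up to the // terminator.

```python
from typing import Dict, List

# Hypothetical, simplified entry written for this example only.
ENTRY = "\n".join([
    "ID   DEMO_HUMAN",
    "AC   P_DEMO1; P_DEMO2;",
    "DE   Hypothetical demonstration protein.",
    "OS   Homo sapiens.",
    "KW   Example.",
    "SQ   SEQUENCE   12 AA;",
    "     MKTAYIAKQR QI",
    "//",
])


def parse_entry(text: str) -> Dict[str, object]:
    """Split an entry into fields keyed by line code and collect the sequence."""
    fields: Dict[str, List[str]] = {}
    residues: List[str] = []
    for line in text.splitlines():
        if line.startswith("//"):  # terminator: end of the entry
            break
        code, value = line[:2], line[5:].strip()
        if code.strip() == "":  # lines without a code carry sequence data
            residues.append(value.replace(" ", ""))
        else:
            fields.setdefault(code, []).append(value)
    entry_name = fields["ID"][0].split()[0]
    protein, _, source = entry_name.partition("_")  # PROTEIN_SOURCE
    accessions = [a.strip() for a in " ".join(fields.get("AC", [])).split(";") if a.strip()]
    return {
        "protein": protein,
        "source": source,
        "accessions": accessions,
        "primary_accession": accessions[0] if accessions else None,
        "description": " ".join(fields.get("DE", [])),
        "sequence": "".join(residues),
    }


entry = parse_entry(ENTRY)
print(entry["protein"], entry["source"], entry["primary_accession"], entry["sequence"])
```

The first accession number on the AC line is taken as the primary one, following the convention described earlier in this section.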
TrEMBL
TrEMBL (translated EMBL) was designed in 1996 as a computer-annotated
supplement to SWISS-PROT. The database benefits from the SWISS-PROT
format, and contains translations of all coding sequences in EMBL. TrEMBL
has two main sections, designated as SP-TrEMBL and REM-TrEMBL; SP-
TrEMBL (SWISS-PROT TrEMBL) contains entries that will eventually be
incorporated into SWISS-PROT, but that have not yet been manually
annotated; REM-TrEMBL contains sequences that are not destined to be
included in SWISS-PROT; these include immunoglobulins and T-cell
receptors, fragments of fewer than eight amino acids, synthetic sequences,
patented sequences, and codon translations that do not encode real proteins.
TrEMBL was designed to allow very rapid access to sequence data from
the genome projects, without having to compromise on the quality of SWISS-
PROT itself by incorporating sequences with insufficient analysis and
annotation.
PIR is the most comprehensive resource, but the quality of its
annotations is still relatively poor. SWISS-PROT is a highly structured
database that provides excellent annotations, but its sequence coverage is poor
compared to PIR.
NRL-3D
The NRL-3D database is produced by PIR from sequences extracted from the
Protein Data Bank (PDB). The titles and biological sources of the entries
conform to the nomenclature standards used in the PIR. Bibliographic
references and MEDLINE cross references are included, together with
secondary structure, active site, binding site and modified site annotations,
and details of experimental methods, resolution, R-factor, etc. Keywords are
also provided.
NRL-3D is a valuable resource, as it makes the sequence information in
the PDB available both for keyword interrogation and for similarity searches.
The database may be searched using the ATLAS retrieval system, a multi-
database information retrieval program specifically designed to access
macromolecular sequence databases.
5.4 STRUCTURE DATABASES
Structure Databases archive, annotate and distribute sets of atomic
coordinates. They store a collection of 3 dimensional biological
macromolecular structures of proteins and nucleic acids. The best-established
database for protein structures is the Protein Data Bank (PDB). The website is
http://www.rcsb.org/pdb/home/home.do
This is the single world-wide repository of structural data and is
maintained by the Research Collaboratory for Structural Bioinformatics (RCSB)
at Rutgers University, New Jersey, USA. (The associated Nucleic Acid
Database (NDB) is also maintained here.) An equivalent European database
is the Macromolecular Structure Database (MSD) maintained by the
European Bioinformatics Institute. The website for MSD is
http://www.ebi.ac.uk/Databases/structure.html. RCSB and MSD databases
contain the same data.
The PDB entry normally contains the following information: the name
of the protein, the species it comes from, who solved the structure, references
to publications describing the structure determination, experimental details
about the structure determination, the amino acid sequence, any additional
molecules and atomic coordinates. MSD includes a search tool called OCA,
which is a browser database for protein structure and function, integrating
information from numerous databanks. Another useful information source
available at the EBI is the database of Probable Quaternary Structures (PQS)
of biologically active forms of proteins.
Structural Classifications
Many proteins share structural similarities, reflecting, in some cases,
common evolutionary origins. The evolutionary process involves
substitutions, insertions and deletions in amino acid sequences. For distantly
related proteins, such changes can be extensive, yielding folds in which the
numbers and orientations of secondary structures vary considerably.
However, where, for example, the functions of proteins are conserved, the
structural environments of critical active site residues are also conserved. With
a view to better understanding sequence-structure relationships, structure
classification schemes have been developed.
Several websites offer hierarchical classifications of the entire PDB
according to the folding patterns of the proteins.
(i) SCOP : Structural classification of Proteins
(ii) CATH : Class/ Architecture/ Topology/ Homology
(iii) DALI : Based on extraction of similar structure from distance matrices.
(iv) CE : a database of structural alignments.
SCOP Database
The SCOP database describes structural and evolutionary relationships
between proteins of known structure. Since current automatic structure
comparison tools cannot reliably identify all such relationships, SCOP has
been designed using a combination of manual inspection and automated
methods. Proteins are classified in a hierarchical fashion to reflect their
structural and evolutionary relatedness. Within the hierarchy there are many
levels, but principally these describe the family, superfamily and fold.
Proteins are clustered into families with clear evolutionary relationships
if they have sequence identities of more than 30%. Proteins are placed in
superfamilies when, in spite of low sequence identity, their structural and
functional characteristics suggest a common evolutionary origin. Proteins are
suggested to have a common fold if they have the same major secondary
structures in the same arrangement and with the same topology, whether or
not they have a common evolutionary origin. SCOP is accessible for keyword
interrogation via the MRC Laboratory Web Server.
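The 30% sequence identity criterion for family membership can be illustrated with a short calculation on two sequences that have already been aligned. This is only a sketch: the function names are hypothetical, the example sequences are invented, and the choice of denominator (columns in which neither sequence has a gap) is an assumption, since several definitions of percent identity are in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are skipped (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def exceeds_family_threshold(aligned_a: str, aligned_b: str,
                             threshold: float = 30.0) -> bool:
    """True if the pair exceeds the identity threshold quoted in the text."""
    return percent_identity(aligned_a, aligned_b) > threshold


if __name__ == "__main__":
    a = "MKT-AYIAK"
    b = "MKTQAY-SK"
    print(round(percent_identity(a, b), 1), exceeds_family_threshold(a, b))
```

Sequence identity alone does not reproduce the SCOP classification, which also relies on manual inspection of structure and function.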
CATH Database
The CATH (class, architecture, topology, homology and sequence) database is
largely derived using automatic methods, but manual inspection is necessary
where automatic methods fail. Different categories within the classification
are identified by means of both unique numbers and descriptive names.
There are five levels (class, architecture, topology, homology and sequence)
within the hierarchy.
Class is derived from gross secondary structure content and packing.
Architecture describes the gross arrangement of secondary structures.
Topology gives a description that encompasses both the overall shape and the
connectivity of secondary structures. Homology groups domains that share
more than 35% sequence identity and are thought to share a common ancestor.
Sequence provides the final level within the hierarchy whereby structures
within homology groups are further clustered on the basis of sequence
identity. CATH is accessible for keyword interrogation via UCL’s Biomolecular
Structure and Modeling Unit Web server.
CATH database is a protein structure database residing at University
College, London. Proteins are classified first into hierarchical levels by class,
similar to the SCOP classification except that α/β and α + β proteins are
considered to be in one class. Instead of a fourth class for α + β proteins, the
fourth class of CATH comprises proteins with few secondary structures.
Following class, proteins are classified by architecture, fold, superfamily and
family.
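Because each category in CATH is identified by a number, a position in the hierarchy can be written as a dotted code with one number per level. The sketch below splits such a code into its levels. The function name, the level labels and the class names in the lookup table are assumptions chosen to follow the description above, and the code 3.40.50.720 is used only to illustrate the numbering.

```python
# Class names follow the four classes described in the text (assumed wording).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

LEVELS = ("class", "architecture", "topology", "homology")


def parse_cath_code(code: str) -> dict:
    """Split a dotted CATH code into the identifier at each level."""
    parts = [int(p) for p in code.split(".")]
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError("expected between one and four dotted numbers")
    result = {}
    for depth, level in enumerate(LEVELS[:len(parts)], start=1):
        # The identifier at each level is the prefix of the full code.
        result[level] = ".".join(str(n) for n in parts[:depth])
    return result


if __name__ == "__main__":
    levels = parse_cath_code("3.40.50.720")
    print(levels)
    print(CATH_CLASSES[int(levels["class"])])
```

Two domains share a level of the hierarchy exactly when their codes share the corresponding prefix.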
Composite Databases
A composite database is a database that amalgamates a variety of different
primary sources. Composite databases render sequence searching much
more efficient, because they obviate the need to interrogate multiple
resources. The interrogation process is streamlined still further if the composite
has been designed to be non-redundant, as this means that the same sequence
need not be searched more than once.
Different strategies can be used to create composite resources. The final
product depends on the chosen data sources and the criteria used to merge
them. The choice of different sources and the application of different
redundancy criteria have led to the emergence of different composites, each
of which has its own particular format. The main composite databases are
NRDB, OWL, MIPSX and SWISS-PROT + TrEMBL.
NRDB (Non-Redundant Database) is comprehensive and contains up-
to-date information. OWL is a non-redundant protein database with a
priority with regard to the level of annotation and sequence validation.
MIPSX database contains information of only unique copies. SWISS-PROT +
TrEMBL provide a resource that is both comprehensive and minimally
redundant.
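The idea of a non-redundant composite can be sketched in a few lines. The example below merges entries from several sources and keeps each distinct sequence only once, remembering where it came from. The source names, entry identifiers and sequences are invented, and exact sequence identity is assumed as the redundancy criterion; real composites apply their own, more elaborate criteria.

```python
from typing import Dict, List


def build_composite(sources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Merge several sources into a map: sequence -> list of 'source:id' labels."""
    composite: Dict[str, List[str]] = {}
    for source_name, entries in sources.items():
        for entry_id, sequence in entries.items():
            key = sequence.upper()  # treat case differences as the same sequence
            composite.setdefault(key, []).append(f"{source_name}:{entry_id}")
    return composite


if __name__ == "__main__":
    sources = {
        "SourceA": {"a1": "MKTAYIAK", "a2": "MSSHEG"},
        "SourceB": {"b1": "mktayiak", "b2": "MLLPQR"},
    }
    composite = build_composite(sources)
    # Four entries collapse to three distinct sequences to search.
    for sequence, labels in composite.items():
        print(sequence, labels)
```

A search against the composite then compares the query with each distinct sequence once, while the labels preserve the link back to every original entry.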
NDB Database
The Nucleic Acid Database (NDB) (http://ndbserver.rutgers.edu/)
assembles and distributes structural information about nucleic acids. In
addition to information regarding nucleic acids it maintains a DNA-binding
protein database. Available information includes coordinates and structure
factors, an archive of nucleic acid standards and an atlas of nucleic
acid-containing structures that highlights special aspects of each structure in the
NDB. It also maintains information regarding intrinsic correlations between
structural parameters.
CSD Database
The Cambridge Structural Database (CSD) contains comprehensive structural data
for organic and organometallic compounds studied by X-ray and neutron
diffraction. It contains 3D atomic coordinate information as well as associated
bibliographic, chemical and crystallographic data. It is equipped with
graphical, search, retrieval, data manipulation and visualization software.
BMRB Database
BioMagResBank (BMRB) contains data from NMR studies of proteins,
peptides and nucleic acids (www.bmrb.wisc.edu). It stores the
data used to derive the NMR restraints and the coordinates deposited
in the PDB. It contains NMR parameters that are measures of flexibility and
dynamics. It also contains data on measured NMR parameters such as
chemical shifts, coupling constants, dipolar couplings, T1 values, T2 values,
heteronuclear NOE values, S² (order parameters), hydrogen exchange rates
and hydrogen exchange protection factors.
Other Databases
Molecular Modeling Database (MMDB) is a database containing
experimentally determined structures extracted from PDB. Its organization is
based on the concept of neighbors, that is, links to sequence and structural
neighbors. MMDB categorizes proteins of known structure in the
Brookhaven PDB into structurally related groups by the VAST (Vector
Alignment Search Tool) structural alignment program. VAST aligns three
dimensional structures based on a search for similar arrangements of
secondary structural elements. MMDB provides a method for rapidly
identifying PDB structures that are statistically out of the ordinary.
Conserved Domain Database (CDD) is a database of conserved domain
alignments with links to three-dimensional structures of domains. The
Chemico-physical AMino acidic Parameter databank (CHAMP) is an amino acid
parameter databank containing 32 different series of physico-chemical
parameters of amino acids. It is integrated with FAST. The Enzyme-Reaction
Database links a chemical structure to amino acid sequences of enzymes that
recognize the chemical structure as their ligand. The chemical structures and
chemical names are registered in the chemical-structure database on the
MACCS system.
The enzymes are registered in the database with NBRF-PIR entry codes.
The enzymes’ sequences in the database are divided into clusters and a
conserved sequence is extracted from each cluster using multiple sequence
alignment. These conserved sequences are used to construct motifs.
Thermodynamic Database for Proteins and Mutants (ProTherm) is a
collection of numerical data for studying the relationship between structure,
stability and function. It contains thermodynamic parameters such as
unfolding Gibbs free energy change, enthalpy change, heat capacity change,
transition temperature, etc. It also contains information about activity,
secondary structure, surface accessibility, measuring methods and
experimental conditions such as pH, temperature, and buffer ion and protein
concentration. ProTherm is linked with PIR, SWISS-PROT, PDB, PMD and
PubMed.
The SARF (spatial arrangement of backbone fragments) database also
provides a protein database categorized on the basis of structural similarity.
Secondary Databases
Primary database search tools are effective for identifying sequence
similarities, but analysis of output is sometimes difficult and cannot always
answer some of the more sophisticated questions of sequence analysis. Hence
secondary database search tools are used. Depending on the analysis
method applied to the secondary databases, relationships may be elucidated in
considerable detail, including superfamily, family, subfamily, and species-
specific sequence levels.
The principle behind the development of secondary databases is that
within multiple alignments, there are many conserved motifs that reflect
shared structural or functional characteristics of the constituent sequences.
The simplest approach to pattern recognition is to characterize a family by
means of a single conserved motif, and to reduce the sequence data within
the motif to a consensus or regular expression pattern. Regular expressions
are the basis of the PROSITE database.
Many secondary databases, which contain the fruits of analysis of the
sequences in the primary sources, are also available. Several of them,
such as PROSITE, Profiles, PRINTS, Pfam, BLOCKS and IDENTIFY,
use SWISS-PROT as their primary source. PROSITE stores regular expressions
(patterns); Profiles stores weighted matrices (profiles); PRINTS stores aligned
motifs (fingerprints); Pfam stores hidden Markov models (HMMs); BLOCKS
stores aligned motifs (blocks); and IDENTIFY stores fuzzy regular
expressions (patterns).
The type of information stored in each of the secondary databases is
different. Yet these resources have arisen from a common principle; namely,
that homologous sequences may be gathered together in multiple alignments,
within which are conserved regions that show little or no variation between
the constituent sequences. These conserved regions, or motifs, usually reflect
some vital biological role (i.e. are somehow crucial to the structure or
function of the protein).
One of the aims of sequence analysis is to design computational methods
that help to assign functional and structural information to uncharacterized
sequences; this is achieved by means of primary database searches, the goal of
which is to identify relationships with already known sequences. Within a
database, the challenge is to establish which sequences are related (true
positives) and which are unrelated (true negatives). To improve diagnostic
performance one has to capture most of the true-positive family members and
include few or no false positives.
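The notions of true and false positives can be made concrete by comparing the set of sequences returned by a search with the set of known family members. The sketch below counts each category and derives two common summary figures. The function name, the sequence identifiers and the choice of sensitivity and precision as summary measures are assumptions made for illustration.

```python
def diagnostic_performance(hits: set, family: set, database: set) -> dict:
    """Count true/false positives and negatives for one database search."""
    tp = len(hits & family)             # family members that were found
    fp = len(hits - family)             # unrelated sequences that were found
    fn = len(family - hits)             # family members that were missed
    tn = len(database - hits - family)  # unrelated sequences correctly ignored
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


if __name__ == "__main__":
    database = {f"s{i}" for i in range(1, 11)}  # ten invented sequence IDs
    family = {"s1", "s2", "s3", "s4"}           # the true family members
    hits = {"s1", "s2", "s3", "s7"}             # what the search returned
    print(diagnostic_performance(hits, family, database))
```

A good discriminator pushes both figures towards 1: high sensitivity means few family members are missed, and high precision means few false positives are reported.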
PROSITE Database
PROSITE was the first secondary database to be developed. The rationale
behind its development was that protein families could simply and effectively
be characterized by the single most conserved motif observable in a multiple
alignment of known homologues, such motifs usually encoding key biological
functions (e.g. enzyme active sites, ligand or metal binding sites, etc.).
Searching such a database should, in principle, help to determine to which
family of proteins a new sequence might belong, or which domain or
functional site it might contain.
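To show how a single-motif pattern can be used in a search, the sketch below converts a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. It handles only the basic elements of the syntax (x for any residue, square brackets for allowed residues, curly brackets for excluded residues, and repeat counts in parentheses). The function name and the test sequence are invented; the pattern N-{P}-[ST]-{P} is the widely cited N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a regular expression."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")  # '<' anchors to the N-terminus
        at_end = element.endswith(">")      # '>' anchors to the C-terminus
        element = element.strip("<>")
        repeat = ""
        if element.endswith(")"):           # repeat count: x(3) or x(2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                      # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # any residue except those listed
        else:
            core = element                  # a single residue or an [ABC] set
        pieces.append(("^" if at_start else "") + core + repeat
                      + ("$" if at_end else ""))
    return "".join(pieces)


def find_motif(pattern: str, sequence: str) -> list:
    """Return (position, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_motif("N-{P}-[ST]-{P}.", "MKNGSALNPTQ"))
```

In the invented sequence the first N is followed by G, S, A and matches, while the second N is followed by P and is rejected. This strictness is exactly why a single pattern may miss distant homologues.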
PRINTS Database
Most protein families are characterized not by one, but by several conserved
motifs. It therefore makes sense to use many, or all, of these to build diagnostic
signatures of family membership. This is the principle behind the development
of the PRINTS fingerprint database. Fingerprints inherently offer improved
diagnostic reliability over single-motif methods by virtue of the mutual context
provided by motif neighbours; in other words, if a query sequence fails to
match all the motifs in a given fingerprint, the pattern of matches formed by the
remaining motifs still allows the user to make a reasonably confident
diagnosis.
BLOCKS Database
A multiple-motif database, called BLOCKS, was created by automatically
detecting the most highly conserved regions of each protein family.
The limitations of regular expressions in identifying distant homologues
led to the creation of a compendium of profiles. The variable regions between
conserved motifs also contain valuable sequence information. Here the
complete sequence alignment effectively becomes the discriminator.
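A weighted matrix of the kind stored as a profile can be derived from a block of aligned sequences by counting residues column by column. The sketch below builds such a matrix of log-odds scores and slides it along a sequence to find the best-scoring window. It is a simplified illustration under stated assumptions: the block is ungapped, the background frequency is uniform over the 20 amino acids, a pseudocount of 1 is added, and all names and sequences are invented. Published profile methods use more refined weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in the block must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (an assumption)
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(width):
        counts = Counter(seq[col] for seq in block)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, window):
    """Sum the column scores for one window of the same width as the profile."""
    return sum(column[aa] for column, aa in zip(profile, window))


def best_match(profile, sequence):
    """Return (score, start position) of the best-scoring window."""
    width = len(profile)
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    block = ["GKSTL", "GKTTL", "GKSSL", "GRSTL"]
    profile = build_profile(block)
    print(best_match(profile, "MAAGKSTLVVE"))
```

Unlike a regular expression, the matrix gives every window a graded score, so a sequence that departs from the motif at one position is penalized rather than rejected outright.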
HMMs
An alternative to the use of profiles is to encode alignments in the form of
Hidden Markov Models (HMMs). These are statistically based mathematical
treatments, consisting of linear chains of match, delete or insert states that
attempt to encode the sequence conservation within aligned families. A
collection of HMMs for a range of protein domains is provided by the Pfam
database.
PubMed
PubMed is maintained by the National Library of Medicine (US) and includes
a bibliographic database MEDLINE as well as links to selective full text
articles on sites maintained by journal publishers. It offers abstracts of
scientific articles and is integrated with other information retrieval tools of the
National Center for Biotechnology Information. Scientific journals place their
tables of contents and, in some cases, entire issues on web sites. PubMed
records are relational in nature and query results include links to
GenBank, PDB, etc. PubMed databases can be searched at the following
websites:
http://www.ncbi.nlm.nih.gov/PubMed/
http://www.pubmedcentral.nih.gov
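Besides the web pages listed above, PubMed can be queried from a program. The sketch below only builds the query address and does not contact any server. It assumes the NCBI E-utilities `esearch` service, which is not described in this chapter, with its `db`, `term`, `retmax` and `retmode` parameters; the function name and the search term are illustrative.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service (an assumption here).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term: str, retmax: int = 20) -> str:
    """Build a search URL for PubMed; the request itself is left to the caller."""
    params = {
        "db": "pubmed",     # database to search
        "term": term,       # the query, in the usual PubMed search syntax
        "retmax": retmax,   # maximum number of record identifiers to return
        "retmode": "json",  # ask for a JSON reply
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(pubmed_search_url("protein structure[Title]", retmax=5))
```

The resulting address can be opened in a browser or fetched with any HTTP client; the reply lists the identifiers of the matching records.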
AGRICOLA
AGRICOLA stands for AGRICultural OnLine Access. It is a bibliographic
database of citations to the agricultural literature created by the National
Agricultural Library and its cooperators. It includes publications and
resources from all the disciplines related to agriculture, such as veterinary
science, plant science, forestry, aquaculture and fisheries, food and human
nutrition, earth and environmental science. The database can be searched at
the following website: http://www.nal.usda.gov/ag98/
Virtual Library
A virtual library on the net provides access to web sites that are a storehouse of
information. It contains a collection of links to various online journals and
bibliographic databases. Virtual libraries can be classified into various groups
with links to online journals, bibliographic databases, institute library
access, forums and associations, tutorial sites, educational sites, grants and
funding resources, government and regulatory bodies, etc. The best-known
virtual library site on the web is: http://www.vlib.org
There are also further collections of virtual libraries on various topics
such as microbiology, biochemistry, etc. Many publishers have their own
online journals available on sites (e.g. Nature: www.nature.com). These sites
provide free access to the table of contents and abstracts.
GCG Package
The most widely known, commercially available sequence analysis software is
the GCG package (Oxford Molecular Group). This was developed by the Genetics
Computer Group at Wisconsin (575 Science Drive, Madison, Wisconsin, USA
53711) primarily as a set of analysis tools for nucleic acid sequences, but
which in time included additional facilities for protein sequence analysis.
Within GCG, many of the frequently used sequence databases can be
accessed (e.g. GenBank, EMBL, PIR and SWISS-PROT) as can a number of
motif and specialist databases (such as PROSITE; TFD, the transcription factor
database; and REBASE, the restriction enzyme database). A particular strength
of the system is that it can also be relatively easily customized to accept
additional, user-specific databases. Within the suite, EMBL and GenBank are
split into different sections, allowing users to minimize search time by
directing queries only to relevant parts of the databases. Thus, for example,
sequences in GenBank and EMBL may be searched either collectively or
separately, or by defined taxonomic categories (e.g. viral, bacterial, rodent, etc.).
The sequence databases have their own distinct formats, so these must be
converted to the GCG format for use with its programs. Likewise, all data files
imported to the suite for analysis must adhere to the GCG format. The facilities
include tools for pairwise similarity searching, multiple sequence alignment,
evolutionary analysis, motif and profile searching, RNA secondary structure
prediction, hydropathy and antigenicity plots, translation, sequence assembly,
restriction site mapping and so on.
EGCG Package
EGCG or Extended GCG started at EMBL in Heidelberg as a collection of
programs to support EMBL’s research activities. There are more than 70
programs in EGCG, covering themes such as fragment assembly, mapping,
database searching, multiple sequence analysis, pattern recognition, nucleotide
and protein sequence analysis, evolutionary analysis, and so on.
Staden Package
The Staden Package is a set of tools for DNA and protein sequence analysis. It
does not provide databases, but the software works with the EMBL database
and other databases in a similar format. The package has a windowing
interface for UNIX workstations. Amongst its range of options, the suite
provides utilities to define and to search for patterns of motifs in proteins and
nucleic acids (for example, specific individual routines allow searching for
mRNA splice junctions, E. coli promoters, tRNA genes, etc. and users may
define equally complex patterns of their own). A particular strength of the
Staden Package lies in its support for DNA sequence assembly.
It provides methods for all the pre-processing required for data from
fluorescence-based sequencing instruments, including trace viewing (TREV),
quality clipping (PREGAP4) and vector removal (PREGAP4, VECTOR_CLIP);
a range of assembly engines; and powerful contig editing and finishing
algorithms (GAP4). A method for detecting point mutations is also included
(TRACE_DIFF, GAP4). For analysis of finished DNA sequences, the package
includes NIP4, and for comparing DNA or protein sequences, SIP4; these
routines also provide an interface to the sequence libraries. The new interactive
programs TREV, PREGAP4, GAP4, NIP4 and SIP4 have graphical user-interfaces,
but the package also contains a large number of older, but still useful,
programs that are text-based.
Lasergene Package
Lasergene is a PC-based package that provides facilities for coding analysis,
pattern and site matching, and RNA/DNA structure and composition
analysis; restriction site analysis; PCR primer and probe design; sequence
editing; sequence assembly and contig management; multiple and pairwise
sequence alignment (including dotplots); protein secondary structure
prediction and hydropathy analysis; helical wheel and net creation; and
database searching. Lasergene is available for Windows or Macintosh, for
single users or for networked-PC environments.
There are numerous other packages available, which tend to concentrate
on particular areas of sequence analysis of DNA. For example:
Sequencher Package
Sequencher is a sequence assembly package for the Macintosh, used by many
laboratories engaged in large-scale sequencing efforts. The package takes raw
chromatogram data and converts it into contig assemblies; other functions
include restriction site or ORF analysis, heterozygote analysis for mutation
studies, vector and transposon screening, motif analysis, silent mutation tools,
sequence quality estimation, and visual marking of edits to ensure data
integrity.
MacVector Package
MacVector is a molecular biology system that exploits the Macintosh user
interface to create an easy-to-use environment for manipulation and analysis
of DNA and protein sequence data. The package implements the five BLAST
search functions, and includes ClustalW for sequence alignment, and an icon-
managed sequence editor that is integrated with the program’s molecular
biology functions (e.g. translation, restriction analysis, primer and probe
analysis, protein structure prediction, and motif analysis). Facilities are also
provided to compute predicted sequence-based melting curves for DNA and
RNA structures.
Intranet packages: The future for commercial solutions lies in providers
understanding the key issues facing the large industrial user. Most companies
now have intranets and support the use of HTTP and Internet Inter-ORB
Protocol (IIOP). Bioinformatics solutions must fit as seamlessly as
possible into this environment. Most companies need to implement integration
throughout the research operation. Most industrial bioinformatics teams
devote some resources to development and maintenance of internal web
servers that replicate the services available at public bioinformatics sites. Two
companies, NetGenics Inc. and Pangea Systems Inc., provide bioinformatics
systems that offer the prospect of service integration via the intranet.
SYNERGY
SYNERGY, developed by NetGenics, Inc., Cleveland, Ohio, is an object-oriented
approach using Java, CORBA, and an object-oriented database, to implement a
flexible environment for managing bioinformatics projects. SYNERGY
integrates standard tools into its portfolio through the use of CORBA
‘Wrappers’, which present a streamlined interface between the tool and the
SYNERGY system. In this way, the developers are able to incorporate a
number of standard programs very rapidly and users of the system are able to
incorporate their own tools by implementing CORBA wrappers in-house.
Pangea Systems
GeneMill, GeneWorld and GeneThesaurus are products developed by Pangea
Systems Inc., Oakland, California. These are web-based tools that are back-
ended by a relational database. The overall system is aimed at high-
throughput sequencing projects and other large-scale industrial genomics
projects, including, for example, GeneMill, a sequencing workflow database
system for managing sequencing projects; GeneWorld, a tool for analysis of
DNA and protein sequences; and GeneThesaurus, a sequence and annotation
data subscription service, allowing access to public data and integration with
proprietary data. The system is modular and allows interfaces to in-house
software to be built easily, using an open programming interface, PULSE
(Pangea’s Unified Life Science Environment).
EMBOSS Package
European Molecular Biology Open Software Suite (EMBOSS) is an integrated
set of packages and tools for sequence analysis being specifically developed
for the needs of the Sanger Centre and the EMBnet user communities.
Applications of the package include EST clustering, rapid database searching
with sequence patterns, nucleotide sequence pattern analysis, codon usage
analysis, gene identification tools and protein motif identification.
Alfresco Package
Alfresco is a visualization tool that is being developed for comparative genome
analysis, using ACEDB for data storage and retrieval. The program compares
multiple sequences from similar regions in different species, and allows
visualization of results from existing analysis programs, including those for
gene prediction, similarity searching, regulatory sequence prediction, etc.
DALI Program
The DALI (Distance matrix ALIgnment) program is used to identify proteins with
folding patterns similar to that of a query structure. L. Holm and C. Sander
wrote this program. It runs fast enough to carry out routine screens of the
entire Protein Data Bank for structures similar to a newly determined structure,
and even to perform a classification of protein domain structures from an all-
against-all comparison.
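The distance matrix on which this kind of comparison rests is simple to compute: it holds the distance between every pair of residues within one structure, usually taken between C-alpha atoms. The sketch below builds such a matrix and compares two matrices position by position. It is not the DALI algorithm, which must also discover which residues correspond to each other; the coordinates, the function names and the comparison measure are assumptions made for illustration.

```python
import math


def distance_matrix(coords):
    """Pairwise distances between all points of one structure."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def mean_abs_difference(matrix_a, matrix_b):
    """Average absolute difference between two equal-sized distance matrices."""
    n = len(matrix_a)
    if n != len(matrix_b):
        raise ValueError("matrices must have the same size")
    if n < 2:
        return 0.0
    total = sum(
        abs(matrix_a[i][j] - matrix_b[i][j])
        for i in range(n) for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Three invented C-alpha positions and a rigidly shifted copy of them.
    coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in coords]
    print(distance_matrix(coords))
    print(mean_abs_difference(distance_matrix(coords), distance_matrix(shifted)))
```

Because internal distances do not change when a molecule is moved or rotated, two copies of the same fold give the same matrix without first being superimposed, which is what makes the representation convenient for database screening.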
To meet the need for effective software techniques for data analysis, many
software packages have been developed. These packages are highly specific in
their approach and can be easily loaded as per the requirements of the user
(Table 5.2).
Table 5.2: Some well known packages with a set of tools for DNA and protein
sequence analysis
Package : Scope
Staden : Analyses of DNA and protein sequences. It has a windowing interface
    for UNIX workstations.
GeneMill, GeneWorld, GeneThesaurus : GeneMill manages sequencing projects.
    GeneWorld analyses DNA and protein sequences. GeneThesaurus allows access
    to public data and integration with proprietary data.
Lasergene : Coding analysis, pattern and site matching, structure and
    composition analysis of RNA/DNA, restriction site analysis, PCR primer and
    probe design, sequence editing, sequence assembly, multiple and pairwise
    sequence alignment, helical wheel and net creation, and database searching.
Synergy : An object-oriented package that uses Java, CORBA and an
    object-oriented database to implement a flexible environment for managing
    bioinformatics projects.
CINEMA : Colour INteractive Editor for Multiple Alignments, an internet package
    written in Java; provides facilities for motif identification, database
    searching (using BLAST), 3D structure visualization, generation of
    dotplots and hydropathy profiles, and six-frame translation.
EMBOSS : The European Molecular Biology Open Software Suite, specifically
    developed for easy integration of other public domain packages and
    other applications such as EST clustering, nucleotide sequence pattern
    analysis, codon usage analysis, gene identification tools, protein motif
    identification and rapid database searching with sequence patterns.
EGCG : Developed by Genetics Computer Group, Wisconsin, an extended
    version of GCG; has more than 70 programs including fragment
    assembly, mapping, database searching, multiple sequence analysis,
    pattern recognition, nucleotide and protein sequence analysis,
    evolutionary analysis, etc.
ExPASy : ExPASy is the SIB Bioinformatics Resource Portal, which provides
    access to scientific databases and software tools (i.e., resources) in
    different areas of life sciences including proteomics, genomics,
    phylogeny, systems biology, population genetics, transcriptomics, etc.
KEGG : KEGG is a database resource for understanding high-level functions
    and utilities of the biological system, such as the cell, the organism and
    the ecosystem, from molecular-level information, especially large-scale
    molecular datasets generated by genome sequencing and other
    high-throughput experimental technologies.
5.7 USE OF DATABASES
The available information on the biological function of particular sequences in
model organisms may be exploited to predict the function of similar genes in
other organisms. The sequence of the gene of interest is compared to every
sequence in a sequence database, and the similar ones are identified. If a query
sequence can be readily aligned to a database sequence of known function,
structure or biochemical activity, the query sequence is predicted to have the
same function, structure or biochemical activity. As a rough rule, if more than
one-half of the amino acid sequence of query and database proteins is identical
in the sequence alignments, the prediction is very strong.
A common reason for performing a database search with a query
sequence is to find a related gene in another organism. For a query sequence of
unknown function, a matched gene may provide a clue to the function.
Alternatively, a query sequence of known function may be used to search
through sequences of a particular organism to identify a gene that may have
the same function.
Web addresses:
GCG : http://www.gcg.com/
EGCG : http://www.sanger.ac.uk/software.EGCG/
Staden : http://www.mrc-lmb.cam.ac.uk/pubseq/
NetGenics : http://www.netgenics.com/
Pangea Systems : http://www.pangeasystems.com/
CINEMA : http://www.bioinchem.ucl.ac.uk/bsm/dbbrowser/CENEMA2.1
EMBOSS : http://www.sanger.ac.uk/Software/EMBOSS/
Alfresco : http://www.sanger.ac.uk/Users/nic/alfresco.html
STUDY QUESTIONS
1. What are databases?
2. What are the types of databases?
3. What are the functions of databases?
4. What are the nucleic acid sequence databases? Give some examples.
5. What are protein sequence databases? Give some examples.
6. What are the protein sequence databases maintained by PIR?
7. What are structure databases? Give some examples.
8. What is a bibliographic database? Give some examples.
9. What is a virtual library?
10. Give the names of some specialized analysis packages and their uses.
11. What is a database management system?
12. What are the types of database management systems?
13. What is data mining?
14. What are the goals of Ensembl?
CHAPTER 6
Sequence Alignment
The method used to analyze the similarities and differences at the level of
individual bases or amino acids with the aim of inferring structural,
functional and evolutionary relationships among the sequences is called
sequence alignment.
In simple words it is the identification of residue-residue
correspondence; any assignment of correspondence that preserves the order of
the residues within the sequences is an alignment.
The sequences of biological macromolecules are the products of
molecular evolution. When the sequences share a common ancestral
sequence, they tend to exhibit similarity in their sequences, structures and
biological functions. When a new sequence is found whose function is not
known, but, if similar sequences could be found in the databases for which
functional or structural information is available, then this can be used as a
basis of a prediction of function or structure of the new sequence.
Sequence alignment is the procedure of comparing two (pairwise
alignment) or more (multiple sequence alignment) sequences by searching
for a series of individual characters or character patterns that are in the same
order in the sequences. Two sequences are aligned by placing them in two
rows. Identical or similar characters are placed in the same column.
Nonidentical or dissimilar characters are either placed in the same column as
a mismatch, or may be placed opposite a gap in the other sequence.
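The two-row layout described above can be sketched as follows. Given two rows that have already been aligned, with "-" marking a gap, the example prints them with a vertical bar under each identical column and counts the matches, mismatches and gaps. The function names and the example rows are invented for illustration.

```python
def format_alignment(row_a: str, row_b: str) -> str:
    """Return the two rows with a line of '|' marks under identical columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    marks = "".join(
        "|" if a == b and a != "-" else " " for a, b in zip(row_a, row_b)
    )
    return "\n".join([row_a, marks, row_b])


def column_summary(row_a: str, row_b: str) -> tuple:
    """Count (matches, mismatches, gaps) over the alignment columns."""
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gaps += 1        # a character placed opposite a gap
        elif a == b:
            matches += 1     # identical characters in the same column
        else:
            mismatches += 1  # different characters in the same column
    return matches, mismatches, gaps


if __name__ == "__main__":
    row_a = "ACGTTGCA-T"
    row_b = "AC-TTGGACT"
    print(format_alignment(row_a, row_b))
    print(column_summary(row_a, row_b))
```

Removing the "-" characters from each row recovers the original sequences, which is the sense in which an alignment preserves the order of the residues.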
The advent of high-throughput automated fluorescent DNA sequencing
technology has led to the rapid accumulation of sequence information and
provides the basis for abundant computationally derived protein sequence
data. Analysis of DNA sequences can throw light on phylogenetic
relationships, restriction sites, intron/exon prediction and gene structure and
protein coding sequence through open reading frame analysis.
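Open reading frame analysis, mentioned above, can be sketched very simply: scan each of the three forward reading frames for a start codon followed, in the same frame, by a stop codon. The example below assumes the standard start codon ATG and stop codons TAA, TAG and TGA, examines the forward strand only, and uses an invented sequence. Practical gene finders also examine the reverse strand and use statistical evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list:
    """Return (start, end) index pairs of ORFs on the forward strand.

    `start` is the index of the A of ATG; `end` is one past the stop codon.
    `min_codons` is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATTTGGGTAACC"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```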
6.1 ALGORITHM
An algorithm is a logical sequence of steps by which a task can be performed.
It is a set of rules for calculating or solving a problem which normally is
Genetic Algorithm
The genetic algorithm is a general type of machine-learning algorithm
developed by computer scientists which has no direct relationship to biology.
It produces alignments by attempted simulation of the evolutionary changes
in sequences.
Fig. 6.1 Distinction between global and local alignments of two sequences
Sequences that are quite similar and approximately the same length are
suitable candidates for global alignments. Local alignments are more suitable
for aligning sequences that are similar along some of their lengths but
dissimilar in others, sequences that differ in length or sequences that share a
conserved region or domain.
In Fig. 6.1 the global alignment is stretched over the entire sequence
length to include as many matching amino acids as possible up to and
including the ends of sequences. Vertical bars between the sequences indicate
the presence of identical matches. In the local alignment, alignment stops at
the ends of regions of identity or strong similarity. Priority is given to finding
these local regions.
There are two types of alignment: global alignment and local alignment.
Global alignment considers the similarity across the full extent of the
sequence. Local alignment focuses on regions of similarity in parts of the
sequence only.
A search for local similarity may produce more biologically meaningful
and sensitive results than a search attempting to optimize alignment over
the entire sequence length because usually the functional sites are localized
to relatively short regions, which are conserved irrespective of deletions or
mutations in intervening parts of the sequence.
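The difference between the two types of alignment can be seen by computing both scores for the same pair of sequences. The sketch below uses the standard dynamic programming recurrences, commonly associated with the names Needleman-Wunsch (global) and Smith-Waterman (local), and returns only the optimal score, not the alignment itself. The scoring values (+1 for a match, -1 for a mismatch, -2 for each gap position) and the sequences are assumptions chosen for illustration.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global or local alignment score by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized along the borders.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)    # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both sequences end to end.
    # Local: best score found anywhere in the table.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    a = "GGGGACGTACGT"
    b = "ACGTACGTCCCC"
    print("global:", alignment_score(a, b))
    print("local: ", alignment_score(a, b, local=True))
```

The two invented sequences share the region ACGTACGT but differ elsewhere. The local score rewards the shared region alone, whereas the global score is pulled down by the dissimilar ends, which is why local alignment is preferred for sequences that share only a conserved region or domain.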
Optimal Alignment
An optimal alignment is one that maximizes the score, that is, the one that
exhibits the most correspondences and the fewest differences. A suboptimal
alignment is one whose score falls below the
optimum. In an optimal alignment, non-identical characters and gaps
are placed to bring as many identical or similar characters as possible into
vertical register.
Optimal alignments provide useful information to biologists concerning
sequence relationships by giving the best possible information as to which
characters in a sequence should be in the same column in an alignment and
which are insertions in one of the sequences (or deletions in the other). This
information is important for making functional, structural and evolutionary
predictions on the basis of sequence alignment.
Fig. 6.2 Alignment of two sequences with vertical bars and gaps. Vertical bar (|) denotes
identical matches and horizontal bar (-) denotes gap.
Gaps and Mismatches
We could score an alignment simply by counting the number of positions at
which the two sequences match identically. More generally, the quality of an
alignment can be measured in terms of the number of gaps introduced and the
number of mismatches remaining. A comprehensive alignment must account fully
for the positions of all residues in both sequences. This means that gaps
may have to be placed wherever the residues do not correspond exactly, so
that the gaps become more numerous and their positioning more complex. If
gaps are allowed without restriction, the algorithms produce alignments
containing very large proportions of matching letters and large numbers of
gaps. Although such an alignment achieves the optimum score and is
mathematically meaningful, it would be biologically meaningless,
because insertion and deletion of monomers is a relatively slow evolutionary
process. Dynamic programming algorithms therefore use gap penalties to keep
the alignment biologically meaningful. A simple score contains a positive
additive contribution of 1 for every matching pair of letters in the
alignment, and a gap penalty is subtracted for each gap that has been
introduced. There are different kinds of gap penalty: a constant penalty, a
proportional penalty, and an affine gap penalty, which combines a gap
opening penalty and a gap extension penalty.
The total alignment score is then a function of the identity between aligned
residues and the gap penalties incurred.
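The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` and the default penalty values are assumptions chosen for the example, and mismatches simply score 0, as in the simple scheme in the text.

```python
def alignment_score(row_a, row_b, match=1, gap_open=2, gap_extend=1):
    """Score two rows of a finished alignment ('-' marks a gap).

    Each identical pair adds `match`. Each run of gaps costs `gap_open`
    for its first position and `gap_extend` for every further position
    (an affine penalty). Setting gap_extend equal to gap_open gives a
    proportional penalty; setting gap_extend to 0 gives a constant
    penalty per gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_side = None  # row holding the current gap run: 'a', 'b' or None
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening one.
            score -= gap_extend if gap_side == side else gap_open
            gap_side = side
        else:
            gap_side = None
            if x == y:
                score += match
    return score
```

With the default values, `alignment_score("a--t", "acgt")` counts two matches, one gap opening and one gap extension.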
Example
agtc
cgta      Hamming distance = 2

ag-tcc
cgctca    Levenshtein distance = 3
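The two distances in the example can be computed with the short sketch below. The example pairs are assumed here to be agtc versus cgta (Hamming) and agtcc versus cgctca (Levenshtein); the function names are illustrative. The Hamming distance counts mismatched positions between strings of equal length, while the Levenshtein distance counts the minimum number of substitutions, insertions and deletions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to turn string a into string b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # match or substitute
        prev = curr
    return prev[-1]
```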
Uses
Sequence alignment is useful to discover functional, structural and
evolutionary information in biological sequences. It is important to obtain the
best possible or optimal alignment to discover this information. Sequences
that are very similar probably have the same function, be it a
regulatory role in the case of similar DNA molecules, or a similar
biochemical function and three-dimensional structure in the case of proteins.
Additionally, if two sequences from different organisms are similar,
there may have been a common ancestor sequence, and the sequences are
said to be homologous. The alignment indicates the changes that could have
occurred between two homologous sequences and a common ancestor
sequence during evolution.
Database similarity searching allows us to determine which of the
hundreds of thousands of sequences present in the database are potentially
related to a particular sequence of interest. The first discovery of this
kind was made in 1983, when Doolittle and Waterfield found that the viral
oncogene v-sis is a modified form of the normal cellular gene
that encodes platelet-derived growth factor. Dynamic programming algorithms
find the best alignment of two sequences for given substitution matrices and
gap penalties. This process is often very slow.
the other that the change occurred because of random sequence variation of no
biological significance (denominator). Odds ratios are converted to logarithms
to give log odds score for convenience in multiplying odds scores of amino
acid pairs in an alignment by adding the logarithms.
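As a rough illustration of why logarithms are convenient, the sketch below computes a log-odds score from a target frequency and two background frequencies, so that multiplying odds corresponds to adding scores. The function name, the base-2 logarithm and the frequencies used in the test are assumptions for illustration only, not values from any published matrix.

```python
import math


def log_odds(target_freq, background_a, background_b, base=2.0):
    """Log-odds score for aligning residues a and b.

    target_freq  : frequency with which a and b are found aligned in
                   related sequences (the numerator of the odds ratio)
    background_* : frequencies of a and b in sequences generally; their
                   product is the chance expectation (the denominator)
    """
    odds = target_freq / (background_a * background_b)
    return math.log(odds, base)


def alignment_log_odds(pairs):
    """Sum the log-odds of aligned pairs, which equals the logarithm of
    the product of their odds ratios. `pairs` holds tuples of
    (target_freq, background_a, background_b)."""
    return sum(log_odds(q, pa, pb) for q, pa, pb in pairs)
```

A positive score means the pair is seen more often than chance predicts; a negative score means it is seen less often.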
Each PAM matrix is designed to score alignments between sequences
that have diverged by a particular degree of evolutionary distance.
Dayhoff and coworkers were the first to use a log-odds approach in
which the substitution scores in the matrix are proportional to the natural log
of the ratio of target frequencies to background frequencies. To estimate the
target frequencies, pairs of very closely related sequences are used to collect
mutation frequencies corresponding to 1 PAM, and these data are used to
extrapolate to a distance of 250 PAMs. (Note that PAM matrices are derived
by counting observed evolutionary changes in closely related protein
sequences, and then extrapolating the observed transition probabilities to
longer evolutionary distances.) It is possible to derive PAM matrices for any
evolutionary distance, but in practice the most commonly used matrices are
PAM120 and PAM250; of these two, the PAM250 matrix produces reasonable
alignments.
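The extrapolation step can be illustrated by repeated matrix multiplication: multiplying a 1 PAM transition matrix by itself n times gives the transition probabilities at n PAMs. The sketch below uses a hypothetical three-letter alphabet and invented probabilities purely to keep the example small; a real PAM matrix covers all 20 amino acids and is estimated from observed data.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def extrapolate(pam1, n):
    """Return the transition matrix for n PAMs (n >= 1) by multiplying
    the 1 PAM matrix by itself."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical 1 PAM-like matrix for a toy three-letter alphabet.
# Each row sums to 1 and about 1% of residues change per step.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = extrapolate(TOY_PAM1, 250)
```

After 250 steps the diagonal entries of the toy matrix are far below 0.99, showing how the chance that a residue is unchanged decays with evolutionary distance.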
Dot Matrix
A dot matrix analysis is primarily a method for comparing two sequences to
look for possible alignment of characters between the sequences. The method
is also used for finding direct or inverted repeats in protein and DNA
sequences, and for predicting regions in RNA that are self-complementary
and that have potential of forming secondary structure.
The major advantage of the dot matrix method for finding sequence
alignments is that all possible matches of residues between two sequences are
found, leaving the investigator the choice of identifying the most significant
ones. Then sequences of the actual region that align can be detected by using
other methods of sequence alignment, e.g. dynamic programming.
Alignments generated by these programs can be compared to the dot matrix
alignment to find out whether the longest regions are being matched and
whether insertions and deletions are located in the most reasonable places.
Detection of matching regions may be improved by filtering out
random matches in a dot matrix. Filtering is achieved by using a sliding
window to compare the two sequences at the same time. Identification of
sequence alignments by the dot matrix method can be aided by performing a
count of dots in all possible diagonal lines through the matrix to determine
statistically which diagonals have the most matches, and comparing these
match scores with the results of random sequence comparison.
Dot matrix analysis can also be used to find direct and inverted repeats
within sequences. Repeated regions in whole chromosomes may be detected.
Direct repeats may also be found by performing sequence alignments with
dynamic programming methods. A dot matrix analysis can also reveal the
presence of repeats of the same sequence character.
The dot matrix method displays any possible sequence alignments as
diagonals on the matrix. Dot matrix analysis can readily reveal the presence
of insertions/deletions and direct and inverted repeats that are more
difficult to find by the other, more automated methods.
A dotplot is a simple visual approach to comparing two sequences. It is a
table or matrix that gives a quick pictorial statement of the relationship between
two sequences. The two sequences to be compared are plotted on the X and Y
axes of a graph. Wherever a base or residue on one axis coincides with a base
or residue on the other axis, the position is marked with a dot. The plot is characterized
by some apparently random dots and a central diagonal line where a high
density of adjacent dots indicates the regions of greatest similarity between the
two sequences (Fig. 6.3).
[Dotplot matrix; horizontal axis: MTFRDLLSVSFEGPRPDSSAGGSSAGG; each X marks a pair of identical residues]
Fig. 6.3 Illustration of the manner of construction of the dotplot matrix, using a simple
residue identity matrix to score an ‘X’ where a pair of identical residues is observed.
(Source: Atwood, T.K. and Parry-Smith, D.J., Introduction to Bioinformatics, Pearson
Education Ltd., 2001)
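A dotplot of this kind, together with the sliding-window filter described earlier, can be sketched as follows. The function names and the window settings are illustrative assumptions; the sequence is the one shown along the axis of Fig. 6.3, compared here with itself.

```python
def dotplot(seq_x, seq_y, window=1, min_matches=1):
    """Return a matrix (list of rows) of booleans. Cell [i][j] is True
    when the windows starting at seq_y[i] and seq_x[j] share at least
    `min_matches` identical positions. With window=1 this is the plain
    residue identity plot."""
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [[sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
             >= min_matches
             for j in range(cols)]
            for i in range(rows)]


def render(matrix, seq_x, seq_y):
    """Draw the matrix as text, with an X for every dot."""
    lines = ["  " + seq_x]
    for i, row in enumerate(matrix):
        lines.append(seq_y[i] + " " + "".join("X" if hit else " " for hit in row))
    return "\n".join(lines)


SEQ = "MTFRDLLSVSFEGPRPDSSAGGSSAGG"
plain = dotplot(SEQ, SEQ)                             # every identical pair
filtered = dotplot(SEQ, SEQ, window=3, min_matches=3) # runs of three only
```

Calling `print(render(plain, SEQ, SEQ))` shows the main diagonal plus scattered dots; in `filtered` most of the scattered dots disappear, while the short diagonals produced by the repeated SSAGG segment remain.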
Dynamic Programming
Dynamic programming is a computational method that is used to align two
protein or nucleic acid sequences. The method is very important for sequence
analysis because it provides the very best alignment or optimal alignment
between sequences.
The method compares every pair of characters in the two sequences and
generates an alignment. This alignment will include matched and
mismatched characters and gaps in the two sequences that are positioned so
that the number of matches between identical or related characters is the
maximum possible. The dynamic programming algorithm provides a reliable
computational method for aligning DNA and protein sequences. Both global
and local types of alignments may be made by simple changes in the basic
dynamic programming algorithm.
A global alignment program is based on the Needleman-Wunsch
algorithm and a local alignment program is based on the Smith-Waterman
algorithm. Another feature of the dynamic programming algorithm is that the
alignments obtained depend on the choice of a scoring system for comparing
character pairs and on the penalty scores for gaps. For protein sequences, the
simplest system of comparison is one based on identity: a match in an alignment is
scored only if the two aligned amino acids are identical.
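The following sketch shows how small the change between global and local dynamic programming is. It fills the scoring matrix and returns only the best score, without tracing back the alignment itself. The function name, the identity-based scores and the linear gap penalty are assumptions chosen for illustration; real programs normally use substitution matrices and affine gap penalties.

```python
def dp_align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False follows the Needleman-Wunsch (global) scheme;
    local=True follows the Smith-Waterman (local) scheme.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a score never drops below zero, and the
                # best cell anywhere in the matrix is the answer.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]
```

The only differences between the two modes are the initialization of the first row and column, the floor at zero, and where the final score is read from.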
Procedure
Go to the EBI alignment site (www.ebi.ac.uk/align). Once the home page appears,
select the method, local or global. Paste the sequence of interest in the text box.
Then press the RUN button.
Word or k-Tuple
The word or k-tuple methods are used by the FASTA and BLAST algorithms. They
align two sequences very quickly, first by searching for identical short
stretches of sequence called words or k-tuples, and then by joining these
words into an alignment by the dynamic programming method. These
methods are fast enough to be suitable for searching an entire database for
the sequence that aligns best with an input test sequence. The FASTA and
BLAST methods are heuristic, i.e., they are empirical methods of computer
programming in which rules of thumb are used to find solutions and
feedback is used to improve performance.
In database searching, the basic operation is to align the query sequence
to each of the subject sequences in the database. A method that can do this
quickly is therefore preferable to a full dynamic programming
search.
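The word-matching step can be sketched as a lookup table of k-letter words. The code below is a simplified illustration, not the FASTA or BLAST implementation: the function names are hypothetical, and it stops at finding the diagonal with the most word hits, which is where a subsequent alignment step would concentrate.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every k-letter word in seq to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def shared_words(query, subject, k=2):
    """Return (query_pos, subject_pos) for every k-letter word that
    occurs in both sequences."""
    table = index_words(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in table.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Diagonal (subject_pos - query_pos) that holds the most word hits,
    or None when there are no hits."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return max(counts, key=counts.get) if counts else None
```

Because the subject is indexed once, each query word is found by a table lookup rather than by comparing every pair of positions, which is the source of the speed-up.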
FASTA
FASTA is a DNA and protein sequence alignment software package. It was
first described by David J. Lipman and William R. Pearson in 1985 as FASTP
dealing with only protein sequences. In 1988 the ability to search DNA
sequences was added.
Procedure:
Open the internet browser and type the URL address:
http://fasta.adbj.nig.ac.jp/top.e.html. The results can be received at any
e-mail address.
FASTA compares a nucleotide sequence with a nucleotide sequence
database, or an amino acid sequence with an amino acid sequence database. It
compares a nucleotide sequence with an amino acid sequence database by
translating the query sequence in all six possible reading
frames. It compares an amino acid sequence with a nucleotide sequence database
by translating the database sequences in all six possible
reading frames. In a further mode, the same comparison of an amino acid
sequence with translated database sequences also takes frame-shift
mutations into account.
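Six-frame translation can be sketched as below: three reading frames on the given strand and three on its reverse complement. This is an illustrative sketch using the standard genetic code, with hypothetical function names; it does not handle frame-shifts or ambiguity codes (unknown codons are written as X).

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna ('*' marks a stop)."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the translations of all six reading frames, keyed '+1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

For example, `six_frames("ATGGCCTAA")["+1"]` gives the peptide read from the first base of the forward strand.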
Several parameters must be specified for a search: the database and the
division in which homologous sequences are searched; the number of
homologous sequences reported in the list of homology scores (the default
value is 100); the number of alignments with homologous sequences reported
(the default value is 100); and the degree of sensitivity (ktup) of the
search. The recommended ktup value is 3-6 for nucleotide sequences and 1-2 for
amino acid sequences. The smaller the ktup value, the more sensitive the search. The
ktup value determines how many consecutive identities are required for a
match to be declared.
BLAST
BLAST (Basic Local Alignment Search Tool) program was developed by
Altschul et al. in 1990. It has become very popular because of its efficiency
and firm statistical foundation. BLAST works under the assumption that
high-scoring alignments are likely to contain short stretches of identical or
near identical letters. These short stretches are called words.
The first step in BLAST is to look for words of a certain fixed word length
W that score higher than a certain threshold score (T). The value of W is
normally 3 for protein sequences or 11 for nucleic acid sequences. BLAST takes
a word from the query sequence and extends the match in
both directions along the target sequence, totalling the scores for
matches, mismatches, gap introduction and gap extension. BLAST extends
each individual word match until the total score of the alignment falls from its maximum
value by a certain amount, producing high-scoring segment pairs. The extended
segment pairs are assessed against a cut-off score S.
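The seed-and-extend idea can be sketched as follows. This is a deliberately simplified illustration and not the BLAST implementation: seeds are exact word matches (real protein searches also accept similar words scoring above T), extension is ungapped, scoring is identity-based, and the names `extend`, `find_hsps`, `x_drop` and `min_score` are assumptions made for the example.

```python
def extend(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit of length w in both directions without gaps.

    Extension in each direction stops once the running score has fallen
    x_drop below the best score seen. Returns (score, q_start, q_end),
    where q_start:q_end is the matched slice of the query.
    """
    def walk(q, s, step):
        best = run = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            run += match if query[q] == subject[s] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run >= x_drop:
                break
            q += step
            s += step
        return best, best_len

    seed = sum(match if query[qi + k] == subject[si + k] else mismatch
               for k in range(w))
    right, right_len = walk(qi + w, si + w, +1)
    left, left_len = walk(qi - 1, si - 1, -1)
    return seed + left + right, qi - left_len, qi + w + right_len


def find_hsps(query, subject, w=3, min_score=4):
    """Return segment pairs scoring at least min_score, best first, as
    tuples of (score, q_start, q_end, diagonal)."""
    hsps = set()
    for qi in range(len(query) - w + 1):
        word = query[qi:qi + w]
        si = subject.find(word)
        while si != -1:
            score, q_start, q_end = extend(query, subject, qi, si, w)
            if score >= min_score:
                hsps.add((score, q_start, q_end, si - qi))
            si = subject.find(word, si + 1)
    return sorted(hsps, reverse=True)
```

Several seeds on the same diagonal usually extend to the same segment pair, which is why the results are collected in a set.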
BLAST is a heuristic search algorithm employed by different BLAST
programs such as BLASTP, BLASTN, BLASTX, TBLASTN, TBLASTX and PSI-BLAST.
BLASTP compares an amino acid query sequence against a protein sequence
database. BLASTN compares a nucleotide query sequence against a nucleotide
sequence database. BLASTX compares the six-frame conceptual translation
products of a nucleotide query sequence (both strands) against a protein
sequence database. TBLASTN compares a protein query sequence against a
nucleotide sequence database dynamically translated in all six reading frames
(both strands). TBLASTX compares the six-frame translations of a nucleotide
query sequence against the six-frame translations of a nucleotide sequence
database. PSI-BLAST compares an amino acid query sequence against a protein
sequence database.
The FASTA and BLAST programs are essentially local similarity search
methods that concentrate on finding short identical matches, which contribute
to a total match.
Methods
Many methods are available for applying multiple sequence alignments of
known proteins to identify related sequences in database searches. Some
important methods are: profiles, Blocks, Fingerprints, PSI-BLAST and
Hidden Markov Models (HMMs).
Profiles
Proteins of similar function usually share identical motifs. Therefore, motif
prediction is more useful than trying to find similarity across the entire sequence of
the protein. Proteins of similar or comparable function are usually descendants of
a common ancestral protein. Often they share some amount of similarity in
the sequence, particularly in the motifs. A multiple sequence alignment usually
reveals such families of proteins. Multiple alignments of this kind are often
called profiles.
A profile expresses the patterns inherent in a multiple sequence
alignment of a set of homologous sequences. They have several applications:
• They permit greater accuracy in alignments of distantly-related
sequences.
• Sets of residues that are highly conserved are likely to be part of the
active site, and give clues to function.
• The conservation patterns facilitate identification of other homologous
sequences.
• Patterns from the sequences are useful in classifying subfamilies
within a set of homologues.
• Sets of residues that show little conservation, and are subject to
insertion and deletion, are likely to elicit antibodies that will
cross-react well with the native structure.
• Most structure-prediction methods are more reliable if based on a
multiple sequence alignment than on a single sequence. Homology
modeling, for example, depends crucially on correct sequence
alignment.
To use profile patterns to identify homologs, the basic idea is to match the
query sequences from the database against the sequences in the alignment
table, giving higher weight to positions that are conserved than to those that
are variable.
In the profiles database, there is a distilling of the sequence information
available within complete alignments into scoring tables or profiles. Profiles
define which residues are allowed at given positions, which positions are
highly conserved and which degenerate; and which positions or regions can
tolerate insertions.
Once multiple sequence alignment is performed, a portion of the
alignment which is highly conserved is then identified and a type of scoring
matrix called a profile is produced. A profile includes scores for amino acid
substitutions and gaps (matches, mismatches, insertions, deletions) in each
column of the conserved region so that an alignment of the region to a new
sequence can be determined.
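A much reduced sketch of a profile is given below: one column of log-odds scores per alignment position, which is then slid along a new sequence to find the best-matching region. The function names, the uniform background frequencies and the pseudocount are assumptions for illustration, and the gap (insertion and deletion) scores that a full profile carries are omitted. A four-letter DNA alphabet is used in the usage note only to keep the example short.

```python
import math
from collections import Counter


def build_profile(alignment, alphabet, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Gap characters are ignored, a pseudocount avoids log(0), and the
    background frequency of every residue is assumed to be uniform.
    """
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in alphabet
        })
    return profile


def score_window(profile, segment):
    """Score a segment as long as the profile, column by column."""
    return sum(column[residue] for column, residue in zip(profile, segment))


def best_match(profile, sequence):
    """Return (score, start) of the best-scoring ungapped window."""
    width = len(profile)
    return max((score_window(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))
```

For instance, a profile built from the toy alignment `["ACGT", "ACGA", "ACGT", "TCGT"]` with alphabet `"ACGT"` scores the window starting at position 2 of `"GGACGTGG"` highest. Conserved columns contribute large positive scores for the conserved residue, so they carry more weight than variable columns.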
BLOCKS
The blocks concept is derived from the motif, a conserved stretch of amino acids
that confers specific function or structure on the protein. If motifs of a protein
family are aligned without introducing gaps in the sequences, we get blocks.
In the BLOCKS database, conserved motifs, or blocks, are located by
searching for spaced residue triplets, and a block score is calculated using the
BLOSUM62 substitution matrix. The validity of blocks found by this method is
confirmed by the application of a second motif-finding algorithm, which
searches for the highest-scoring set of blocks that occur in the correct order
without overlapping. Blocks within a family are converted to
position-specific matrices, which are used to make independent database searches.
Like profiles, blocks represent conserved regions in the multiple
sequence alignment. Blocks differ from profiles in lacking insert and delete
positions in the sequences. Every column includes only matches and
mismatches (substituted positions without gaps).
Fingerprints
Within a sequence alignment, it is usual to find not one, but several motifs
that characterize the aligned family. Diagnostically, it makes sense to use
many or all of the conserved regions to create a signature or fingerprint, so
that in a database search, there is a higher chance of identifying a distant
relative, whether or not all parts of the signature are matched. Protein
fingerprints are groups of motifs that represent the most conserved regions of
multiple sequence alignments.
PSI-BLAST
PSI-BLAST (Position-Specific Iterated BLAST) incorporates elements of both
pairwise and multiple sequence alignment methods. Following an initial
database search, PSI-BLAST allows automatic creation of position-specific
profiles from groups of results that match the query above a defined
threshold. Running the program several times can further refine the profile
and increase search sensitivity.
HMMs
A Hidden Markov Model (HMM) is a statistical model that considers all
possible combinations of matches, mismatches, and gaps to generate an
alignment of a set of sequences. A localized region of similarity, including
insertions and deletions, may also be modeled by an HMM.
HMMs are probabilistic models consisting of a number of
interconnecting states: they are essentially linear chains of match, delete or
insert states, which can be used to encode sequence conservation within
alignments. HMMs are the basis of the Pfam database.
An HMM is a computational structure for describing the subtle patterns
that define families of homologous sequences. HMMs are powerful tools for
detecting distant relatives, and for prediction of protein folding patterns.
HMMs include the possibility of introducing gaps into the generated
sequence, with position-dependent gap penalties and they carry out the
alignment and the assignment of probabilities together.
Automatic Alignment
Central to sequence analysis is the multiple alignment. Consequently a vital
tool for the sequence analyst is an alignment editor. Several automatic
alignment programs are available now, either in a stand-alone form (such as
ClustalW) or as components of larger packages (such as Pileup in GCG). But
automatically calculated alignments almost invariably require some degree of
manual editing, whether to remove spurious gaps, to rescue residue windows,
or to correct misalignments. This often presents problems, as there is currently
no standard format for alignments.
Consequently, swapping between alignment programs is almost
impossible without the use of ad hoc scripts to convert between disparate
input and output formats. The advent of the object-oriented network
programming language, Java, addresses some of these problems. Java-capable
browsers may run applets on a variety of platforms. Applets are small
applications downloaded from a server via HTML pages; the software is loaded
on-the-fly from the server and cached for that session by the browser.
CLUSTAL
CLUSTAL performs a global multiple sequence alignment using the following
steps:
(i) Perform pairwise alignments of all of the sequences
(ii) Use the alignment scores to produce a phylogenetic tree
(iii) Align the sequences sequentially, guided by the phylogenetic
relationships indicated by the tree.
The CLUSTAL approach exploits the fact that similar sequences are likely to
be evolutionarily related. It aligns sequences in pairs, following the
branching order of a family tree. Similar sequences are aligned first and more
distantly related sequences are added later. Once pairwise alignment scores
for each sequence relative to all others have been calculated, they are used to
cluster the sequences into groups which are then aligned against each other
to generate the final multiple alignment.
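Steps (i) and (ii) above can be sketched as follows. This is an illustration under stated assumptions rather than the CLUSTAL code: pairwise similarity is approximated by a normalized edit distance instead of a scored pairwise alignment, clusters are joined by simple average linkage, and the function names are hypothetical. Step (iii), the progressive alignment itself, is not implemented; the sketch only produces the order in which sequences or groups would be aligned.

```python
from itertools import combinations


def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    separating strings a and b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def guide_order(seqs):
    """seqs maps names to sequences. Returns the list of merges, each a
    pair of clusters (tuples of names), in the order they are joined."""
    dist = {
        frozenset((x, y)): edit_distance(seqs[x], seqs[y])
        / max(len(seqs[x]), len(seqs[y]))
        for x, y in combinations(seqs, 2)
    }

    def average(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[frozenset((x, y))] for x in c1 for y in c2) / (
            len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges
```

The most similar pair is joined first and more distant groups are joined later, mirroring the branching order of the guide tree.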
CLUSTAL has been revised many times. CLUSTAL W uses the
positioning of gaps in closely related sequences to guide the insertion of gaps
into those that are more distant. Similarly, information compiled during the
alignment process about the variability of the most similar sequences is used
to help vary the gap penalties on a residue and position specific basis.
CINEMA
CINEMA is a Colour Interactive Editor for Multiple Alignments, written in
Java: the program allows creation of sequence alignments by hand,
generation of alignments automatically (e.g. using ClustalW), and
visualization and manipulation of sequence alignments currently resident at
different sites on the Internet. In addition to its special advantage of allowing
interactive alignment over the web, CINEMA provides links to the primary
data sources, thereby giving ready access to up-to-date data, and a gateway
to related information on the Internet.
CINEMA is more than just a tool for colour-aided alignment preparation.
The program also offers facilities for motif modification; database searching
(using BLAST); 3D-structure visualization (where co-ordinates are available),
allowing inspection of conserved features of alignments in a 3D context;
generation of dotplots and hydropathy profiles; six-frame translation; and so
on. The program is embedded in a comprehensive help-file (written in HTML)
and is accessible both as a stand-alone tool from the DbBrowser Bioinformatics
Web Server, and as an integral part of the PRINTS protein fingerprint
database.
READSEQ
READSEQ is a very useful sequence format conversion tool. D.G. Gilbert from
the Biology Department of Indiana University, USA programmed this in 1990
to read the formatted sequence files and convert the sequence information in
the files into another file that has a different format. It automatically detects
Procedure
Open the internet browser and type the URL address:
http://www.bimas.cit.nih.gov/molbio.readsec/. Pull down the drop-down menu and
select the desired format. Paste the sequence in the text box. Press the
SUBMIT or RUN button.
proteins, one can recognize patterns that allow one to infer relationships with
previously characterized families. Similarly, by searching fold libraries, which
contain templates of known structures, it is possible to recognize a previously
characterized fold.
Given the size of existing sequence databases, it is likely that searches
with new sequences will uncover homologues; and, with the expansion of
sequence pattern and structure template databases, the chances of assigning
functions and inferring possible fold families are also improving. However,
these advances in sequence and fold pattern recognition methods have not
yet been matched by similar advances in prediction techniques. So if one
cannot predict function or structure directly from sequence, but can identify
homologues and recognize sequence and fold patterns that have already
been seen, given the bewildering array of databases to search, how does one
use this information to build a sensible search method for novel sequences?
Essentially, one has to check identical matches and then move on to
search for closely similar sequences in the primary databases. The strategy
then involves searching for previously characterized sequence – and, where
possible, fold patterns in a variety of pattern databases. The final step is the
integration of results from all these searches to build a consistent family/
functional/structural diagnosis. An interactive WWW tutorial, known as
BioActivity, can be found at:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/prefacefrm.html
The first and fastest test to identify an unknown protein sequence
fragment is to perform an identity search, preferably of a composite sequence
database. OWL is a composite resource that can be queried directly by means
of its query language. Identity searches, which are suitable for peptides up to
30 residues in length, are possible via web interface; this provides an easy-to-
use form that conveniently shields the user from the syntax of the query
language. An identity search will reveal in a matter of seconds whether an
exact match to the unknown peptide already exists in the database. The
following website is useful:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/nucleicfrm.html
If an identity search fails to find a match, the next step is to look for
similar sequences, again preferably in a composite database. For best results it
is recommended to perform similarity searches on peptides that are longer
than 30 residues (the shorter the peptide, the greater the likelihood of finding
chance matches that have no biological relevance). In most applications as
much sequence information as possible should be used in a BLAST search
(although this can lead to complications in interpreting output from searches
with multi-domain or modular proteins).
There are several important features to note in the BLAST output. First,
one is looking for matches that have high scores with correspondingly low
probability values. A very low probability indicates that a match is unlikely to
have arisen by chance. As the probability values approach unity, they are
considered more and more likely to be random matches. The second feature
of interest is whether the results show a cluster of high scores (with low
probabilities) at the top of the list, indicating a likely relationship between the
query and the family of sequences in the cluster.
Heuristic search tools like BLAST do not always give clear-cut answers.
Frequently the program will not be able to assign significant scores to any of
its retrieved matches, even if a biologically relevant sequence appears in the
hit-list. Such search tools do not have the sensitivity always to fish out the
right answer from the vast amount of sequences in the primary database;
rather, they cast a coarse net, and it is then up to the user to pick out the best.
Under these circumstances, where no individual high-scoring sequence
or cluster of sequences, is found, the third feature to consider is whether
there are any observable trends in the type of sequences matched, i.e. do the
annotations suggest that several of these are from a similar family? If there
are possible clues in the annotations, the next step is to try to confirm these
possibilities both by reciprocal BLAST searches (do retrieved matches
identify the sequence in a similarity search?), and by comparing results from
searches of the secondary databases.
The first secondary database to consider is PROSITE. Within the tutorial, this
is accessible for searching via the ‘Protein sequence analysis - Secondary
database searches’ page:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
The database code is simply supplied to the relevant part of the form
and the option to exclude patterns with a high probability of occurrence (i.e.
rules) is switched on.
The next step is to search the ISREC profile library. In addition to the
profiles that have already been incorporated into the main body of PROSITE,
the web server offers a range of pre-release profiles that have not yet been
sufficiently documented for release through PROSITE. Searching the complete
collection of profiles is achieved, once again, by simply supplying the database
code to the web form, remembering to change the format button from the
default (plain text) to accept a SWISS-PROT ID:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
Another important resource to search is the Pfam collection of Hidden
Markov Models. Searching is achieved via a web interface that requires the
query sequence to be supplied to a text box:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
The sequence must be in FASTA format, which means that the query
must be preceded by the > symbol and a suitable sequence name.
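A minimal sketch of writing and reading this format is shown below. The function names and the 60-character line width are illustrative assumptions; the only convention taken from the text is that each record starts with a > symbol followed by a sequence name.

```python
def write_fasta(name, sequence, width=60):
    """Return a FASTA record: a '>' header line, then the sequence
    wrapped to `width` characters per line."""
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence names to sequences.
    The name is taken as the first word after the '>' symbol."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else "unnamed"
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records
```

For example, `write_fasta("query1", "MTFRDLLSVSFEGPRPDSSAGGSSAGG")` produces text that can be pasted directly into such a text box.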
Another key secondary resource is PRINTS, which provides a bridge
between single-motif search methods, such as the one used to compile
PROSITE, and domain-alignment/profile methods, such as those embodied in
the profile library and Pfam. PRINTS is accessible for searching via the
‘Protein sequence analysis – protein fingerprinting’ page:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein2frm.html
The output is divided into distinct sections. First, the program offers an
intelligent ‘guess’ based on the occurrence of the highest-scoring complete or
partial fingerprint match or matches. It then provides an expanded calculation
that shows the top 10 best-scoring matches. These include the
results from the intelligent guess, but the additional matches are
provided to highlight why the best guess was chosen, and to allow a different
choice if the guess is considered either to be wrong or to have missed
something. The remaining sections of output provide more of the new data,
again allowing the users to search for anything that might have been missed.
A particularly valuable aspect of this software is the facility to visualize
individual fingerprint matches by clicking on the graphic box.
The next secondary resources to be searched are the BLOCKS databases,
derived from PROSITE and PRINTS. If results matched in PROSITE and/or
PRINTS are true-positive, then we would expect these to be confirmed by the
BLOCKS search results. The BLOCKS databases are searched by supplying the
query sequence to the input box of the relevant web form:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
One must remember in each case to switch to the required database.
The accession codes in the Block column indicate the number of motifs;
matches to these motifs are ranked according to score. The ‘rank’ of the
best-scoring block, the so-called anchor block, is reported. Where additional blocks
support the anchor block by matching with high scores in the correct order, a
probability value is calculated, reflecting the likelihood of these matches
appearing together in that order. Often results are littered with matches to
high-scoring individual blocks. These matches are usually the result of
chance, and p-values are not calculated. The information content of particular
blocks can be visualized by examination of the sequence logo.
A sequence logo is a graphical display of a multiple alignment consisting
of colour-coded stacks of letters representing amino acids at successive
positions. The height of a given letter increases with increasing frequency of
the amino acid, and the height of the whole stack increases with increasing
conservation of the aligned position; hence, letters in stacks with single
residues (i.e. representing conserved positions) are taller than those in
stacks with multiple residues (i.e. where there is more variation).
Within stacks, the most frequently occurring residues are not only taller,
but also occupy higher positions in the stack, so that the most prominent
residue at the top is the one predicted to be the most likely to occur at that
position. To address the problem of sequence redundancy within a block, which
strongly biases residue frequencies, sequence weights are calculated using a
position-specific scoring matrix (PSSM). This reduces the tendency for
over-represented sequences to dominate stacks, and increases the
representation of rare amino acids relative to common ones.
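The relationship between conservation and stack height can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stack height is taken as the information content of the column (log2 of the alphabet size minus the Shannon entropy, the usual formulation for sequence logos), every sequence has equal weight, and no small-sample correction is applied. The function names are hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Assumption: unweighted sequences, no small-sample correction.
    """
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

def letter_heights(column, alphabet_size=20):
    """Height of each letter: its frequency times the stack height."""
    info = column_information(column, alphabet_size)
    n = len(column)
    return {aa: (c / n) * info for aa, c in Counter(column).items()}

# A fully conserved column gives a taller stack than a variable one.
print(column_information("LLLL"), column_information("LLVV"))
print(letter_heights("LLLV"))
```

A column containing only one residue reaches the maximum height, while a column split evenly between two residues loses exactly one bit.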
The final resource is IDENTIFY, which is searched by supplying the
query sequence to the relevant web form:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein1frm.html
We can find out more about the structure, either by following the links
embedded in the PROSITE and PRINTS entries or by supplying a relevant
PDB code in the query forms of the structure classification resources (such as
SCOP and CATH). SCOP is accessible for searching via the ‘protein structure
analysis–structure classification resources’ page:
http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/structurefrm.html
The CATH resource is queried by supplying the desired PDB code
to the relevant form on the same web page. Clicking on the hyperlinked PDB
code in the CATH summary takes the user to the PDBsum resource, a web-based
collection of information for all PDB structures. A picture of the overall
fold and secondary structure of the molecule is available here. Using this
pictorial information, one can begin to rationalize the results of the secondary
database searches in terms of structural and functional features of the 3D
molecule, essentially by superposing the motifs matched in PROSITE,
PRINTS and BLOCKS on to the sequence.
STUDY QUESTIONS
1. What is sequence alignment?
2. What are the goals of sequence alignment?
3. What are the types of sequence alignments?
4. How is dotplot analysis performed?
5. How is pairwise comparison done?
6. How are mutations, deletions and substitutions scored?
7. Which programs are used for pairwise database searching?
8. What is multiple sequence alignment?
9. Enumerate the key steps in building a multiple alignment.
10. Which are the programs used in multiple alignment?
11. How can one carry out a sequence search?
12. What is a string?
13. What is Hamming distance?
14. What is Levenshtein (edit) distance?
CHAPTER 7
Predictive Methods using DNA and Protein Sequences
Since sequencing whole genomes has been achieved with greater ease today,
deriving biological meaning from the long sequences of nucleotides that are
obtained through sequencing becomes a crucial biological research problem.
Annotation is a word that is commonly used today to mean ‘deriving useful
biological information’ from raw elements in genomic DNA (structural
annotation) and then assigning functions to these sequences (functional
annotation).
With the advent of whole-genome sequencing projects, there is
considerable use for computer programs that scan genomic DNA sequences to
find genes, particularly those that encode proteins. Once a new genome
sequence has been obtained, the most likely protein-encoding regions are
identified and the predicted proteins are then subjected to a database
similarity search.
Prediction is an important component of bioinformatics. Assignment of
structures to gene products is a first step in understanding how organisms
implement their genomic information. Prediction helps us to understand the
structures of the molecules encoded in a genome, their individual activities
and interactions, and the organization of these activities and interactions in
space and time during the lifetime of the organism.
Categories
Gene finding strategies can be grouped into three categories, namely, content-
based, site-based and comparative. Content-based methods rely on the overall,
bulk properties of a sequence in making a determination. Characteristics
considered here include how often particular codons are used, the periodicity
of repeats, and the compositional complexity of the sequence. Because different
organisms use synonymous codons with different frequency, such clues can
provide insight into determining regions that are more likely to be exons.
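One of the content-based characteristics mentioned above, codon usage, can be tabulated with a short script. The following is a minimal sketch, assuming the input is a coding sequence read in frame from its first base; the function name and the example sequence are hypothetical.

```python
from collections import Counter

def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore an incomplete last codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}

# GCT and GCC both encode alanine; their relative use is a content signal.
print(codon_frequencies("GCTGCTGCCAAA"))
```

Comparing such a table for a candidate region against the known codon preferences of the organism is one way a content-based method could score how exon-like the region is.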
[Figure: classes of gene product evidence (cDNA, EST, sequence motif, predictions from mRNA and its properties, predictions from docking site analysis programs) shown against gene features (promoter site, splice sites, translation termination sites, polyadenylation site).]
Fig. 7.1 The different forms of gene product evidence – cDNAs, ESTs, BLAST similarity hits,
codon bias, and motif hits – are integrated to make gene predictions. Where multiple
classes of evidence are found to be associated with a particular genomic DNA sequence,
there is a greater confidence in the likelihood that a gene prediction is accurate. (Source:
A.J.F. Griffiths et al., Modern Genetic Analysis, W.H. Freeman and Company, 2002).
Predictive Methods using DNA and Protein Sequences 7.3
Site-based methods focus their attention on the presence or absence of a
specific sequence, pattern, or consensus. These methods are used to detect
features such as donor and acceptor splice sites, binding sites for transcription
factors, poly A tracts and start and stop codons. Comparative methods make
determinations based on sequence homology. Hence translated sequences are
subjected to database searches against protein sequences to determine whether
a previously characterized coding region corresponds to the region in the
query sequence.
The simplest method of finding DNA sequences that encode proteins is
to search for open reading frames. An ORF is a length of DNA sequence that
contains a contiguous set of codons, each of which specifies an amino acid.
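A simple ORF scan can be sketched as follows. This is an illustration only, under several assumptions that are not stated in the text: an ORF is taken to begin at ATG and end at one of the standard stop codons (TAA, TAG, TGA), only the three forward reading frames are scanned, and the minimum length is an arbitrary parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # assumption: standard genetic code

def find_orfs(dna, min_codons=2):
    """Return (start, end) positions of ATG...stop ORFs in the forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it at the stop codon
    return orfs

sequence = "CCATGAAATTTTAGGG"                  # hypothetical example
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])
```

A fuller implementation would also scan the reverse complement, giving six reading frames in total.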
Web Addresses
FGENEX : http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID : http://www1.imim.es/geneid.html
GeneParser : http://beagle.colordo.edu/~eesnyder/GeneParser.html
GENSCAN : http://genes.mit.edu/GENSCAN.html
GRAIL : http://compbio.ornl.gov/tools/index.shtml
GRAIL-Exp : http://compbio.ornl.gov/grailexp/
HMMgene : http://www.cbs.dtu.dk/services/HMMgene/
MZEF : http://www.cshl.org/genefinder
PROCRUSTES : http://www-hto.usc.edu/software/procrustes
Prediction Methods
Modeling the structure of biological macromolecules allows us to gain a great
deal of insight into the molecule’s functional features. Modeling unknown
protein structures based on their homologs is known as homology-based
structural modeling. In this type of modeling, the experimentally determined
structures are generally referred to as the ‘templates’ and the sequence
homolog (a novel one) that lacks structural coordinates is called the ‘target’
sequence.
The homology-based protein modeling approach entails four sequential
steps. The first step involves the identification of known structures that are
related in sequence to the target sequence using BLAST. In the second step, the
potential templates are aligned with the target sequence to identify the closest
related template. In the third step, a model of the target sequence is calculated
from the most suitable template in step two. The fourth step involves the
evaluation of the modeled target sequence using different criteria.
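The second step, choosing the closest related template, can be illustrated with a small sketch. It assumes that each candidate template has already been aligned with the target (gaps written as "-") and that percent identity over aligned positions is the selection criterion; real modeling pipelines use richer criteria. The template names and sequences are hypothetical.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over positions where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def choose_template(alignments):
    """alignments: template name -> (aligned target, aligned template)."""
    return max(alignments, key=lambda name: percent_identity(*alignments[name]))

alignments = {
    "template_A": ("MKT-LLV", "MKTALLV"),
    "template_B": ("MKTLLV", "MRSLIV"),
}
print(choose_template(alignments))
```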
The knowledge of evolutionarily conserved structural features of similar
proteins from other species enables us to gain insight into the structure of the
target sequence.
The observation that each protein folds spontaneously into a unique three-
dimensional native conformation implies that nature has an algorithm for
predicting protein structure from amino acid sequence. Some attempts to
understand this algorithm are based solely on general physical principles;
others are based on observations of known amino acid sequences and protein
structures. A proof of our understanding would be the ability to reproduce the
algorithm in a computer program that could predict protein structure from
amino acid sequence.
Most attempts to predict protein structure from basic physical principles
alone try to reproduce the inter-atomic interactions in proteins, to define a
computable energy associated with any conformation. Computationally, the
problem of protein structure prediction then becomes a task of finding the
global minimum of this conformational energy function. So far this approach
has not succeeded, partly because of the inadequacy of the energy function
and partly because the minimization algorithms tend to get trapped in local
minima.
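The problem of local minima can be seen even in one dimension. The sketch below uses a made-up double-well function as a stand-in for a conformational energy (it is not a real protein energy function) and plain gradient descent as the minimizer. Starting from one side, the search settles in the shallower well and never finds the global minimum.

```python
def energy(x):
    """Toy 'energy' with two wells (hypothetical, for illustration only)."""
    return x**4 - 3 * x**2 + x

def gradient(x):
    return 4 * x**3 - 6 * x + 1

def minimize(x, step=0.01, iterations=5000):
    """Plain gradient descent from starting point x."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x

trapped = minimize(2.0)     # ends in the local minimum near x = 1.13
best = minimize(-2.0)       # ends in the global minimum near x = -1.30
print(trapped, energy(trapped))
print(best, energy(best))
```

In a real protein the energy surface has a very large number of dimensions and wells, which is why the choice of starting conformation and search strategy matters so much.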
The alternative to a priori methods is the approach based on assembling
clues to the structure of a target sequence by finding similarities to known
structures. These are empirical or knowledge-based methods.
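[Figure: Ramachandran-type plot with φ on the horizontal axis and ψ on the vertical axis, each running from –180 to 180; points labelled G and a region labelled αL are marked.]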
Chou-Fasman Method
The Chou-Fasman method is based on the assumption that each amino acid
individually influences secondary structure within a window of sequence. It is
based on analyzing the frequency of each of the 20 amino acids in α-helices,
β-strands and turns. To predict a secondary structure, the following set of rules
is used.
The sequence is first scanned to find a short sequence of amino acids that
has a high probability of starting a nucleation event that could form one type
of structure. For α-helices, a prediction is made when four of six amino acids
have a high probability (> 1.03) of being in an α-helix. For β-strands, a
prediction is made when three of five amino acids in a sequence have a
probability of > 1.00 of being in a β-strand. These nucleated regions are
extended along the sequence in each direction until the prediction values for
four amino acids drop below 1. If both α-helix and β-strand regions are
predicted, the higher probability prediction is used.
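The helix nucleation rule (four of six residues above 1.03) can be sketched as follows. The propensity values in the table are approximate figures as commonly quoted for the Chou-Fasman helix parameters and should be treated as illustrative assumptions; the function name is hypothetical, and the extension step is not implemented.

```python
# Approximate helix propensities (assumed values, for illustration only).
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}

def helix_nucleation_sites(sequence, window=6, needed=4, threshold=1.03):
    """Start positions of windows in which at least `needed` residues
    have a helix propensity above `threshold`."""
    sites = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        favourable = sum(1 for aa in segment if HELIX_PROPENSITY[aa] > threshold)
        if favourable >= needed:
            sites.append(start)
    return sites

print(helix_nucleation_sites("AEMLKAGPGS"))
```

The same scan with a window of five, a requirement of three and a threshold of 1.00 on a strand propensity table would give the β-strand nucleation sites.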
Turns are predicted a little differently. Turns are modeled as a
tetrapeptide, and two probabilities are calculated. First, the average of the
probabilities for each of the four amino acids being in a turn is calculated as
for the α-helix and β-strand prediction. Second, the probabilities of amino acid
combinations being present at each position in the turn tetrapeptide are
determined.
These probabilities for the four amino acids in the candidate sequence
are multiplied to calculate the probability that the particular tetrapeptide is a
turn. A turn is predicted when the first probability value is greater than the
probabilities for an α-helix and β-strand in the region and when the second
probability value is greater than 7.5 × 10⁻⁵.
GOR Method
GOR method is based on the assumption that amino acids flanking the central
amino acid residue influence the secondary structure that the central residue is
likely to adopt. It uses the principles of information theory to derive
predictions. Known secondary structures are scanned for the occurrence of
amino acids in each type of structure. The frequency of each type of amino acid
at the next 8 amino-terminal and carboxy-terminal positions is also
determined, making the total number of positions examined equal to 17,
including the central one.
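The bookkeeping behind the 17-position window can be sketched as below. This shows only the counting step: for every residue of known secondary structure, the amino acids found at offsets -8 to +8 are tallied against the state of the central residue. The conversion of these counts into information values is not shown, and the function name and toy data are hypothetical.

```python
from collections import defaultdict

HALF_WINDOW = 8   # 8 positions on each side plus the central one = 17

def count_window_frequencies(sequence, states):
    """Tally (central state, offset, amino acid) over one training protein."""
    counts = defaultdict(int)
    for i, state in enumerate(states):
        for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
            j = i + offset
            if 0 <= j < len(sequence):        # skip positions off the ends
                counts[(state, offset, sequence[j])] += 1
    return counts

# Toy training example: H = helix, C = coil.
counts = count_window_frequencies("AAG", "HHC")
print(counts[("H", 0, "A")], counts[("C", -1, "A")])
```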
Neural Network Prediction
In the neural network approach, computer programs are trained to
recognize amino acid patterns that are located in known secondary structures
and to distinguish these patterns from other patterns not located in these
structures. In theory, these neural network models can extract more
information from sequences. PHD and NNPREDICT are two neural network
programs. Neural network models are meant to simulate the operation of the
brain.
Nearest-neighbor Prediction
Like neural networks, nearest-neighbor methods are a type of
machine learning method. They predict the secondary structural conformation
of an amino acid in the query sequence by identifying sequences of known
structure that are similar to the query sequence. A large list of short sequence
fragments is made by sliding a window of varied length along a set of
approximately 100-400 training sequences of known structure but with
minimal sequence similarity to each other, and the secondary
structure of the central amino acid in each window is recorded. A window of
the same size is selected from the query sequence and compared to each of the
above sequence fragments, and the 50 best matching fragments are identified.
The frequencies of the known secondary structure of the middle amino acid in
each of these matching fragments are then used to predict the secondary
structure of the middle amino acid in the query window.
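The procedure can be sketched in a few lines. This is a simplified illustration: fragments are compared by counting identical residues (real programs use substitution or environment scores), the most frequent state among the best matches is returned, and the training data are hypothetical.

```python
from collections import Counter

def build_fragments(training, window):
    """Slide a window along each (sequence, structure) pair and record the
    fragment together with the state of its central residue."""
    half = window // 2
    fragments = []
    for sequence, states in training:
        for i in range(len(sequence) - window + 1):
            fragments.append((sequence[i:i + window], states[i + half]))
    return fragments

def predict_center(query_window, fragments, k=50):
    """Predict the state of the middle residue from the k best matches."""
    def matches(fragment):
        return sum(a == b for a, b in zip(fragment, query_window))
    best = sorted(fragments, key=lambda f: matches(f[0]), reverse=True)[:k]
    votes = Counter(state for _, state in best)
    return votes.most_common(1)[0][0]

training = [("AAAAAGGGGG", "HHHHHCCCCC")]      # toy data: H = helix, C = coil
fragments = build_fragments(training, window=5)
print(predict_center("AAAAA", fragments, k=1))
print(predict_center("GGGGG", fragments, k=1))
```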
Table 7.1: Helical and strand propensities of the amino acids. A value of 1.0
indicates that the preference of that amino acid for the particular secondary
structure is equal to that of the average amino acid; values greater than one indicate
a higher propensity than the average; values less than one indicate a lower
propensity than the average (The values are calculated by dividing the frequency
with which the particular residue is observed in the relevant secondary structure by
the frequency for all residues in that secondary structure).
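The calculation described in the caption can be written out as a short function. This is a sketch on toy data: the propensity is the fraction of a given residue found in a given state divided by the fraction of all residues found in that state. The function name and example strings are hypothetical.

```python
def propensity(sequence, states, residue, state):
    """(fraction of `residue` in `state`) / (fraction of all residues in `state`)."""
    n_total = len(sequence)
    n_residue = sequence.count(residue)
    n_state = states.count(state)
    n_both = sum(1 for aa, s in zip(sequence, states)
                 if aa == residue and s == state)
    return (n_both / n_residue) / (n_state / n_total)

# Toy data: both alanines are helical, only one of two glycines is.
print(propensity("AAGG", "HHHC", "A", "H"))   # above 1: favours helix
print(propensity("AAGG", "HHHC", "G", "H"))   # below 1: disfavours helix
```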
The accuracy of these early methods based on the local amino acid
composition of single sequences was fairly low, with often less than 60% of
residues being predicted in the correct secondary structure state.
7.2.7 Threading
Threading is a method for fold recognition. Given a library of known
structures and a sequence of a query protein of unknown structure, does the
query protein share a folding pattern with any of the known structures?
Threading is a technique to match a sequence with a protein shape. It is based
on the observation that even proteins that have very low sequence identity
often have similar structures.
Threading may be used in the absence of any substantial sequence
identity to proteins of known structure, whereas, comparative modeling
requires protein structures that have substantial sequence similarity to the
protein sequence of interest. The sequence of interest is matched against a
database of known folds and the protein is assumed to have the same fold as
the best match.
Theoretical considerations indicate that the total number of possible
folds for proteins is limited. Hence it is possible to predict the structure of a
representative protein for each possible fold. The basic idea of threading is to
build many rough models of the query protein, based on each of the known
structures and using different possible alignments of the sequences of the
known and unknown proteins.
Threading approaches may be based on sequence information, structural
information or both. The two essential components of threading are: (a) finding
an optimal alignment (with gaps) of a sequence onto a structure and (b)
scoring different alignments and deciding on the best shape. Scoring may be
carried out by (i) mapping the structural information to create a profile for each
structural site, or (ii) using a potential based on pairwise interactions.
In general, the models based on pairwise interactions have greater
discriminatory ability. However, it is more difficult and more computationally
expensive to find an optimal alignment using a pairwise interaction potential.
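Profile-based scoring, option (i), can be illustrated with a deliberately simple sketch. All of the details are assumptions made for the example: each structural site is described only as buried ("b") or exposed ("e"), hydrophobic residues are rewarded at buried sites and other residues at exposed sites, and the alignment is ungapped. The fold names and profiles are hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")       # assumed residue classes

def site_score(residue, environment):
    """+1 if the residue type suits the site environment, otherwise -1."""
    prefers_buried = residue in HYDROPHOBIC
    return 1 if prefers_buried == (environment == "b") else -1

def thread_score(sequence, environments):
    """Best ungapped placement of the sequence along one fold profile."""
    offsets = range(len(environments) - len(sequence) + 1)
    scores = [sum(site_score(r, environments[o + i])
                  for i, r in enumerate(sequence)) for o in offsets]
    return max(scores) if scores else float("-inf")

def recognize_fold(sequence, library):
    """Return the best-scoring fold and all scores."""
    scores = {name: thread_score(sequence, envs) for name, envs in library.items()}
    return max(scores, key=scores.get), scores

library = {"fold_A": "bbeebb", "fold_B": "eeeeee"}
print(recognize_fold("LLKKLL", library))
```

A pairwise-interaction potential, option (ii), would score each residue according to its neighbours in the structure as well, which is why finding the optimal alignment becomes more expensive.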
7.2.8 Energy-based Prediction of Protein Structure
The essence of energy-based approaches is to compute the conformation-
dependent potential energy for different conformations; the conformation with
the lowest energy is assumed to be the structure of the molecule under
investigation. The form of the potential energy function is based upon the
known physics of interacting bodies. The potential energy function contains
terms corresponding to well understood interactions such as coulombic
interaction between charged bodies, terms for interaction between polarizable
atoms etc.
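A potential energy function of this kind can be sketched for non-bonded atom pairs. The functional form chosen here, a Coulomb term plus a Lennard-Jones term, is a common textbook choice and is an assumption of this example rather than a form given in the text; the parameter values are arbitrary reduced units.

```python
import math
from itertools import combinations

def pair_energy(r, q1, q2, epsilon=1.0, sigma=1.0, coulomb_k=1.0):
    """Coulomb plus Lennard-Jones energy of two atoms at distance r."""
    coulomb = coulomb_k * q1 * q2 / r
    lennard_jones = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    return coulomb + lennard_jones

def potential_energy(atoms, **params):
    """atoms: list of ((x, y, z), charge); sums over all atom pairs."""
    return sum(pair_energy(math.dist(p1, p2), q1, q2, **params)
               for (p1, q1), (p2, q2) in combinations(atoms, 2))

# Two opposite unit charges one length unit apart (toy conformation).
atoms = [((0.0, 0.0, 0.0), 1.0), ((1.0, 0.0, 0.0), -1.0)]
print(potential_energy(atoms))
```

Evaluating such a function for many conformations and keeping the one with the lowest value is the basic idea described above; real force fields add bonded terms and carefully fitted parameters.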
In the case of force fields that allow variable geometry, terms are included
for deviations from an assumed ‘ideal geometry’. The ideal geometries for
different residues are defined, based on examination of high-resolution
structures of model compounds. The parameters for the potential energy
function may be obtained from ab initio quantum mechanical calculations or
from thermodynamic, spectroscopic or crystallographic data or a combination
of these.
Ab initio attempts to locate the global energy minimum have been less
successful than the knowledge-based approaches. The reasons for this have
been: (i) the inaccuracy of existing energy functions and (ii) the computational
difficulty in searching for the global minimum.
The development of energy-based methods (in particular force fields based
fully on the physics of interacting bodies and capable of recognizing the native
structure as the lowest-energy one) would be a major step forward towards
understanding the role of particular interactions in the formation of protein
structure and the mechanisms of protein folding.
For practical reasons, a global minimum search of real-size proteins is
unfeasible at the all-atom level; therefore united-residue models of polypeptide
chains have received greater attention. After the global minimum is found at
the united-residue level, it can be converted to an all-atom representation,
followed by limited exploration of the conformational space in the
neighborhood of the converted structure. The whole approach is referred to as
the hierarchical approach to protein folding.
Domains
Certain proteins contain specific modules that mediate protein-protein
interactions. The identification of such domains in a particular protein can
provide clues about its interacting partners. For example, the presence of an
SH2 domain or a PTB domain in a protein indicates that it will bind to another
protein containing a phosphotyrosine residue.
The presence of the monomeric PDZ domain indicates that the protein might
interact either with another protein that contains a PDZ/LIM domain or with
the C-terminal region of membrane proteins. The presence of a Pleckstrin
homology domain in a protein indicates that it is likely to be involved in signal
transduction and that it might bind to the acid-rich regions of proteins involved
in signal transduction or to phosphoinositides.
In X-ray diffraction studies of crystals, the technique of molecular
replacement is used to obtain an initial set of phases. If a protein that shares
substantial sequence similarity with the protein of interest is available in the
database, then its structure may be used for building a model of the protein of
interest using comparative modeling.
The coordinates of the atoms in this structure can be used for calculating
the structure factors. The phase of the resulting structure factors and the
measured values of the magnitudes of the structure factors are then used for
calculation of a new electron density model. The resulting model can then be
subjected to Fourier or least-squares refinement.
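The calculation of a structure factor from atomic coordinates can be sketched as follows. The example uses the standard summation F(hkl) = Σ f_j exp[2πi(hx_j + ky_j + lz_j)] with fractional coordinates; as a simplification, each atom is given a constant scattering factor and thermal motion is ignored. The atom list is hypothetical.

```python
import cmath

def structure_factor(hkl, atoms):
    """atoms: list of (scattering factor, (x, y, z) fractional coordinates)."""
    h, k, l = hkl
    return sum(f * cmath.exp(2j * cmath.pi * (h * x + k * y + l * z))
               for f, (x, y, z) in atoms)

def amplitude_and_phase(F):
    """Magnitude and phase (radians) of a complex structure factor."""
    return abs(F), cmath.phase(F)

# Two identical atoms half a cell apart along x (toy model).
model = [(1.0, (0.0, 0.0, 0.0)), (1.0, (0.5, 0.0, 0.0))]
print(amplitude_and_phase(structure_factor((2, 0, 0), model)))
print(abs(structure_factor((1, 0, 0), model)))   # contributions cancel
```

In molecular replacement the phases calculated from the model in this way are combined with the measured magnitudes to compute the new electron density.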
Procedure
A. When we want to measure the bond length the following steps can be used:
1. Open RasMol and load a file of pdb atom coordinates (downloaded
from the PDB databank).
2. Use various menu options to get a feel of the molecule.
3. Open RasTop, the molecular visualization tool.
4. From the file menu, open a PDB atom coordinate file.
5. Rotate the molecule.
6. Use the options in the menu and command line.
7. Set the display style to ball and stick.
8. Zoom the molecule to visualize the bonds better (shift + mouse down).
9. Go to the command line window.
10. Type set picking distance and press Enter key.
11. Go to the display window and select the two atoms participating in the
bond formation by clicking on them successively.
12. The bond length will appear in the command line window.
13. Note down the results.
If we want to show a bond and measure the bond length between
two atoms, we can also use the following after going to the command
line window.
14. Type set picking monitor in the command line window.
15. Click on the atom again (only once). A bond line will appear.
16. Note down the results from the command line window.
B. When we want to measure the bond angle the following steps can be used:
1. Set the display style to ‘ball and stick’.
2. Zoom the molecule (shift + mouse down).
3. Go to the command line window.
4. Type set picking angle and press Enter key.
5. Go to the display window and select the three atoms forming the bond
angle by clicking on them successively.
6. The bond angle will appear in the command line window.
7. Note down the results.
C. When we want to measure the torsion angle the following steps can be used:
1. Set the display style to ‘ball and stick’.
2. Zoom the molecule (shift + mouse down).
3. Go to the command line window.
4. Type set picking torsion and press Enter key.
5. Go to the display window and select the four atoms forming the phi or psi
angle by clicking on them successively.
[The click sequence for phi is: carbonyl C of residue (i-1), N of residue
i, CA of residue i, and carbonyl C of residue i. The click sequence for
psi is: N of residue i, CA of residue i, carbonyl C of residue i, and
N of residue (i+1)].
6. After successive clickings the torsion angle will appear in the command
line window.
7. Note down the results
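The three measurements made in procedures A, B and C can also be computed directly from atom coordinates, which is a useful check on the values read from the command line window. The sketch below uses plain vector arithmetic; it is not RasMol or RasTop code, and the coordinates in the example are made up rather than taken from a PDB file.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def bond_length(p1, p2):
    """Distance between two atoms."""
    return norm(sub(p1, p2))

def bond_angle(p1, p2, p3):
    """Angle (degrees) at p2 between the bonds p2-p1 and p2-p3."""
    v1, v2 = sub(p1, p2), sub(p3, p2)
    cosine = dot(v1, v2) / (norm(v1) * norm(v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def torsion_angle(p1, p2, p3, p4):
    """Torsion (degrees) about the p2-p3 bond, in the range -180 to 180."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    b2_unit = tuple(x / norm(b2) for x in b2)
    return math.degrees(math.atan2(dot(b2_unit, cross(n1, n2)), dot(n1, n2)))

print(bond_length((0, 0, 0), (3, 4, 0)))
print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

For phi and psi, the four points are passed in the same order as the click sequences given in step 5 of procedure C.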
RasTop is available on Windows, Linux and Mac platforms. To
install, extract the RasTop folder from the RasTop.zip file and
install it in any directory. To start RasTop, double click on the RasTop
icon. It will display a single main window with one empty graphic
window, the color window and the command line window.
To view the molecule we have to load the correct file after choosing
the correct path. Then we can click molecule to select information
about the molecule. In the command line use the option Show to get
information about world, atom selection, group selection, chain
selection, coordinates, phi, psi, Ramprint, sequence, symmetry, etc.
In the main menu window, click the Atoms button, select Spacefill
and display; then click Atoms again, select labels and display. The RasMol
‘Spacefill’ command is used to represent all of the currently selected atoms as
solid spheres. This command is used to produce both union-of-
spheres and ball-and-stick models of a molecule. The following
command line forms are used in RasMol and RasTop: spacefill <boolean>,
spacefill temperature, spacefill user, spacefill [-] <value>.
To see the bonds, click Bonds, select Hbonds and display. In the
3D structure dotted lines represent Hbonds; after viewing, close the
bonds by clicking the Remove button. To display the loaded
protein in ribbon form (a smooth solid ribbon surface, passing
along the backbone of the protein) click ribbon and select ribbons,
which can be combined with other options such as strands, cartoons,
Trace and Backbone. After display click the Remove button. We can
learn more about RasTop by exploring ‘Help RasTop’.
Chime and Protein Explorer are derivatives of RasMol that allow visualization
inside web browsers. Hence, they can be used only online. Chime can be reached
at www.Umass.edu/microbio/Chime
MolMol stands for Molecule analysis and Molecule display. MolMol is a
molecular graphics program for display, analysis and manipulation of three-
dimensional structures of biological macromolecules with special emphasis on
NMR solution structures of proteins and nucleic acids. MolMol can be reached
at www.mol.biol.ethz.ch/wuthrich/software/molmol
Kinemage (kinetic images) allows the user to move two molecules, or
parts of a molecular complex, relative to each other. Molscript is a tool for
making cartoons of secondary structural elements. Grasp is used for
visualization of the surface. Swiss-PdbViewer produces high quality images
using ray tracing methods. Insight II is commercial software that also
supports hardware for interactive 3D viewing.
Some websites
ComputepI/MW : http://www.expasy.ch/tools/pi_tool.html
MOWSE : http://srs.hgmp.mrc.ac.uk/cgi-bin/mowse
PeptideMass : http://www.expasy.ch/tools/peptide-mass.html
TGREASE : http://ftp.virginia.edu/pub/fasta/
SAPS : http://www.isrec.isb-sib.ch/software/SAPS_form.html
AACompIdent : http://www.expasy.ch/tools.aacomp/
AACompsim : http://www.expasy.ch/tools/aacsim/
PROPSEARCH : http://www.embl-heidelberg.de/prs.html
BLOCKS : http://blocks.fhcrc.org
Pfam : http://www.sanger.ac.uk/software/Pfam/
PRINTS : http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html
ProfileScan : http://www.isrec.isb-sib.ch/software/PFSCAN-form.html
nnpredict : http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
PredictProtein : http://www.embl-heidelberg.de/predictprotein/
SOPMA : http://pbil.ibcp.fr/
Jpred : http://jura.ebi.ac.uk:8888/
PSIPRED : http://insulin.brunel.ac.uk/psipred
PREDATOR : http://www.embl-heidelberg.de/predator/predator_ifno.html
COILS : http://www.ch.embnet.org/software/COILS_form.html
PHDtopology : http://www.embl-heidelberg.de/predictprotein
SignalP : http://www.cbs.dtu.dk/services/signalP/
Tmpred : http://www.isrec.isb-sib.ch/ftp-server/tmpred/www/TMPRED_form.html
DALI : http://wwwz.ebi.ac.uk/dali/
SWISS-MODEL : http://www.expasy.ch/swissmod/SWISS-MODEL.html
TOPITS : http://www.embl-heidelberg.de/predictprotein/
STUDY QUESTIONS
1. What are the uses of prediction?
2. What are the strategies used in gene prediction?
3. How do we predict mRNA structure?
4. Give examples of some of the commonly used methods for gene
prediction.
5. What is the necessity to predict protein structures?
6. What is a Ramachandran plot? What are its uses?
7. How do we predict secondary structure?
8. Describe the intrinsic tendency of amino acids to form β-turns.
9. What is Rotamer Library?
10. Distinguish between ab initio and knowledge-based methods of
prediction.
11. How is comparative modeling done?
12. What are the steps involved in comparative modeling?
13. What is threading?
14. What is energy-based prediction?
15. How is protein function prediction done?
16. Give examples of some protein prediction programs.
17. What is molecular visualization? Give some examples of programs for
molecular visualization.
CHAPTER 8
Homology, Phylogeny and Evolutionary Trees
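[Figure: phylogenetic tree diagrams. Tip labels: Humans, Monkey, Dog, Horse, Donkey, Pig, Rabbit, Kangaroo, Pigeon, Chicken, Penguin, Turtle, Rattlesnake, Tuna, Candida, Bread mold. Group labels: Mammals, Vertebrates, Insects, Animals, Fungi. Further labels: ‘Rode’, ‘Man’, ‘a clade’, ‘Chimpanzee’.]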
Methods
Three methods – maximum parsimony, distance and maximum likelihood –
are generally used to find the evolutionary tree or trees that best account for the
observed variation in a group of sequences.
Distance Method
In distance matrix methods, all possible sequence alignments are carried out to
determine the most closely related sequences, and phylogenetic trees are
constructed on the basis of these distance measurements.
The distance method employs the number of changes between each pair
in a group of sequences to produce a phylogenetic tree of the group. The
sequence pairs that have the smallest number of sequence changes between
them are termed ‘neighbors’. On a tree, these sequences share a node or
common ancestor position and are each joined to that node by a branch.
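The first stage of a distance method, counting the changes between each pair of sequences and identifying the neighbors, can be sketched as follows. The example assumes the sequences are already aligned and of equal length, and simply counts mismatched positions without any correction for multiple substitutions. The sequences are hypothetical.

```python
from itertools import combinations

def count_differences(seq1, seq2):
    """Number of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq1, seq2) if a != b)

def distance_table(sequences):
    """sequences: name -> aligned sequence; returns pair -> distance."""
    return {(n1, n2): count_differences(sequences[n1], sequences[n2])
            for n1, n2 in combinations(sequences, 2)}

def neighbors(sequences):
    """The pair of sequences separated by the fewest changes."""
    table = distance_table(sequences)
    return min(table, key=table.get)

sequences = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TCGAACTA"}
print(distance_table(sequences))
print(neighbors(sequences))
```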
The goal of distance methods is to identify a tree that positions the
neighbors correctly and that also has branch lengths which reproduce the
original data as closely as possible. The success of distance methods depends
on the degree to which the distances among a set of sequences can be made
additive on a predicted evolutionary tree. The most commonly applied
distance-based methods are the unweighted pair group method with
arithmetic mean (UPGMA), neighbor-joining (NJ) and methods that optimize
the additivity of a distance tree, including the minimum evolution (ME)
method. Distance analysis programs in PHYLIP are FITCH, KITSCH and
NEIGHBOR.
[Figure: (a) rooted tree with tips Human, Chimpanzee and Gorilla and root labelled Primate; (b) rooted tree with tips Eukarya, Archaea and Bacteria and root labelled LCA.]
Fig. 8.3 Rooted trees for (a) three great apes with an unspecified primate ancestor, and
(b) the three major forms of life on this planet. Archaea were previously called the
archaebacteria. Bacteria were previously called eubacteria and by Eukarya we refer to the nuclear-
cytoplasmic system in eukaryotes (organelles are ignored). LCA is the last common ancestor
of all life on this planet. (Source: D.R. Westhead et al., Instant Notes: Bioinformatics, Bios
Scientific Publishers Ltd., 2003)
[Figure: tree with tips Echinoderms (Starfish), Cephalochordates (Amphioxus), Amphibians (Frog), Mammals (Human), Reptiles (Lizard), Birds (Chicken).]
Fig. 8.4 Phylogenetic tree of vertebrates and our closest relatives. Chordates, including
vertebrates, and echinoderms are all deuterostomes (Source: Lesk, A.M., Introduction to
Bioinformatics, Oxford University Press, 2003).
Special Features of Trees
Trees have some special features:
(i) Nodes are of two types – ancestral and terminal (leaves, tips). Ancestral
nodes may or may not correspond to a known species. Ancestral nodes
give rise to branches. They may link to other ancestral nodes, or they
may link to terminal nodes which represent known species. Terminal
nodes mark the end of the evolutionary pathway.
(ii) Trees may be rooted or unrooted. When the position of the ancestor is
indicated, the tree is called a rooted tree. When the position of the ancestor
is not indicated, it is called an unrooted tree.
(iii) Each tree is binary. Evolution of species is represented as a series of
bifurcations.
(iv) The length of the branches may or may not be significant.
Distance-based methods
Distance-based methods compute pairwise distances according to some
measure and then discard the actual data, using only the fixed distances to
derive trees. Character-based methods, in contrast, derive trees that optimize
the distribution of the actual data patterns for each character. Here the
pairwise distances are not fixed but are determined by the tree topology.
Distance-based methods use the amount of dissimilarity (distance)
between two aligned sequences to derive trees. A distance method would
reconstruct the true tree if all genetic divergence events were accurately
recorded in the sequence. However, divergence encounters an upper limit as
sequences become mutationally saturated. The Unweighted Pair Group Method
with Arithmetic Mean (UPGMA) is a clustering or phenetic algorithm. It joins
tree branches based on the criterion of greatest similarity among pairs and
averages of joined pairs.
The Neighbor Joining (NJ) algorithm is commonly applied with distance tree
building, regardless of the optimization criterion. The Fitch-Margoliash (FM)
method seeks to maximize the fit of the observed pairwise distances to a tree by
minimizing the squared deviation of all possible observed distances relative to
all possible path lengths on the tree. The Minimum Evolution (ME) method seeks
to find the shortest tree that is consistent with the path lengths measured in a
manner similar to FM.
Character-based Methods
The character-based methods use character data at all steps in the analysis.
This allows the assessment of the reliability of each base position in an
alignment on the basis of all other base positions. The principle of maximum
parsimony (MP) method is to search for a tree that requires the smallest
number of changes to explain the differences observed among the taxa under
study. The MP method defines an optimal tree as the one that postulates the
fewest mutations.
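Counting the changes a given tree requires can be sketched with Fitch's small-parsimony procedure, which is named here as one standard way of doing the count; the text does not specify an algorithm. Trees are written as nested pairs of leaf names, all changes cost one step, and the alignment is hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree: a leaf name (str) or a pair (left subtree, right subtree).
    states: leaf name -> character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                  # no change needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total number of changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
               for i in range(length))

alignment = {"a": "AAG", "b": "AAG", "c": "CAT", "d": "CAT"}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

The tree with the lower score is the more parsimonious; a full MP analysis repeats this scoring over many candidate topologies.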
The maximum likelihood (ML) method evaluates trees under an explicit model
of sequence change; in the simplest case, changes between all nucleotides (or
amino acids) are assumed to be equally probable. The ML method assigns
quantitative probabilities to mutational events rather than merely counting
them. For each possible tree topology, the assumed substitution rates are varied
to find the parameters that give the highest likelihood of producing the
observed sequences. The optimal tree is the one with the highest likelihood of
generating the observed data.
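The step of varying a parameter to maximize the likelihood can be illustrated
on the smallest possible case: two aligned sequences joined by a single branch.
The Python sketch below assumes the Jukes-Cantor model, in which all
nucleotide changes are equally probable; the sequences are hypothetical and the
grid search is only a stand-in for the numerical optimizers used by real ML
programs.

```python
import math


def jc69_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences separated by distance d
    (expected substitutions per site) under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific other base
    total = 0.0
    for x, y in zip(seq1, seq2):
        # 0.25 is the assumed equilibrium frequency of the starting base.
        total += math.log(0.25 * (p_same if x == y else p_diff))
    return total


# Hypothetical sequences: 20 sites, 2 of which differ.
seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTACGAACGTACCTACGT"

# Vary the parameter and keep the value with the highest likelihood.
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(seq1, seq2, d))

# Closed-form Jukes-Cantor distance for comparison (p = 2/20 = 0.1).
closed_form = -0.75 * math.log(1 - 4 * 0.1 / 3)
print(best_d, closed_form)
```

For a full tree the same principle applies, but the likelihood is summed over
all possible ancestral states at the internal nodes for every topology tested.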
Models
Phylogenetic tree-building methods presume particular evolutionary models.
Models inherent in phylogenetic methods have some important assumptions:
1. The sequence is correct and originates from a specified source.
2. The sequences are homologous (i.e. all are descended in some way from
a shared ancestral sequence).
3. Each position in a sequence alignment is homologous with every other
in that alignment.
4. Each of the multiple sequences included in a common analysis has a
common phylogenetic history with the others (e.g. there are no mixtures
of nuclear and organellar sequences).
5. The sampling of taxa is adequate to resolve the problem of interest.
6. Sequence variation among the samples is representative of the broader
group of interest.
7. The sequence variability in the sample contains phylogenetic signal
adequate to resolve the problem of interest.
(a) Similarity table
       a     b     c     d     e
a    100    65    50    50    50
b     65   100    50    50    50
c     50    50   100    97    65
d     50    50    97   100    65
e     50    50    65    65   100

(b) Distance table
       a     b     c     d     e
a      0     6    11    11    11
b      6     0    11    11    11
c     11    11     0     2     6
d     11    11     2     0     6
e     11    11     6     6     0

Fig. 8.5 Hypothetical (a) similarity table and (b) distance table for five organisms,
a-e. (Source: D.R. Westhead et al., Instant Notes: Bioinformatics, Bios Scientific Publishers
Ltd., 2003).
Procedure
The following steps can be followed to align sequences and construct a
phylogenetic tree using ClustalX:
1. Open ClustalX.
2. Load the sequences saved in FASTA format (from the Entrez session) using
the File menu. Click the yellow ClustalX logo, then click File > Load
Sequences. A dialogue box will appear. Give the correct path and open
the sequence file.
3. Scroll through the sequences to see how they match before alignment.
4. Go to the Alignment menu and click Do Complete Alignment.
5. Save the alignment files (*.dnd and *.aln).
6. Scroll again and see the matches by noting the symbol code and the
histogram.
7. Go to the Trees menu and select Draw N-J Tree. This creates a tree file
with the .ph extension, which can be opened with NJplot.
8. Save the resultant tree file (*.ph).
9. Close ClustalX.
10. Open NJplot.
11. Open the tree constructed using ClustalX (*.ph).
12. Observe the phylogenetic relationship between the sequences.
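The tree file saved in step 8 is plain text in the parenthesized (Newick or
PHYLIP) tree format, so it can also be inspected outside NJplot. The Python
sketch below lists the sequence names found in such a tree. The tree string,
its branch lengths and the helper name leaf_names are hypothetical; in practice
the string would be read from the saved *.ph file.

```python
import re


def leaf_names(newick_text):
    """Return the leaf (sequence) names found in a Newick tree string.

    A leaf name is taken to be a label that directly follows "(" or ","
    and is followed by ":" and a branch length.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)\s*:", newick_text)


# Hypothetical tree text with made-up branch lengths.
example_tree = "((Human_beta:0.01,Chimp_beta:0.01):0.05,Horse_beta:0.08);"
print(leaf_names(example_tree))
```

Labels that follow a closing parenthesis, such as bootstrap values on internal
nodes, are deliberately ignored by this pattern.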
[Figure: two trees. Upper tree: Human beta, Horse beta, Chimp alpha. Lower tree:
Human beta, Chimp beta, Horse beta, Human alpha, Chimp alpha, Horse alpha.]
Fig. 8.6 Two trees generated from hemoglobin sequences from human, chimpanzee and
horse. The lower tree is correct, indicating the correct phylogeny for both α and
β hemoglobin chains. The upper tree is confusing because it is formed from human and
horse β chains and the chimpanzee α chain, creating the impression that horse is closer
to human than chimpanzee (Source: D.R. Westhead et al., Instant Notes: Bioinformatics,
Bios Scientific Publishers Ltd., 2003)
Macromolecular Sequences
Different macromolecular sequences evolve at different rates, as do different
regions of the same molecule. Residues in an RNA or protein that
have a critical structural or functional role in the molecule can accommodate
mutations less easily than those in other regions. The rate at which a
particular sequence evolves depends largely on the proportion of residues
whose substitution would adversely affect normal structure and function.
Mitochondrial DNA
A useful macromolecular sequence for the study of primates is mitochondrial
DNA (mtDNA). As a consequence of respiratory metabolism, there is a higher
concentration of active oxygen species (such as superoxide and the hydroxyl
radical) in the mitochondria than in the nucleus and consequently a higher
chance of oxidative chemical lesions in mitochondrial DNA. Further, the
mtDNA polymerase is more error-prone than the nuclear enzyme. Therefore,
mtDNA evolves more quickly than nuclear DNA due to an increased intrinsic
mutation rate.
There is a short noncoding region in primate mtDNA where selective
constraints are low, since point mutations tend not to affect mitochondrial
function. This particular sequence evolves at a suitable rate to study primate
phylogeny. The tree in Fig. 8.3 is consistent with the alignment and clustering
of this region, and with such analyses of coding genes in mtDNA.
Ribosomal RNA
Ribosomal RNA (rRNA) is a highly conserved ubiquitous molecule in all
living organisms (animals, plants, fungi, bacteria, parasites, etc.). It has a low
tolerance for mutations and evolves very slowly. The abundant secondary
structure of rRNA ensures that the rate of evolutionary change is slow, since
compensating base changes are required in double helical regions. The tree in
Figure 8.7 is consistent with the alignment and clustering of this molecule and
the conclusions are compatible with those of other macromolecular studies.
Fig. 8.7 Major divisions of living things, derived by C. Woese on the basis of 15S RNA
sequences (Source: Lesk, A.M., Introduction to Bioinformatics, Oxford University Press)
STUDY QUESTIONS
1. What did Charles Darwin observe in the Galapagos finches?
2. How do you distinguish homology and similarity?
3. How do you distinguish ortholog, paralog and xenolog?
4. What are modules?
5. What is phylogeny?
6. What is phenetic approach?
7. What are the special features of cladistics?
8. What is a node?
9. What is a phylogenetic tree?
10. What are rooted and unrooted trees?
11. What are the special features of a phylogenetic tree?
12. What are the presumptions of phylogenetic tree-building?
13. What are the different methods used in phylogenetics?
14. How is molecular phylogenetics superior to traditional phylogenetics?
15. What are the databases used in phylogenetic analysis?
16. What is bootstrapping?
C H A P T E R
Approaches
Usually thousands of chemical compounds are tested for drug action, and perhaps
one out of 10,000 hits the target. In this type of approach, no one knows
initially which target the drug attacks or the mechanism involved in the attack.
The rational approach starts from clear knowledge of the target as well as the
mechanism by which it is to be attacked. Drug discovery involves finding the
target and arriving at the lead. The target is the causal agent of the disease,
and the lead is the active molecule that will interact with the causal agent.
When diseases are treated with drugs, the drugs interact with targets that
contribute to the disease and try to control their contribution, thus producing
positive effects. The disease target may be endogenous (a protein synthesized
by the individual to whom the drug is administered) or, in the case of
infectious diseases, may be produced by a pathogenic organism. Drugs act
either by stimulating or by blocking the activity of the target protein.
Types of Targets
The targets for drugs are usually biomolecules such as enzymes, receptors or
ion channels. The validity of an enzyme as a target depends upon how
important it is for the survival of the pathogen. If it is of little
significance, the target has no value. If the drug target is located inside the
human system, the fluctuation of the target activity must correspond to the
fluctuation of the disease severity. Only when we are able to establish a high
level of significance in the regulation of the target for effective disease
control will the target have relevance to the disease. Once the target is
confirmed, we can identify the modulators of the target. There are positive
modulators and negative modulators (Table 9.1).
Table 9.1: List of positive and negative modulators
Biomolecules    Positive modulators    Negative modulators
Enzymes         Activators             Inhibitors
Receptors       Agonists               Antagonists
Ion channels    Openers                Blockers
Validation
Once the target is identified, it has to be validated. This process is called target
validation. It involves extensive testing of the target molecule’s therapeutic
potential. Validation may include the creation of animal disease models, and
the analysis of gene and protein expression data. By comparing the levels of
gene expression in normal and disease states, novel drug targets can be
identified in silico. Microarray techniques can be used for this.
Drug Discovery and Pharmainformatics 9.3
Once a gene that is 'up or down regulated' (expressed at a higher or
lower level than in normal tissue) in a disease state is identified, its nature
can be determined using bioinformatic tools. Similar genes or proteins can be
traced in the sequence databases using BLAST. Similar genes and proteins help
to deduce the function of the up or down regulated gene. If the target happens
to belong to a highly tractable structural class (such as receptors, enzymes or
ion channels), drug design will be easier.
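The comparison of expression levels between normal and disease states can be
sketched as a log2 fold-change calculation. In the Python example below, the
gene names, the expression values and the two-fold threshold are all
hypothetical choices made for illustration; real microarray analysis also
involves normalization and statistical testing, which are omitted here.

```python
import math


def classify_regulation(normal, disease, threshold=1.0):
    """Label each gene as up, down or unchanged from its log2 fold change.

    threshold=1.0 corresponds to a two-fold change, an assumed cut-off.
    """
    labels = {}
    for gene in normal:
        log2_fc = math.log2(disease[gene] / normal[gene])
        if log2_fc >= threshold:
            labels[gene] = "up"
        elif log2_fc <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical expression values (arbitrary units).
normal = {"geneA": 100.0, "geneB": 80.0, "geneC": 50.0}
disease = {"geneA": 410.0, "geneB": 20.0, "geneC": 55.0}
labels = classify_regulation(normal, disease)
print(labels)
```

Genes labelled "up" or "down" in such a comparison would be the candidates
whose function is then examined by database searching.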
A valid target must have a high therapeutic index, that is, a significant
therapeutic gain must be predicted through the use of such a drug. If a known
protein is the target, binding can be measured directly. A potential anti-
bacterial drug can be tested by its effect on growth of the pathogen. Some
compounds might be tested for effects on eukaryotic cells grown in tissue
culture. If a laboratory animal is susceptible to the disease, compounds can be
tested on animals.
Characters
If the target happens to be an enzyme, the following characters are studied: the
active site, the amino acids involved in forming the active site, the presence
or absence of a metal component, the number of hydrogen donors and acceptors
present in the active site, the topology of the active site, and the details of
the hydrophobic and hydrophilic amino acids present in the active site.
If the target happens to be a biochemical substance or a substrate of an
enzyme, the following details are collected: size of the molecule, chemical
nature, groups that show hydrogen donor or acceptor capacity, its metabolic
byproducts and how this compound can be modified chemically.
Qualities
A lead molecule should have the following desirable qualities: (a) potency
(ability to modulate the target effectively), (b) solubility (it should dissolve
easily in water for quicker action), (c) moderate lipophilicity (ability to
penetrate the plasma membrane), (d) metabolic stability (it should not be
destroyed quickly inside the body; a longer half-life is desirable),
(e) bioavailability (quick absorption into the body together with retention for
a longer time for sustained activity), (f) specific protein binding, and
(g) low or no toxicity.
Finding Compounds
Lead compounds can be found using some of the following ways:
(i) Serendipity – through chance observations (discovery of penicillin by
Alexander Fleming).
(ii) Survey of natural sources – from traditional medicines (quinine from
Cinchona bark).
(iii) Study of what is known about substrates or ligands or inhibitors and
the mechanism of action of the target protein, and select potentially
active compounds from these properties.
(iv) Trying drugs effective against similar diseases
(v) Large-scale screening of related compounds
(vi) Occasionally from side effects of existing drugs.
(vii) Screening of thousands of compounds.
(viii) Computer screening and ab initio computer design.
Stages
Trials are divided into several stages:
Pre-clinical phase: Studies using animals
Phase I: Normal (healthy) human volunteers
Phase II: Evaluation of safety and efficacy in patients, and selection of dose regimen
Phase III: Large patient number study with placebo or comparator; at this stage regulatory approval is sought
and a commercial launch decision is taken
Phase IV: Long-term monitoring for adverse reactions reported by pharmacists and doctors.
Other Inputs
Drug development has benefited greatly from genomics, proteomics,
combinatorial chemistry and high-throughput screening. Genomics and
proteomics have revolutionized the way target molecules are identified and
validated. Traditionally, drug targets have been characterized on an
individual basis and lead compounds have been sought with specific clinical
effects.
With the advent of genomics, particularly the availability of the entire
human genome sequence and its annotations, thousands of potential new
targets can now be identified by sequence, structure and function.
Bioinformatics is important not only because of its role in the analysis of
sequences and structures, but also in the development of algorithms for the
modeling of target protein interactions with drug molecules. This allows
rational drug design, in which protein structural data is used to predict the
type of ligands that will interact with a given target, and thus form the basis of
lead discovery.
Of late, systematic methods are used to identify lead compounds. These
methods are based on high-throughput screening, in which lead discovery is
accelerated through the use of highly parallel assay formats, such as 96-well
plates. In turn, this requires the assembly of large chemical libraries for testing.
This has been made possible by combinatorial chemistry approaches, in
which large numbers of different compounds can be made by pooling and
dividing materials between reaction steps.
9.2 PHARMAINFORMATICS
The term pharmainformatics is often used to describe the mix of biology,
chemistry, mathematics and information technology required for data
processing and analysis in the pharmaceutical industry. The scope of
pharmainformatics is summarized in Table 9.2.
Table 9.2: Areas of biology and chemistry where informatics plays a vital role in the
drug discovery pipeline.
Area and activity                               Role of informatics

Biology
Genomics, proteomics                            Target identification, validation in the
(human genome project):                         human genome
characterization of human genes                 Cataloguing single nucleotide polymorphisms,
and proteins                                    and association with drug response
                                                patterns (pharmacogenomics)

Genomics, proteomics                            Target identification, validation in pathogens
(human pathogen genome projects):
characterization of the genes and
proteins of organisms that are
pathogenic to humans

Functional genomics (protein structure):        Prediction of drug/target interactions
analysis of protein structures                  Rational drug design
(human and their pathogens)

Functional genomics (expression profiling):     Gene classification based on drug responses
determining gene expression patterns            Pathway reconstruction
in disease and health

Functional genomics                             Databases of animal models
(genome-wide mutagenesis):                      Target identification, validation
determining the mutant phenotypes
for all genes in the genome

Functional genomics (protein interactions):     Characterization of protein interactions
determining interactions among all proteins     Reconstruction of pathways
                                                Prediction of binding sites

Chemistry
High-throughput screening:                      Storing, tracking and analyzing data
highly parallel assay formats
for lead identification

Combinatorial chemistry:                        Cataloguing chemical libraries
synthesis of large numbers of                   Assessing library quality, diversity
chemical compounds                              Predicting drug/target interactions
Tanimoto Coefficient
Usually library diversity is quantified using measures that compare the
properties of different molecules based on descriptors such as atomic position,
charge and potential to form different types of chemical bond. We can compare
two molecules using the Tanimoto coefficient (Tc), which evaluates the
similarity of fragments of each molecule.
The coefficient is calculated by the formula Tc = c/(a + b - c), where a is
the number of fragment-based descriptors in compound A, b is the number of
fragment-based descriptors in compound B, and c is the number of shared
fragment-based descriptors. Hence, for identical molecules, Tc = 1, while for
molecules with no descriptors in common, Tc = 0. In a chemical library of ideal
diversity, most pairwise comparisons would generate a Tanimoto coefficient
close to zero.
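The formula can be tried out with a short Python sketch in which each compound
is represented by a set of fragment descriptors. The descriptor labels below are
hypothetical placeholders rather than the output of any real fingerprinting
tool, and returning 0.0 for two empty descriptor sets is an assumed convention.

```python
def tanimoto(descriptors_a, descriptors_b):
    """Tanimoto coefficient Tc = c / (a + b - c) for two descriptor sets."""
    a = len(descriptors_a)
    b = len(descriptors_b)
    c = len(descriptors_a & descriptors_b)   # shared descriptors
    if a + b - c == 0:
        return 0.0   # assumed convention when both sets are empty
    return c / (a + b - c)


# Hypothetical fragment-based descriptors for two compounds.
compound_a = {"carbonyl", "hydroxyl", "amine", "phenyl"}
compound_b = {"carbonyl", "hydroxyl", "chloro"}

print(tanimoto(compound_a, compound_b))   # 2 / (4 + 3 - 2) = 0.4
print(tanimoto(compound_a, compound_a))   # identical molecules: 1.0
print(tanimoto(compound_a, {"nitro"}))    # nothing in common: 0.0
```

Averaging this value over all pairs in a library gives one simple indication of
how diverse the library is.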
Pharmacophore
When we do not know much about the binding specificity of the target protein,
diverse libraries will be useful for lead discovery. When only some form of
sequence or structural information is available for the target, this can be used
to design focused libraries that concentrate on one region of chemical space.
For example, if the sequence of a particular target protein is known, then
database homology searching will often find a related protein whose structure
has been solved and whose interactions with small molecules have been
characterized. In these cases, it is possible to design a chemical library based
on a particular molecular scaffold, which preserves a framework of sites present
in a known ligand, but which can be modified with diverse functional groups.
Some of these groups may have previously been shown to be important for
drug binding. Such sites are known as pharmacophores.
Tools
Many tools and resources are available for the design of combinatorial
libraries and the assessment of chemical diversity. A program called Selectors,
available from Tripos, allows the user to design very diverse libraries or
libraries focused on a particular molecular skeleton. Chem-X, developed by the
Oxford Molecular Group, allows the chemical diversity in a collection of
compounds to be measured and identifies all the pharmacophores.
CombiLibMaker, another Tripos program, allows a virtual target.
Docking Algorithms
One of the most established docking algorithms is AutoDock. Another widely
used program is DOCK, and a further one is CombiDOCK. In DOCK, the
arrangement of atoms at the binding site is converted into a set of spheres
called site points. The distances between the spheres are used to calculate the
exact dimensions of the binding site, and this is compared to a database of
chemical compounds. Matches between the binding site and a potential ligand
are given a confidence score, and ligands are then ranked according to their
total scores.
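The idea of comparing distances between site points with distances between
ligand atoms can be illustrated with a toy Python sketch. This is not the
matching or scoring procedure actually implemented in DOCK; the coordinates,
the mismatch measure and the function names are assumptions chosen only to
show how candidate ligands might be ranked by geometric fit.

```python
import math
from itertools import combinations


def internal_distances(points):
    """Sorted list of all pairwise distances between 3-D points."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))


def shape_mismatch(site_points, ligand_atoms):
    """Root-mean-square difference between the two sorted distance lists.

    Assumes both point sets contain the same number of points.
    """
    ds = internal_distances(site_points)
    dl = internal_distances(ligand_atoms)
    if len(ds) != len(dl):
        raise ValueError("point sets must have the same size")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ds, dl)) / len(ds))


# Hypothetical site points and two hypothetical candidate ligands.
site = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
ligands = {
    "ligand_1": [(1, 1, 1), (4, 1, 1), (1, 5, 1)],   # same shape, shifted
    "ligand_2": [(0, 0, 0), (1, 0, 0), (0, 1, 0)],   # much smaller
}

# Rank candidates from best (smallest mismatch) to worst.
ranking = sorted(ligands, key=lambda name: shape_mismatch(site, ligands[name]))
print(ranking)
```

Because only internal distances are compared, the measure is unaffected by
where the ligand sits in space, which is why the shifted copy scores best.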
In CombiDOCK, each potential ligand is considered as a scaffold
decorated with functional groups. Only spheres on the scaffold are initially
used in the docking prediction, and then individual functional groups are
tested using a variety of bond torsions. Finally, the structure is checked for
steric clashes (bumped) before a final score is presented.
Chemical databases can be screened not only with a binding site
(searching for complementary molecular interactions) but also with another
ligand (searching for identical molecular interactions). Several available
algorithms can compare two-dimensional or three-dimensional structures and
build a profile of similar molecules.
The three-dimensional (3D) structure of the target, determined by X-ray
crystallography or nuclear magnetic resonance (NMR) spectroscopy, is a
prerequisite for designing a compound that can bind or act on it. The compound
is chosen from an existing chemical compound library by combinatorial
structure docking. The lead compounds from the library are docked, that is,
tried for complementary fit onto the active site of the target molecule. This
initial in silico fitting reduces the number of compounds that have to be
synthesized and tested in vitro, since the databases contain the chemical
properties and methods of synthesis of the compounds.
In addition, there are a few other commercial docking and molecular
modeling software packages, which are described below:
Schroedinger
Schroedinger software is a suite of computational tools for computational
chemistry, docking, homology modeling, protein X-ray crystallography
refinement, bioinformatics, ADME prediction, cheminformatics, enterprise
informatics, pharmacophore searching, molecular simulation and quantum
mechanics, aimed at real-world problems in life science and molecular
chemistry research. Maestro is the unified interface for all Schroedinger
software. Its rendering capabilities, selection of analysis tools and
easy-to-use design make Maestro a versatile modeling environment for
researchers. It can be used to build, edit, run and analyse molecules.
The main components are OPLS-AA, MMFF, the GBSA solvent model,
conformational sampling, minimization and MD, together with the Maestro GUI,
which provides visualization, molecule building, calculation setup, job
launching and monitoring, project-level organization of results and access to a
suite of other modeling programs (http://www.schrodinger.com/).
Molsoft
Molsoft is a provider of tools, databases and consulting services in the
area of structure prediction, structural proteomics, bioinformatics,
cheminformatics, molecular visualization and animation, and rational drug
design. Molsoft offers solutions customized for biotechnology or
pharmaceutical companies in the areas of computational biology and chemistry.
The features of its software include a powerful global optimizer in an
arbitrary subset of internal variables, NOEs, protein docking, ligand docking,
peptide docking, EM and density placement (http://www.molsoft.com/).
Discovery Studio
Discovery Studio is a well-known suite of software for simulating small
molecule and macromolecule systems. It is developed and distributed by
Accelrys, a company that specializes in scientific software products covering
computational chemistry, computational biology, cheminformatics, molecular
simulations and quantum mechanics. It is typically used in the development
of novel therapeutic medicines, including small molecule drugs, therapeutic
antibodies, vaccines and synthetic enzymes, and even in areas such as consumer
products. It is used regularly in a range of academic and commercial entities,
but is most relevant to the pharmaceutical, biotech and consumer goods
industries.
The product suite has a strong academic collaboration programme
supporting scientific research, and makes use of a number of software
algorithms developed originally in the scientific community, including
CHARMM, MODELLER, DELPHI, ZDOCK, DMol3 and more
(http://accelrys.com/products/discovery-studio/).
VLifeMDS
VLifeMDS is a comprehensive and integrated software package for
computer-aided drug design and the molecular drug discovery process. This
integrated suite provides scientists with a complete toolkit and a flexible
architecture. VLifeMDS supports a structure-based design approach as well as a
ligand-based design approach, while the integration between the various
modules within VLifeMDS allows a hybrid approach for discovery projects.
With VLifeMDS, users can access features for multiple activities
within a discovery project. The main functions are active site analysis,
homology modeling, pharmacophore identification, conformer generation,
combinatorial library design, property visualization, docking, QSAR analysis,
database querying and virtual screening
(http://www.vlifesciences.com/products/VLifeMDS/Product_VLifeMDS.php).
Active Site Analysis
By studying the active site of the target molecule carefully, the lead compound
is built piece by piece using computer software. The surface of the target
molecule with which the lead interacts may have various chemical environments,
such as hydrophobic regions, hydrogen bonding regions or a catalytic zone.
Fragments of a hypothetical compound are placed in this field. The orientation
of the fragments provides a clue about the final form of the lead compound.
GRID, GREEN, HISTE, HINT and BUCKTS are some of the software packages
used for this kind of active site analysis. Sometimes the entire molecule is
fitted into the receptor site or active site. DOCK is a program that uses a
'shape fitting' approach (Figs. 9.1A-9.1D). It searches all possible ways of
fitting a ligand into the receptor site. The binding site of the receptor or
enzyme molecule contains hydrogen bonding regions and hydrophobic regions.
Fig. 9.1A. Wire frame view of the docking molecules RmID (Rv3266c) (enzyme) and 11za
(ligand) before docking as observed in the Hex window.
Fig. 9.1B. Wire frame view of the very close contact between RmID (Rv3266c) (enzyme)
and 11za (ligand) before docking as observed in the Hex window.
Fig. 9.1C. Harmonic surface view of the RmID (Rv3266c) (enzyme) and 11za (ligand)
after docking process is completed as observed in the Hex window.
Fig. 9.1D. The cartoon model of the RmID (Rv3266c) (enzyme) and 11za (ligand) complex
as observed in the Hex window.
A
List of Important Websites and Web Addresses
http://www.arabidopsis.org/portals/expression/microarray/microarraySoftwareV2.jsp
http://mbgd.genome.ad.jp/
http://www.genomesonline.org/cgi-bin/GOLD/index.cgi
http://microbialgenomics.energy.gov/databases.shtml
http://expasy.org/proteomics
https://wiki.nbic.nl/index.php/Proteomics_Tools
https://www.labkey.org/Project/home/CPAS/begin.view
http://www.genome.jp/kegg/pathway.html
http://www.ncbi.nlm.nih.gov/COG/
http://www.ncbi.nlm.nih.gov/unigene
http://pedant.gsf.de/
http://www.ebi.ac.uk/embl/
http://www.ebi.ac.uk/s4/summary/molecular?term=STRING&classification=7227&tid=gSynFBgn0003525
General Links:
http://www.oxfordjournals.org/nar/database/a/
http://biosharing.org/biodbcore
http://www.ufrgs.br/favet/bioquimica/bioinf/bioinf_links.htm
http://www.colorado.edu/chemistry/bioinfo/BioinformaticsLinks.htm
http://pbil.univ-lyon1.fr/bookmarks.html
http://mbcf.dfci.harvard.edu/cmsmbr/biotools/biotools1.html
http://www.imb-jena.de/~rake/Bioinformatics_WEB/proteins_purification.html
Databank Information
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+databanks+-id+19son1NJrIT
Glossary
15. Dayhoff, M.O. (Ed.), 1978. Atlas of Protein Sequence and Structure,
National Biomedical Research Foundation, Washington.
16. Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (Eds.), 1998.
Biological Sequence Analysis: Probabilistic Models of Proteins and
Nucleic Acids, Cambridge University Press, Cambridge.
17. Dwyer, R.A., 2003. Genomic Perl: From Bioinformatics Basics to
Working Code, Cambridge University Press, New York.
18. Eidhammer, I., et al., 2004. Protein Bioinformatics: Algorithmic
Approach to Sequence and Structure Analysis, John Wiley & Sons,
New York.
19. Ewens, W.J., 2004. Statistical Methods in Bioinformatics: An Introduction,
Springer Verlag, Berlin.
20. Felsenstein, J., 2004. Inferring Phylogenies, Sinauer, Sunderland, MA.
21. Gibas, C. and Jambeck, P. 2001. Developing Bioinformatics Computer
Skills, O'Reilly, Shroff Publishers and Distributors Pvt. Ltd., Mumbai.
22. Gibson, G. and Muse, S.V., 2002. A Primer of Genome Science,
Sinauer Associates Inc., Publishers, Sunderland.
23. Higgins, D. and Taylor, W. (Eds.), 2000. Bioinformatics: Sequence,
Structure and Databanks: A Practical Approach, Oxford University
Press, Oxford.
24. Hillis, D.M., Moritz, C. and Mable, B.K. (Eds.), 1996. Molecular
Systematics, Sinauer Associates Inc., Sunderland.
25. Jamison, C.D., 2004. Perl Programming for Bioinformatics and
Biologists, John Wiley & Sons, New York.
26. Pevsner, J., 2003. Bioinformatics and Functional Genomics,
John Wiley & Sons, New York.
27. Khan, I.A. and Khanum, A. (Eds.), 2002. Fundamentals of
Bioinformatics, Ukaaz Publications, Hyderabad.
28. Khan, I.A. and Khanum, A. (Eds.), 2003. Essentials of Bioinformatics,
Ukaaz Publications, Hyderabad.
29. Khan, I.A. and Khanum, A. (Eds.), 2003. Recent Advances in
Bioinformatics, Ukaaz Publications, Hyderabad.
30. Kinser, J., 2009. Python for Bioinformatics, Jones and Bartlett
Publishers, London.
31. Krane, D.E. and Raymer, M.L., 2003. Fundamental Concepts of
Bioinformatics, Pearson Education Singapore Pte. Ltd., Singapore.
32. Krawetz, S.A. and Womble, D.D. (Eds.), 2003. Introduction to
Bioinformatics: A Theoretical and Practical Approach, Humana Press,
Totowa.
33. Lacroix, Z. and Critchlow, T. (Eds.), 2003. Bioinformatics Managing
Scientific Data, Morgan Kaufmann.
34. Leach, A., 2001. Molecular Modeling, Prentice-Hall, London, England.
35. Lengauer, T. (Ed.), 2002. Bioinformatics: From Genomes to Drugs, John
Wiley & Sons, New York.
36. Lengauer, T. (Ed.), 2007. Bioinformatics: From Genomes to Therapies,
Vols 1, 2 and 3, Wiley-VCH Verlag GmbH & Co., Germany.
37. Banaszak, L.J., 2000. Foundations of Structural Biology, Academic Press,
New York.
38. Lesk, A.M., 2003. Introduction to Bioinformatics, Oxford University
Press, Oxford.
39. Alphey, L., 1997. DNA Sequencing: From Experimental Methods to
Bioinformatics, BIOS Scientific Publishers, Oxford.
40. Mani, K. and Vijayaraj, N., 2002. Bioinformatics - A Practical
Approach, Aparnaa Publications, Coimbatore.
41. Mani, K. and Vijayaraj, N., 2002. Bioinformatics for Beginners, (Ed.) D.
Padmanaban, Kalaikathir Achagam, Coimbatore.
42. Mishra, A., 2001. Bioinformatics and Human Genome, Authorspress
Publishers, Delhi, India.
43. Mount, D.W., 2004. Bioinformatics: Sequence and Genome Analysis,
2nd Ed., Cold Spring Harbor Laboratory Press, N.Y.
44. Mount, D.W., 2003. Bioinformatics, Sequence and Genome Analysis.
CBS.
45. Murthy, C.S.V., 2003. Bioinformatics, Himalaya Publishing House, New
Delhi.
46. Orengo, C., et al., 2003. Bioinformatics: Gene, Proteins and Computers,
BIOS Scientific Publishers, Oxford.
47. Pevzner, P., 2000. Computational Molecular Biology: An
Algorithmic Approach, The MIT Press, Cambridge, MA.
48. Racjard, D. (Ed.), 1997. Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids, Cambridge University Press,
Cambridge.
49. Rajadurai, M., 2010. Bioinformatics: A Practical Manual, PBS Book
Enterprises, Chennai.
50. Rashidi, H.H. and Buehler, L.K., 2000. Bioinformatics Basics:
Applications in Biological Science and Medicine, CRC Press, Florida,
USA.
51. Roy, D., 2009. Bioinformatics. Narosa Publishing House, New Delhi.
52. Schomburg, D. and Lessel, U. (Eds.), 1995. Bioinformatics: From
Nucleic Acids and Proteins to Cell Metabolism, VCH.
53. Misener, S. and Krawetz, S.A. (Eds.), 2001. Bioinformatics
Methods and Protocols, Humana Press, Totowa.
54. Sundararajan, S. and Balaji, R., 2002. Introduction to Bioinformatics,
Himalaya Publishing House, New Delhi.
55. Creighton, T.E., 1992. Proteins: Structures and Molecular Properties, 2nd
Ed., Freeman.
56. Tisdall, J.D., 2001. Beginning Perl for Bioinformatics, O'Reilly
Publishers.
57. Tisdall, J.D., 2001. Mastering Perl for Bioinformatics, O'Reilly
Publishers.
58. Waterman, M.S., 1995. Introduction to Computational Biology: Maps,
Sequences and Genomes, Chapman and Hall, London.
59. Westhead, D.R., Parish, J.H. and Twyman, R.M., 2003. Instant Notes:
Bioinformatics, BIOS Scientific Publishers Ltd., Oxford, UK.
60. Wilkins, M.R. (Ed.), 1997. Proteome Research: New Frontiers in
Functional Genomics, Springer-Verlag, Berlin.
61. Yap, T.K., Frieder, O. and Martino, R.L., 1996. High Performance
Computational Methods for Biological Sequence Analysis, Kluwer
Academic, Norwell.
Index