Bioinformatics
Secondary article
Article Contents
David A Adler, ZymoGenetics Inc. and Department of Pathology, University of Washington,
Seattle, Washington, USA
• Introduction
• Scope of Bioinformatics
• Hardware
• Software
• Mapping and Linkage Analysis
• Biosequence Analysis
• Conclusion
Introduction
Advances in classical as well as modern biology have often
been achieved by individuals presenting novel perspectives
on previously available observational information. The
elucidation of the genetic code following the publication of
Watson and Crick's model for the structure and replication
of DNA, along with the subsequent codification of the
central dogma of molecular biology (DNA is transcribed
into RNA, which in turn is translated into protein),
exemplifies the concept of biomolecules as information
carriers. This view leads naturally to the application of
computational approaches to the analysis of DNA and
protein sequence. In addition, the development of
high-throughput technologies for generating biological and
biochemical data has contributed to a data explosion,
thereby increasing the difficulty of simply examining all
data pertinent to a biological question. The need to
retrieve, organize and digest very large databases requires
the development of computational tools for data
interaction and analysis. Bioinformatics is a discipline at the
intersection of computer science, information technology,
mathematics and biology and includes the study and
practice of archiving, searching, displaying, manipulating
and modelling biological data. Bioinformatics research
and development not only provides discovery tools for
other biologists but also makes direct intellectual
contributions to biology and medicine.
Bioinformatics is alternatively referred to as biocomputing
or computational biology, the choice of term depending on
the focus of activity. The practitioner may have a
background emphasizing any of the component fields of
study, and it is only recently that colleges and universities
have developed interdisciplinary programmes with the
goal of training bioinformatics professionals. An essential
part of the infrastructure of bioinformatics is a
communications medium with fast data transfer rates and high
traffic capacity to provide almost simultaneous information
access to thousands of people. In the late twentieth
century the internet became that medium, and the principle
Scope of Bioinformatics
Bioinformatics encompasses the study of a broad range
of biological data including gene maps, gene and
protein sequences and gene expression profiles. A
primary goal of this data analysis is to unravel
the information content of biomolecules
and to understand how bioinformation directs the
development and function of living organisms. The
analysis of nucleic acid sequence, protein structure/
function relationships, genome organization, regulation
of gene expression, interaction of proteins and mechanisms
of physiological functions can all benefit from a
bioinformatics approach. Nucleic acid and protein sequence data
from many different species and from population
samplings provide a foundation for studies leading to new
understandings of evolution and the natural history of
humans.
ENCYCLOPEDIA OF LIFE SCIENCES / © 2001 Macmillan Publishers Ltd, Nature Publishing Group / www.els.net
Hardware
The computer is the basic tool of bioinformatics, utilized to
store, display and analyse data, and to design and construct
scientific models and simulations. Computer hardware
requirements are commonly dictated by the tasks needing
to be accomplished, the software available to do the job,
the computational intensity of the process and the degree
of interactivity desired. Modern personal computers have
higher performance than the supercomputers of two
decades ago, so that sophisticated programs can be run and
complex, interactive software is available on the desktop. For jobs
that require more computational capacity, such as rapid
searches of very large databases, protein modelling,
three-dimensional display and simulating the interaction of large
molecules, it may be necessary to employ current
supercomputer-class machines, which provide high performance by
harnessing multiple central processing units (CPUs).
Hardware solutions (algorithms in silicon) designed to
perform a single computational task, such as extremely fast
searching, have also been developed. However, the high
cost of custom hardware for specific computational tasks
has limited its widespread application.
Software
Software for bioinformatics is as task-driven as the
hardware. Data descriptions, the types of searches and
analysis, how one needs to interact with the computer
(interface), and how the results are presented all
determine the choice of software for the task at hand.
Programs for the analysis of genetic and physical mapping
data, and for drawing pedigrees and evolutionary trees, are
available from both commercial and academic sources.
Sequence analysis suites generally include programs for
assembling sequences, pattern or string searching,
restriction analysis, motif identification, base or amino acid
composition analysis and protein characterization. There
are also individual programs for particular tasks such as
multiple sequence alignment, for example, Clustal W (see
Table 1), and for similarity searches of databases, such as
BLAST (Table 1). Institutions and schools have often
obtained site licences for software packages, which are
then made accessible for use on networked desktop
computers. Software for a wide variety of computational
tasks that has been developed at academic institutions
is often freely available for download via the internet.
Unfortunately, sites come and go on the internet, so it is
difficult to maintain lists of resources with associated links.
Instead of presenting a comprehensive list that will become
obsolete almost instantly, this article refers the reader to several
stable, well-maintained sites as starting points for finding
biology-related software and documentation on the
internet (Table 1).
A web browser has become another necessary
bioinformatics tool, since the web medium is often the easiest
means of accessing data from remote networked
databases. Particularly for very rapidly growing map and
sequence databases, it is not practical or appropriate to try
to maintain a local copy of the data. To ensure the accuracy
and timeliness of information from databases that are
updated daily, it is necessary to be able to access those sites
directly. The search interfaces provided for interacting
with the major data repositories are powerful and fast,
delivering responses within seconds. Network server
software often reports results in hypertext format, facilitating
the further investigation of details and related information
from other databases. Regardless of the particular
software one chooses for a task, it is important to know the
program well enough to use it efficiently, to maximize its
utility and to evaluate the significance of computational
results.
Table 1

Resource                                                        Description                                     URL
Clustal                                                         -                                               http://bioinformer.ebi.ac.uk/newsletter/archives/2/clustalw17.html
BLAST server, NCBI (USA)                                        -                                               http://www.ncbi.nlm.nih.gov/blast
BLAST server, EMBL (Germany)                                    -                                               http://dove.embl-heidelberg.de/Blast2
NIH software                                                    -                                               http://molbio.info.nih.gov/molbio/software.html
CMS Molecular Biology Resource                                  -                                               http://www.sdsc.edu/ResTools/cmshp.html
EXPASY SIB                                                      -                                               http://www.expasy.ch/
Weizmann Institute of Science                                   -                                               http://bioinformatics.weizmann.ac.il/mb/software.html
Biology Department, Indiana University                          -                                               http://www.bio.indiana.edu/generalinfo/bioresearch.html
The Laboratory of Statistical Genetics, Rockefeller University  Genetic mapping background and resources        http://linkage.rockefeller.edu
WIBR Mapmaker                                                   Mapping software distribution                   http://waldo.wi.mit.edu/ftp/distribution/software/mapmaker3
Centre d'Etude du Polymorphisme Humain (CEPH)                   -                                               http://www.cephb.fr
Mouse Genome Database, Jackson Laboratory                       -                                               http://www.informatics.jax.org
EUCIB                                                           -                                               http://www.hgmp.mrc.ac.uk/MBx/MBxHomepage.html
Radiation Mapping (EBI, Stanford)                               -                                               http://www.ebi.ac.uk/RHdb; http://waldo.wi.mit.edu/ftp/distribution/software/rhmapper
Genome Database                                                 Human gene mapping database                     http://www.gdb.org
OMIM                                                            Catalogue of human genes and genetic disorders  http://www.ncbi.nlm.nih.gov/omim
PDB                                                             Protein Data Bank protein structure resource    http://www.rcsb.org/pdb
MapManager                                                      Software suite for genetic mapping projects     http://mcbio.med.buffalo.edu/mapmgr.html
translating DNA into protein, assembling partially overlapping fragments, analysing sequences, comparing sequences, and DNA motif discovery and recognition.
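As an illustration of the first of these tasks, the following minimal Python sketch translates a DNA coding sequence into protein using the standard genetic code. The function name `translate`, the choice to read only the first reading frame, and the use of `X` for unrecognized codons are assumptions made for this example and are not taken from any particular software package.

```python
from itertools import product

# Standard genetic code, written with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA sequence in reading frame 1; '*' marks a stop codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTGA"))
```

A fuller tool would typically translate all six reading frames and support alternative genetic codes; this sketch does neither.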
Current DNA sequencing technologies are not capable of
generating complete sequence for long nucleic acid
molecules in a single sequencing run and so it is necessary
to utilize computational methods to assemble contiguous
sequences from individual short sequence determinations.
If a large DNA molecule is randomly broken into smaller
pieces for the actual sequence determinations then a
contiguous linear sequence can be reconstructed by
aligning the overlapping portions from different random
fragments.
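The idea of reconstructing a contiguous sequence from overlapping fragments can be sketched with a simple greedy merge, shown below in Python. This is an illustrative toy under strong assumptions (error-free reads, a single strand, no repeats, no fragment wholly contained in another); the function names and the `min_len` threshold are hypothetical, and production assemblers use considerably more elaborate methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


print(greedy_assemble(["ATGCGTAC", "GTACGGA", "GGATTC"]))
```

If the fragments do not all overlap, the function returns several contigs rather than one, which mirrors the gaps that remain when coverage of the original molecule is incomplete.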
A common question arising when new genes are cloned
and sequenced is whether the sequence is already known or
does not occur in current databases. Answering this
question requires comparing the newly obtained sequence
to every sequence in the database. The algorithm of choice
for this task is the extremely rapid BLASTN algorithm
(Altschul et al., 1990). A list of all W-mers (contiguous
fragments of length W, which is typically set between 11
and 16) in the query sequence is first compiled and then
every sequence in the database is in turn checked against
this list. This can be done rapidly and serves to rule out
most sequences from consideration. In the remaining
sequences, the regions that match a W-mer are then
extended in either direction, using less stringent matching,
to form HSPs (high-scoring segment pairs). The
expectation value of the HSP (the number of HSPs of a
similar or higher score expected to occur by chance between random sequences) is
computed and all database sequences having significant
HSPs are reported. Overall database access time by
BLASTN is minimized by using a compressed form of
the nucleotide data and by using a memory-mapped file. It
is an algorithm highly amenable to parallelism and can be
compiled to run on multiprocessor hardware.
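The seed-and-extend strategy described above can be sketched in a few lines of Python. This is a simplified illustration and not the BLASTN implementation: the function names are hypothetical, the scoring values (`match`, `mismatch`), the drop-off threshold `x_drop` and the small word size used in the example are assumptions chosen for readability, and no expectation value is computed.

```python
from collections import defaultdict


def index_wmers(query: str, w: int) -> dict[str, list[int]]:
    """Map each W-mer of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def extend_hit(query, subject, qi, si, w, match=1, mismatch=-3, x_drop=5):
    """Extend an exact W-mer hit in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the best segment.
    """
    best = w * match
    reaches = []
    # step=+1 extends to the right of the seed, step=-1 to the left.
    for step, q, s in ((1, qi + w, si + w), (-1, qi - 1, si - 1)):
        score = gained = reach = k = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            k += 1
            if score > gained:
                gained, reach = score, k
            elif gained - score > x_drop:
                break  # score has fallen too far below the best seen
            q += step
            s += step
        best += gained
        reaches.append(reach)
    right, left = reaches
    return best, qi - left, si - left, w + left + right


def find_hsps(query, subject, w=4, min_score=6):
    """Report distinct high-scoring segment pairs between query and subject."""
    table = index_wmers(query, w)
    hsps = set()
    for si in range(len(subject) - w + 1):
        for qi in table.get(subject[si:si + w], ()):
            hsp = extend_hit(query, subject, qi, si, w)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


print(find_hsps("GATTACA", "CCGATTACACC", w=4))
```

Because the query index is built once and each database sequence is scanned independently, the outer loop over database sequences is the natural place to distribute work across processors.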
Biosequence Analysis
Development of the technologies to determine the linear
sequence of amino acids in proteins and of nucleotides in
DNA and RNA has created a corresponding need to compile
and analyse sequence data. Sequence analysis is the
process of investigating the information content of raw
linear nucleic acid and protein sequence data.
Comparative modelling
Through the ages the human genome has been the target of
major evolutionary processes such as gene duplication,
gene fusion, gene rearrangement and gene deletion. The
individual gene has been subjected to the more subtle
process of base mutations that often change the protein
sequence of the gene product. Genes have evolved
substantially while still preserving the three-dimensional
Sequence motifs
Even if a protein family is divergent, it may be possible
to identify short regions that appear to have conserved
sequences and therefore locally conserved structure and
perhaps biochemical function. Each region can be
described using a motif that states, for each position,
the allowed variation in possible amino acids using a
distinct score for each. Dynamic programming algorithms
are used to align motifs to sequences. Motifs can be
viewed as compact expressions for a protein family, an
alternative to representing the family as a list of its
members. Furthermore, a match to a motif may not
be statistically significant but may be biologically
significant because higher scores may not occur when
applying the motif to other protein families. If a
discriminating motif matches a sequence of unknown
structure it can be inferred that the sequence has the same
protein fold as the family.
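A motif of this kind can be represented as one table of scores per position. The Python sketch below scans a sequence for the best placement of such a motif. It is deliberately simpler than the dynamic programming alignment mentioned above, since it allows no insertions or deletions; the three-position motif, its scores and the `default` penalty for unlisted residues are invented for illustration only.

```python
Motif = list[dict[str, float]]


def score_window(motif: Motif, window: str, default: float = -4.0) -> float:
    """Sum the position-specific scores of the residues in one window."""
    return sum(column.get(aa, default) for column, aa in zip(motif, window))


def best_match(motif: Motif, sequence: str, default: float = -4.0):
    """Return (score, start) of the highest-scoring ungapped placement."""
    width = len(motif)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        (score_window(motif, sequence[i:i + width], default), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical motif: a cysteine, then a small residue, then a basic residue.
example_motif = [
    {"C": 5.0},
    {"A": 2.0, "G": 3.0, "S": 2.0},
    {"H": 5.0, "K": 1.0},
]
print(best_match(example_motif, "MKCGHLL"))
```

Whether a score is high enough to be meaningful depends on how the same motif scores against unrelated sequences, which is the discrimination issue raised in the text.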
The computational techniques used to create motifs
fall into four classes. The standard technique creates
the motif using variation observed in columns of a multiple
sequence alignment of the family. There are several ways
to compute motifs from a multiple alignment (Gribskov
et al., 1987; Tatusov et al., 1994). Other techniques are
machine learning algorithms that attempt to create motifs
without requiring a multiple alignment of the family
(Brazma et al., 1998). The hidden Markov model
techniques (Krogh et al., 1994) try to fit available data to
a sequence of probability distributions using a local
optimization algorithm. Finally, some techniques are
iterative algorithms, which generalize a motif by repeated
searches of a sequence database (Tatusov et al., 1994) using
the evolving motif.
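The first of these approaches, deriving a motif from the columns of a multiple alignment, can be illustrated with a basic log-odds calculation. The Python sketch below is a generic textbook-style scheme and is not the specific method of any of the cited papers; the uniform background frequency and the `pseudocount` value are simplifying assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_from_alignment(alignment: list[str], pseudocount: float = 1.0):
    """Build one dictionary of log-odds scores per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in range(length):
        # Gap characters and other non-residue symbols are ignored.
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


profile = profile_from_alignment(["CAH", "CGH", "CSK"])
print(round(profile[0]["C"], 3), round(profile[0]["A"], 3))
```

Residues seen often in a column receive positive scores and unseen residues receive negative ones; the pseudocount keeps unseen residues from scoring minus infinity when the family sample is small.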
Fold recognition
Often a new protein sequence contains no recognizable
motifs, nor can its structure be inferred by comparative
modelling. In such cases, one can resort to fold recognition
approaches. The task of fold recognition is easily defined
but notoriously difficult to solve: for a given sequence,
determine which, if any, structures in the PDB are
compatible with the sequence.
Because the function of a protein is determined by
its three-dimensional structure, mutations causing
amino acid changes that grossly alter the structure of
the protein will usually inactivate the protein function and
will be selected against by evolution. It is for this reason
that, despite the vast space of protein sequences explored
by evolution over the ages, there probably exist only
several thousand unique protein topologies (Hubbard
et al., 1992). As the PDB continues to expand with new
solved protein structures, the chance that a new gene
product folds like a known structure will continue to
increase.
Fold recognition by threading is a new, powerful
technique. Threading methods are based upon the
assumptions that protein structures are in a state of
minimum free energy, and that this energy can be roughly
computed for any given structure. The energy computation
takes into account the compatibility of different amino
acids at each position in the structure. This compatibility
usually reflects the preference of hydrophobic amino acids
in the core environment of the protein, and the potential
energy created when two amino acids are spatially close to
one another.
Given a function that can evaluate the compatibility of
a sequence with a structural template whose native
sequence has been removed, threading algorithms
attempt to minimize this function by considering various
possible sequence to structure alignments. The threading
task is enormously complex since exponentially many
(as a function of sequence and structure sizes) alignments
are possible, and the presence of arbitrarily many
pairwise interactions in a protein structure precludes
the use of dynamic programming alignment algorithms
to produce optimal solutions. There are two interesting
heuristic algorithms for obtaining at least a feasible
solution in the face of this complexity. One is the
approach of Jones et al. (1992) which uses a variant
of the standard dynamic programming algorithm.
Another is the statistical sampling approach of Madej
et al. (1995), which iteratively modifies a working
alignment until a local minimum is reached. Both approaches
have had some success in predicting the fold of unknown
proteins, although low selectivity (proteins of different
structure appearing to be compatible) continues to be an
issue.
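To make the notion of a sequence-to-structure compatibility function concrete, the Python sketch below scores a sequence against a toy structural template and picks the lowest-energy placement. It is neither the method of Jones et al. nor that of Madej et al.: gaps are not allowed, so only a handful of placements are enumerated, and the hydrophobic residue set, the energy values and the template description (`buried`, `contacts`) are all invented for illustration.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumption: a crude hydrophobic class


def threading_energy(residues: str, buried: list[bool], contacts) -> float:
    """Toy energy of placing residues on a template; lower is better.

    buried[i] is True if template position i lies in the protein core.
    contacts lists pairs (i, j) of template positions that are spatially close.
    """
    energy = 0.0
    for aa, core in zip(residues, buried):
        # Favour hydrophobic residues in the core and polar ones on the surface.
        energy += -1.0 if (aa in HYDROPHOBIC) == core else 1.0
    for i, j in contacts:
        # Favour contacts between two hydrophobic residues.
        if residues[i] in HYDROPHOBIC and residues[j] in HYDROPHOBIC:
            energy -= 0.5
    return energy


def best_threading(sequence: str, buried: list[bool], contacts):
    """Return (energy, offset) of the best ungapped placement on the template."""
    n = len(buried)
    if len(sequence) < n:
        raise ValueError("sequence is shorter than the template")
    return min(
        (threading_energy(sequence[o:o + n], buried, contacts), o)
        for o in range(len(sequence) - n + 1)
    )


template_buried = [False, True, True, False]
template_contacts = [(1, 2)]
print(best_threading("KDLVKE", template_buried, template_contacts))
```

Once gaps are permitted, the number of possible placements grows exponentially and the pairwise contact term couples distant positions, which is why the heuristic search strategies described above are needed.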
Conclusion
As complete genome sequences become available and
many more protein structures are solved, new challenges
for bioinformatics are appearing. Investigators are just
beginning to address the questions of living organisms as
dynamic systems and these explorations will once again
expand the scope of bioinformatics. Advances in the
various arenas of bioinformatics hold the promise of
revolutionizing biological understanding and thereby
contributing to progress in preventing and treating disease.
References
Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990)
BLAST Basic Local Alignment Search Tool. Journal of Molecular
Biology 215: 403–410.
Brazma A, Jonassen I, Eidhammer I and Gilbert D (1998) Approaches to
the automatic discovery of patterns in biosequences. Journal of
Computational Biology 5(2): 279–305.
Chou KC (1995) A novel approach to predicting protein structural
classes in a (20-1)-D amino acid composition space. Proteins 21(4):
319–344.
Foote S, Vollrath D, Hilton A and Page DC (1992) The human Y
chromosome: overlapping DNA clones spanning the euchromatic
region. Science 258: 60–66.
Gribskov M, McLachlan AD and Eisenberg D (1987) Profile analysis:
detection of distantly related proteins. Proceedings of the National
Academy of Sciences of the USA 84(13): 4355–4358.
Hubbard TJ, Ailey B, Brenner SE et al. (1992) SCOP: a structural
classification of proteins database. Nucleic Acids Research 27:
254–256.
Jones DT, Taylor WR and Thornton JM (1992) A new approach to
protein fold recognition. Nature 358: 86–89.
Krogh A, Brown M, Mian IS, Sjölander K and Haussler D (1994)
Hidden Markov models in computational biology. Applications to
protein modeling. Journal of Molecular Biology 235: 1501–1531.
Further Reading
Baxevanis A and Ouellette BFF (eds) (1998) Bioinformatics: A Practical
Guide to the Analysis of Genes and Proteins. Chichester: John Wiley
and Sons.
Lesk AM (ed.) (1988) Computational Molecular Biology, Sources and
Methods for Sequence Analysis. Oxford: Oxford University Press.
Gribskov M and Devereux J (eds) (1991) Sequence Analysis Primer. New
York: Stockton Press.
Schuler GD, Boguski MS, Stewart EA et al. (1996) A gene map of the
human genome. Science 274: 540–546.
[http://www.ncbi.nlm.nih.gov/genemap/]
Smith CM (1997) The CMS Molecular Biology Resource: Bio-Web
resources organized by analytical function. Trends in Genetics 13: 416.
[(1998) MolyBio. Science 281: 139]
Vogel F and Motulsky AG (1997) Human Genetics: Problems and Approaches.
Berlin: Springer.
von Heijne G (1987) Sequence Analysis in Molecular Biology, Treasure
Trove or Trivial Pursuit. San Diego: Academic Press.